CN108920456B - Automatic keyword extraction method


Info

Publication number: CN108920456B (application CN201810611476.7A)
Authority: CN (China)
Other versions: CN108920456A (in Chinese, zh)
Inventors: 吕学强, 董志安
Assignee (original and current): Beijing Information Science and Technology University
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods


Abstract

The invention relates to an automatic keyword extraction method comprising the following steps: extracting common words in the technical standard; extracting candidate keywords; filtering the common words from the candidate keywords; calculating a candidate keyword weight score by integrating position features, word co-occurrence features and context semantic features; calculating a dynamic threshold from the range of candidate keyword weight scores; and determining result keywords using the dynamic threshold. The method extracts keywords by fusing the position feature, the word co-occurrence feature and the context semantic feature, and comprehensively considers the weight influence of in-document position and context semantics on keywords. It thereby achieves higher accuracy and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and well meets the requirements of practical application.

Description

Automatic keyword extraction method
Technical Field
The invention belongs to the technical field of automatic extraction of keywords, and particularly relates to an automatic extraction method of keywords facing to a 3GPP technical standard.
Background
The explosive development of mobile communication technology has brought epoch-making changes to human society. As the maker of leading technology standards in the communications field, the 3rd Generation Partnership Project (3GPP) is dedicated to formulating 3G standards based on the evolved Global System for Mobile Communications (GSM) core network, including WCDMA, TD-SCDMA, EDGE, etc.
In recent years there have been many patent infringement lawsuits among large communication technology companies, and the stability of invention patent rights has faced unprecedented challenges. The 3GPP technical standards play an irreplaceable and important role in the examination of communication patents.
The 3GPP technical standard is a scientific non-patent document specific to the patent examination work in the communication field, and is usually used as a comparison document to measure the creativity and novelty of the patent application in the communication field.
A typical 3GPP technical standard comprises: a Cover that mainly includes the standard number, release number, document title and version number information; a Foreword that explains the version number; a Scope that declares the scope of application; a Reference part that gives the reference list; Definitions and Abbreviations parts that list the important definitions and abbreviations of the document; a main body that specifically introduces the technical background and details; and an Annex that mainly records the version change history.
In addition, there is a mutual citation association between the 3GPP technical standard and the patent literature, and the differences from the patent literature are shown in table 1.
Table 1 Differences between patent documents and the 3GPP technical standard
(Table 1 appears as an image in the original document.)
As can be seen from Table 1, the 3GPP technical standard has its own unique organization and types. Of major interest in actual patent examination are Technical Specifications (TS), Technical Reports (TR) and conference files. Technical specifications and technical reports together describe the relevant regulations, principles, simulation and experimental results of a technology, while conference files mainly record the specific meeting information of each working group. Among these, technical specifications and technical reports are similar in content format, carry richer core technical information, and hold greater mining value.
In actual patent examination, the retrieval of 3GPP technical standards is mainly based on keywords manually selected by examiners. The quality of the retrieval results often depends on the quality of the keywords, and the traditional manner is time-consuming and labor-intensive and makes it difficult to guarantee the hit rate of the comparison documents. Compared with patent documents, the 3GPP technical standards have wide coverage, a large amount of information, irregular formats and weak readability, characteristics that directly make automatic keyword extraction from the 3GPP technical standards more difficult than from patent documents. Therefore, improving the effect of automatic keyword extraction for the 3GPP technical standards not only helps improve the examination efficiency of communication patents, but is also of great significance for maintaining the stability of patent authorization.
There is a great deal of domestic and foreign research on automatic keyword extraction, which generally falls into two major branches: supervised learning methods and unsupervised learning methods. Supervised learning methods generally convert the keyword extraction problem into a binary or multi-class classification problem in machine learning, mainly involving classification models such as Naive Bayes, Maximum Entropy and Support Vector Machines. Although such methods achieve good prediction results to a certain extent, the extraction effect often depends on the labeling quality and scale of the training corpus, excessive human input cannot be avoided, and they are difficult to adapt to the massive-data scenarios of practical applications. The most obvious advantage of unsupervised learning methods over supervised ones is that labor cost is greatly saved; by algorithmic idea they can be divided into statistics-based, topic-model-based and word-graph-model-based extraction methods. Statistics-based extraction methods generally measure the weight of candidate keywords by combining statistical indexes such as term frequency, term frequency-inverse document frequency (TF-IDF) and the χ² value; such methods are sensitive to frequency and easily miss some important low-frequency words.
The most classical representative of topic-model-based extraction is the LDA (Latent Dirichlet Allocation) algorithm: by analyzing the training corpus, the LDA model infers the "document-topic" and "topic-term" probability distributions from the known "document-term" matrix, and its extraction effect depends on the topic distribution characteristics of the training set. Among word-graph-based extraction methods, the most widely applied is the TextRank algorithm, whose idea derives from Google's PageRank: sentences or words in the text set form graph nodes, the similarity between nodes (sentences or words) serves as the edge weight, and an iterative voting mechanism ranks the importance of the nodes in the graph model. This method does not depend on the number of texts, but its limitation is that it considers only information internal to a text and ignores the distribution characteristics of vocabulary across different texts. Mainstream methods at the present stage usually extract keywords by fusing the advantages of different methods for specific problems, and still have the following defects: lack of consideration of semantic features, poor recognition of low-frequency keywords, and the like.
Disclosure of Invention
In view of the problems in the prior art, an object of the present invention is to provide an automatic keyword extraction method that avoids the above technical defects.
In order to achieve the above object, the present invention provides the following technical solutions:
an automatic keyword extraction method comprises the following steps: extracting common words in technical standards, extracting candidate keywords, filtering the common words aiming at the candidate keywords, calculating a candidate keyword weight score by integrating position characteristics, word co-occurrence characteristics and context semantic characteristics, calculating a dynamic threshold according to the candidate keyword weight score range, and determining result keywords by using the dynamic threshold.
Further, the automatic keyword extraction method comprises the following steps:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
Further, the step 1) is specifically as follows: parsing the technical standard with Apache POI to remove text noise in the 3GPP technical standard.
Further, the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
Further, step 2) comprises: extracting common words in the technical standard based on the word frequency-document distribution entropy, where the word frequency-document distribution entropy is an uncertainty measure of the distribution state of a word w over the technical standard set. Let a document set consisting of n technical standards be denoted as $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as

$$H(w) = -\sum_{i=1}^{n} P(w, d_i)\,\log P(w, d_i)$$

where $P(w, d_i)$ is the probability that the word w occurs in technical standard $d_i$, $1 \le i \le n$; according to the maximum likelihood estimation method, $P(w, d_i)$ is calculated as

$$P(w, d_i) = \frac{f(w, d_i)}{\sum_{i=1}^{n} f(w, d_i)}$$

where $f(w, d_i)$ is the number of occurrences of the word w in technical standard $d_i$.
Further, extracting candidate keywords based on the dependency parse tree includes:

Step 1: traverse the technical standard set D; divide each technical standard $d_i$ in D into sentences according to punctuation, the divided sentence set being represented as $sentences(d_i) = \{s_1, s_2, \ldots, s_{n_s}\}$, where $n_s$ is the number of sentences in document $d_i$;

Step 2: perform dependency syntax analysis on each sentence in $sentences(d_i)$ with the Stanford Parser to obtain the corresponding dependency parse tree set $Trees(d_i) = \{T_1, T_2, \ldots, T_{n_s}\}$, where $T_i$ denotes the dependency parse tree corresponding to the i-th sentence of technical standard $d_i$;

Step 3: cyclically read the set $Trees(d_i)$; for any dependency parse tree $T_i \in Trees(d_i)$, take each word together with its part of speech as a leaf node and traverse $T_i$ in order; if the current node is a leaf node, judge whether its part of speech is a noun, verb or adjective, and if so add the node to the candidate keyword set, otherwise jump to the next node; if the current node is not a leaf node, judge whether it is a noun phrase, and if so continue to recursively traverse its right subtree until the subtree contains no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added to the candidate keyword set as a whole;

Step 4: further filter the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, remove that element from the set.
Further, the position feature weight is calculated as follows: for the text corresponding to each heading level of the 3GPP technical standard, sentence sets are divided using punctuation as boundaries, and the sentences in each set are numbered sequentially from 1. Let the candidate keyword set of technical standard $d_i$ be $CK(d_i) = \{ck_1, ck_2, \ldots, ck_i, \ldots, ck_n\}$, where $ck_i$ is any candidate keyword in the set and n is the number of candidate keywords, and let the special position set be

SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE}.

$locate(ck_i)$ denotes the position where candidate keyword $ck_i$ appears, and a characteristic function $Pos(ck_i)$ assigns the weight of $ck_i$ in the position dimension:

$$Pos(ck_i) = \begin{cases} 1, & locate(ck_i) \in SP \\[4pt] \dfrac{Snu_{ck_i} - Sno_{ck_i} + len(ck_i)}{Snu_{ck_i} + len(ck_i)}, & \text{otherwise} \end{cases}$$

where $Sno_{ck_i}$ is the number of the sentence in which $ck_i$ is located, $Snu_{ck_i}$ is the number of sentences in the text paragraph containing $ck_i$, and $len(ck_i)$ is the number of words contained in $ck_i$. The weights over all occurrences are averaged; let $W(Pos(ck_i))$ denote the average position weight, then

$$W(Pos(ck_i)) = \frac{1}{fre(ck_i)} \sum_{j=1}^{fre(ck_i)} Pos_j(ck_i)$$

where $fre(ck_i)$ is the frequency of $ck_i$ in the same technical standard and $Pos_j(ck_i)$ is the position weight of its j-th occurrence.
Further, the word co-occurrence feature weight is calculated as follows:

Let the candidate keyword set of all technical standards be $CK = \{CK(d_1), CK(d_2), \ldots, CK(d_i), \ldots, CK(d_n)\}$. For any candidate keyword $ck_i$ of technical standard $d_i$, let the constituent words of $ck_i$ be $cw_1, cw_2, \ldots, cw_i, \ldots, cw_m$, where m is the number of words contained in $ck_i$, and let the co-occurring word set of $cw_i$ be $cocur_i = \{wco_1, wco_2, \ldots, wco_j, \ldots, wco_p\}$, where p is the size of the co-occurrence set, $wco_j$ denotes a co-occurring word of $cw_i$ with $wco_j \in CK(d_i)$, and $wco_1 \cap wco_2 \cap \ldots \cap wco_j \cap \ldots \cap wco_p = \{cw_i\}$, $1 \le j \le p$. The contribution of $cw_i$ to the word co-occurrence weight of candidate keyword $ck_i$ is expressed as

$$Coo(cw_i) = \sum_{j=1}^{p} fre(wco_j) \times len(wco_j)$$

where $fre(wco_j)$ is the frequency of the co-occurring word $wco_j$ and $len(wco_j)$ is the number of words contained in $wco_j$. When the candidate keyword $ck_i$ comprises several words, the weight of $ck_i$ in the word co-occurrence dimension is calculated as

$$W(Coo(ck_i)) = \frac{1}{m} \sum_{k=1}^{m} Coo(cw_k)$$
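A sketch of the co-occurrence weighting under one reading of the description (co-occurring-word frequency and length combined as a product and averaged over the m constituent words; the exact combination in the original equations, shown as images, may differ):

```python
def cooccurrence_weight(candidate, candidate_freqs):
    """W(Coo(ck_i)): word co-occurrence weight of a candidate keyword.

    `candidate_freqs` maps every candidate keyword of one document to its
    frequency.  For each constituent word cw of `candidate`, its
    co-occurring words are the *other* candidates that contain cw; each
    contributes frequency * length (a hedged reconstruction).
    """
    words = candidate.split()
    total = 0.0
    for cw in words:
        cooc = [ck for ck in candidate_freqs
                if ck != candidate and cw in ck.split()]
        total += sum(candidate_freqs[ck] * len(ck.split()) for ck in cooc)
    return total / len(words)   # average over the m constituent words

freqs = {"MCH": 4, "MCH transmission": 2, "MCH subframe allocation": 1}
# co-occurring words of "MCH": "MCH transmission" (2*2) and
# "MCH subframe allocation" (1*3), so (4 + 3) / 1 = 7.0
print(cooccurrence_weight("MCH", freqs))   # 7.0
```

The example reuses the "MCH" candidates discussed later in the description: a word that participates in many longer candidates accumulates a high co-occurrence weight.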
Further, the context semantic feature weight is calculated as follows:

The computing task is decomposed into independently predicting, from the current word w, each word composing context(w) with maximum probability; the objective function is

$$L(\theta) = \sum_{w \in D} \sum_{c_i \in context(w)} \log P(c_i \mid w)$$

where $c_i \in context(w)$, D is the technical standard corpus, and θ is the model parameter. The conditional probability $P(c_i \mid w)$ is expressed as

$$P(c_i \mid w) = \frac{\exp(v_{c_i} \cdot v_w)}{\sum_{c'} \exp(v_{c'} \cdot v_w)}$$

where $v_{c_i}$ and $v_w$ are the vector representations of the words $c_i$ and w respectively, c' ranges over all non-repeating words in the corpus, and $v_{c'}$ is the vector representation of c'. Each technical standard $d_i$ in the technical standard set D is viewed as being composed of a series of words $w_1 \ldots w_i \ldots w_n$. Assuming mutual independence between the words, for each candidate keyword $ck_i$ in technical standard $d_i$: if $ck_i$ is of the word type, the prediction probability is calculated as

$$P(w_1 \ldots w_i \ldots w_n \mid ck_i) = \prod_{i=1}^{n} P(w_i \mid ck_i)$$

and if $ck_i$ is of the phrase type, with constituent words $cw_1, \ldots, cw_k, \ldots, cw_m$, the calculation formula is

$$P(w_1 \ldots w_i \ldots w_n \mid ck_i) = \prod_{i=1}^{n} \prod_{k=1}^{m} P(w_i \mid cw_k)$$

Taking the logarithm of both sides, the left-hand side $\log P(w_1 \ldots w_i \ldots w_n \mid ck_i)$ is used as the weight measure of candidate keyword $ck_i$ in the semantic dimension, denoted $W(Sem(ck_i))$; $\log P(w_1 \ldots w_i \ldots w_n \mid ck_i)$ is approximated as $\log P(c_1 \ldots c_i \ldots c_n \mid ck_i)$, where $c_1 \ldots c_i \ldots c_n$ is the context of $ck_i$ within the model window, abbreviated $Context(ck_i)$. Then $W(Sem(ck_i))$ is calculated as

$$W(Sem(ck_i)) = \log P(Context(ck_i) \mid ck_i) = \sum_{c_j \in Context(ck_i)} \log P(c_j \mid ck_i)$$
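The semantic scoring can be sketched with a Skip-gram-style softmax over toy embedding matrices (random matrices stand in for a trained model; in practice the vectors would come from training on the technical standard corpus, and all names here are illustrative):

```python
import numpy as np

def semantic_weight(candidate_words, context_words, vocab, emb_in, emb_out):
    """W(Sem(ck_i)): log-probability of the candidate's context under a
    Skip-gram-style model, summed over context words.

    `emb_in`/`emb_out` are the input and output embedding matrices of a
    trained model (toy random matrices here); for a phrase, the scores of
    the component words are summed, one reading of the phrase-type formula.
    """
    idx = {w: i for i, w in enumerate(vocab)}
    score = 0.0
    for cw in candidate_words:
        v_w = emb_in[idx[cw]]
        logits = emb_out @ v_w                               # v_c' . v_w for all c'
        log_probs = logits - np.log(np.sum(np.exp(logits)))  # log softmax
        score += sum(log_probs[idx[c]] for c in context_words)
    return score

rng = np.random.default_rng(0)
vocab = ["mch", "transmission", "subframe", "allocation", "scheduling"]
emb_in = rng.normal(size=(5, 8))
emb_out = rng.normal(size=(5, 8))
w = semantic_weight(["mch"], ["transmission", "scheduling"], vocab, emb_in, emb_out)
print(w)   # a (negative) sum of log-probabilities
```

Candidates whose contexts are well predicted by their own vectors receive a less negative score, i.e. a higher semantic weight.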
Further, the step 4) comprises:

For any candidate keyword $ck_i$ of technical standard $d_i$, the position feature, word co-occurrence feature and context semantic feature are considered comprehensively, and the weight score of $ck_i$ over the three feature dimensions is calculated as

$$W(ck_i) = W(Pos(ck_i)) + W(Coo(ck_i)) + W(Sem(ck_i))$$

Let the score of each candidate keyword $ck_i$ of $d_i$ be recorded as $Score(d_i) = \{W(ck_1), \ldots, W(ck_i), \ldots, W(ck_n)\}$. The scores in $Score(d_i)$ are ranked from high to low, and a dynamic threshold λ is set as the average of all scores:

$$\lambda = \frac{1}{n} \sum_{i=1}^{n} W(ck_i)$$

If a candidate keyword in $d_i$ satisfies $W(ck_i) \ge \lambda$, then $ck_i$ is added to the result keyword set.
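The final scoring and dynamic-threshold selection can be sketched as follows (the candidate scores are hypothetical; in practice each score is the sum of the three feature weights):

```python
def select_keywords(weight_scores):
    """Result keywords via the dynamic threshold λ = mean of all scores.

    `weight_scores` maps candidate keyword -> W(ck_i), already the sum of
    position, co-occurrence and semantic weights; candidates with
    W(ck_i) >= λ enter the result set, ranked high to low.
    """
    lam = sum(weight_scores.values()) / len(weight_scores)
    kept = [ck for ck, s in weight_scores.items() if s >= lam]
    return sorted(kept, key=weight_scores.get, reverse=True), lam

scores = {"MCH": 3.2, "subframe": 1.1, "MCH transmission": 2.9, "are": 0.4}
keywords, lam = select_keywords(scores)
print(round(lam, 6))   # 1.9
print(keywords)        # ['MCH', 'MCH transmission']
```

Because λ adapts to each document's score range, documents with uniformly strong candidates keep more keywords than a fixed cutoff would allow.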
The automatic keyword extraction method provided by the invention extracts keywords by fusing the position feature, the word co-occurrence feature and the context semantic feature, and comprehensively considers the weight influence of in-document position and context semantics on keywords. It thereby achieves higher accuracy and recall, improves the retrieval quality of the 3GPP technical standard, reduces labor cost, and well meets the requirements of practical application.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a dependency parse tree;
FIG. 3 is a comparison of the CBOW and Skip-gram model frameworks.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides an automatic keyword extraction method. First, common words in the 3GPP technical standard are extracted by a method based on the word frequency-document distribution entropy; then candidate keywords are extracted by an algorithm based on the dependency parse tree; after the common words are filtered from the candidate keywords, a candidate keyword weight score is calculated by integrating position features, word co-occurrence features and context semantic features; a dynamic threshold is calculated according to the candidate keyword weight score range of each technical standard; and finally the result keywords are determined using the dynamic threshold. Specifically, as shown in FIG. 1, the automatic keyword extraction method includes the following steps:
step 1) preprocessing the 3GPP technical standard, which mainly comprises parsing the technical standard with Apache POI to remove text noise such as pictures, tables, formulas, special symbols and illegal characters;
step 2) extracting common words in all technical standards based on the word frequency-document distribution entropy;
step 3) segmenting each technical standard into sentence subsets, performing dependency syntax analysis on each sentence, extracting candidate keywords based on a dependency syntax analysis tree and filtering common words;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
The 3GPP technical standards include not only simple stop words such as "if", "at", "not" and "or", but also general words that run through most technical standards, such as "Figure", "version", "general" and "seven", which are specific to the technical standards and are neither representative nor important. It has been observed that both the simple stop words and the general words unique to the technical standards occur with different frequencies in technical standards of different versions and types, are highly common, and generally cannot generalize or abstract the content of a particular technical standard. These words are collectively referred to as common words.
Clearly, coverage is not comprehensive enough if only a manually collected stop word list is used. Therefore, in order to reduce the interference of common words with the keyword extraction task as much as possible, the concept of word frequency-document distribution entropy is introduced, combining the information entropy principle, to automatically obtain the technical standard common words. Information entropy was first introduced into information theory by Shannon to measure the uncertainty of a discrete random variable; the larger the information entropy, the greater the uncertainty of the corresponding random variable. Similarly, regarding the word w as a random variable, the definition of the word frequency-document distribution entropy is given as follows.
Definition 1 term frequency-document distribution entropy refers to a measure of uncertainty in the state of distribution of a word w in a set of technical standards.
Let a document set consisting of n technical standards be denoted as $D = \{d_1, d_2, \ldots, d_i, \ldots, d_n\}$, and let the word frequency-document distribution entropy of the word w be H(w); then H(w) is calculated as shown in formula (1):

$$H(w) = -\sum_{i=1}^{n} P(w, d_i)\,\log P(w, d_i) \qquad (1)$$

where $P(w, d_i)$ is the probability that the word w occurs in technical standard $d_i$, $1 \le i \le n$. According to the maximum likelihood estimation method, $P(w, d_i)$ can be calculated from formula (2):

$$P(w, d_i) = \frac{f(w, d_i)}{\sum_{i=1}^{n} f(w, d_i)} \qquad (2)$$

where $f(w, d_i)$ is the number of occurrences of the word w in technical standard $d_i$. It can be seen that the more technical standards contain w and the more uniformly w is distributed over the technical standard set, the larger the word frequency-document distribution entropy H(w), indicating greater uncertainty of the distribution of w over D; w is then more likely to be a general word of no importance in the technical standard set.
Statistics show that most keywords are generally content-word phrases such as nouns, verbs and adjectives, and contain neither stop words without practical meaning nor general words distributed uniformly across the technical standards. Thus, the keyword categories are defined as verbs, adjectives, nouns and noun phrases after removal of the common words. In order to extract candidate keywords with semantic consistency and complete syntactic modification, dependency syntax analysis is first performed on the sentences in the 3GPP technical standard, and then noun phrases, verbs, adjectives and nouns satisfying syntactic modification consistency are extracted from the dependency parse tree and added to the candidate keyword set. For noun phrases, the NP of minimum granularity in the parse tree is taken as the candidate keyword. Finally, the common words are filtered from the candidate keyword set. For example, the parsing result of the sentence "Logical channels are SAPs between MAC and RLC" is shown in FIG. 2.
As can be seen from FIG. 2, the adjective "logical" modifies the noun "channels", and "logical" and "channels" form a noun phrase (NP); "SAPs" and "MAC and RLC" are noun phrases (NP), and "SAPs between MAC and RLC" as a whole is also a noun phrase (NP). In the dependency parse tree, however, the noun phrase "MAC and RLC" and "between" form a prepositional phrase (PP) that, together with "SAPs", is a child node of the NP, the two being siblings. Clearly, the noun phrase "MAC and RLC" has smaller granularity than the noun phrase "SAPs between MAC and RLC". Therefore "logical", "channels", "logical channels", "are", "SAPs", "MAC", "RLC" and "MAC and RLC" are selected as candidate keywords of the example sentence, and the candidate keywords are then filtered using the extracted common words. According to the above analysis, the candidate keyword extraction algorithm based on the dependency parse tree comprises the following steps:
step 1: traversing the technical standard set D, for each technical standard D in D i Dividing into sentences according to punctuations, and representing the divided sentence sets as
Figure GSB0000175921600000101
n s As a document d i The number of Chinese sentences.
Step 2: for set sequences (d) i ) Each sentence in the tree is subjected to dependency syntax analysis by using a Stanford Parser syntax analyzer to obtain a corresponding dependency syntax analysis tree set Trees (d) i ) Memory for recording
Figure GSB0000175921600000102
Wherein T is i Indicates technical standard d i And (4) the dependency parsing tree corresponding to the ith sentence.
Step 3: cyclic read dependency parse tree set Trees (d) i ) For any dependency syntax tree T i ∈Trees(d i ) Taking the words and corresponding parts of speech in the syntactic dependency tree as a whole as leaf nodes, and traversing the T in a medium-order and orderly mode i If the current node is a leaf node (not the last leaf node), judging whether the part of speech of the node is a noun, a verb and an adjective, adding the node into the candidate keyword set if the conditions are met, and otherwise, jumping to the next node; if the current node is not a leaf node, judging whether the current node is a Noun Phrase (NP) or not, if the current node is the noun phrase and the right subtree is not empty, continuing to recursively traverse the right subtree of the current node until no non-leaf node taking the NP as a father node exists in the subtree, and adding the child nodes of the NP into the candidate keyword set as a whole.
Step 4: because some technical standard common words still exist in the candidate keywords extracted in the previous step, the extracted common words are required to be used for further filtering the candidate keyword set, and if an element containing the common words exists in the candidate keyword set, the element is removed from the candidate keyword set.
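Steps 1-4 can be sketched as follows; for self-containedness the parse tree of the example sentence analyzed in FIG. 2 is hand-built as nested `(label, children)` tuples standing in for Stanford Parser output (a simplified stand-in, not the parser's actual data structure):

```python
# A parse tree node is (label, children); a preterminal is (POS-tag, word).
def leaves(node):
    _, kids = node
    return [kids] if isinstance(kids, str) else [w for k in kids for w in leaves(k)]

def subtrees(node):
    yield node
    _, kids = node
    if not isinstance(kids, str):
        for k in kids:
            yield from subtrees(k)

def extract_candidates(tree, common_words):
    """Minimal-granularity NPs plus single nouns/verbs/adjectives,
    filtered against the extracted common words (Steps 3 and 4)."""
    out = []
    for node in subtrees(tree):
        label, _ = node
        if label == "NP" and all(l != "NP" for l, _ in list(subtrees(node))[1:]):
            out.append(" ".join(leaves(node)))   # NP with no nested NP
    for label, kids in subtrees(tree):
        if isinstance(kids, str) and label.startswith(("NN", "VB", "JJ")):
            out.append(kids)                     # noun / verb / adjective
    return [c for c in dict.fromkeys(out)
            if not set(c.lower().split()) & common_words]

# hand-built parse of "Logical channels are SAPs between MAC and RLC"
parse = ("S", [
    ("NP", [("JJ", "logical"), ("NNS", "channels")]),
    ("VP", [("VBP", "are"),
            ("NP", [("NP", [("NNS", "SAPs")]),
                    ("PP", [("IN", "between"),
                            ("NP", [("NNP", "MAC"), ("CC", "and"),
                                    ("NNP", "RLC")])])])]),
])
print(extract_candidates(parse, common_words={"are"}))
# ['logical channels', 'SAPs', 'MAC and RLC', 'logical', 'channels', 'MAC', 'RLC']
```

Note how "SAPs between MAC and RLC" is skipped in favor of its minimal-granularity inner NPs "SAPs" and "MAC and RLC", and how the common word "are" is filtered in Step 4.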
By analyzing the characteristics of the 3GPP technical standard, it can be found that, apart from the body text, the Scope, Reference, Definitions and Abbreviations portions have important reference value for the whole document and should be regarded as key positions. The content of each chapter of the body generally unfolds around its nearest heading, so headings can be regarded as a condensation of the core content of the corresponding paragraphs, and candidate keywords appearing at these positions should be given higher weight. Similarly, what appears in NOTEs generally serves as additional emphasis or supplementary description of the text, and should therefore also be treated as a special position.
Therefore, the position of the candidate keyword appearing in the 3GPP technical standard is taken as a weight influence factor. The method comprises the steps of respectively dividing sentence subsets by using punctuations as boundaries aiming at text parts corresponding to titles of different levels of the 3GPP technical standard, numbering the sentences in the sentence subsets from 1 in sequence, and if the number of the sentence in which a candidate keyword is positioned is smaller, the keyword is closer to the title, which indicates that the keyword is more likely to be the key word. Recording technical standard d i Middle candidate keyword set CK (d) i )={ck 1 ,ck 2 ...ck i ...ck n In which ck i For any candidate keyword in the set, n is the number of the candidate keywords, and the special position set is recorded as
SP={Title,Scope,Reference,Definitions,Abbrevations,NOTE},
locate(ck i ) Representing candidate keywords ck i Position of occurrence, defining a characteristic function Pos (ck) i ) Representing candidate keywords ck i The weight in the dimension of the occurrence position is assigned, then Pos (ck) i ) Can be expressed as shown in equation (3).
[Equation (3) appears only as an image in the original (Figure GSB0000175921600000111); per the surrounding description, it assigns weight 1 when locate(ck_i) ∈ SP and otherwise a weight that decreases with the sentence number Sno_{ck_i}, with len(ck_i) added to the denominator to keep the weight positive.]
wherein Sno_{ck_i} represents the number of the sentence in which candidate keyword ck_i lies, Snu_{ck_i} represents the number of sentences in that text paragraph, and len(ck_i) represents the number of words ck_i contains. len(ck_i) is added to the denominator to avoid a position weight of 0. Since candidate keyword ck_i may occur multiple times at different positions in technical standard d_i, the weights at the different positions are averaged; denoting the average position weight by W(Pos(ck_i)), its calculation is shown in equation (4).
W(Pos(ck_i)) = (1/fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i)    (4)
wherein fre(ck_i) represents the frequency of occurrence of candidate keyword ck_i in the same technical standard. Taking the average enhances the weight of a candidate keyword ck_i that occurs with low frequency but in a special position, and weakens the bias that would arise from weighting candidate keywords by the frequency feature alone.
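Since equations (3) and (4) are rendered only as images here, the position-weight computation can be sketched in Python under stated assumptions: the piecewise form of `pos_weight`, the decay rule for body sentences, and all names below are illustrative guesses consistent with the description, not the patent's exact formula.

```python
# Hypothetical sketch of the position weight (equations (3)-(4) are images
# in the original, so the exact decay formula here is an assumption):
# occurrences in a special position get full weight 1.0; occurrences in the
# body decay with the sentence number, with len(ck) added to the denominator
# so the weight never reaches 0.
SPECIAL_POSITIONS = {"Title", "Scope", "Reference", "Definitions",
                     "Abbreviations", "NOTE"}

def pos_weight(location, sno, snu, length):
    """Weight of one occurrence: location name, sentence number sno (1-based),
    sentence count snu of the paragraph, and word count of the candidate."""
    if location in SPECIAL_POSITIONS:
        return 1.0
    return (snu - sno + 1) / (snu + length)

def avg_pos_weight(occurrences, length):
    """Average the per-occurrence weights over all fre(ck) occurrences."""
    weights = [pos_weight(loc, sno, snu, length)
               for loc, sno, snu in occurrences]
    return sum(weights) / len(weights)

# A two-word candidate appearing once in Scope and once as sentence 3 of a
# 5-sentence body paragraph:
w = avg_pos_weight([("Scope", 0, 0), ("body", 3, 5)], length=2)
```

Averaging keeps a candidate that appears once in a special position competitive even if it is rare in the body, which matches the motivation given above.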
Word co-occurrence is a factor that cannot be neglected in keyword extraction. Observing the extracted candidate keywords of the 3GPP technical standards, one finds that a constituent word of one candidate repeatedly appears in other candidates of different lengths. For example, among the three candidate keywords "MCH", "MCH transmission" and "MCH subframe allocation", the word "MCH" occurs in the two other candidates of different lengths, so "MCH transmission" and "MCH subframe allocation" can be regarded as co-occurring words of "MCH"; such co-occurring words often express more specific information than the single constituent word. Therefore, if a word constituting a candidate keyword has many co-occurring words, the word is considered to have a richer meaning and should be given a higher weight. Based on this analysis, the frequency and word length of the co-occurring words of a candidate keyword's constituent words are used as the word co-occurrence feature to calculate the candidate keyword's weight.
Record the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}. For any candidate keyword ck_i of technical standard d_i, record its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and record the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set (i.e., the number of co-occurring words it holds), wco_j represents a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p. Then the contribution of cw_i to candidate keyword ck_i can be expressed by equation (5);
con(cw_i) = Σ_{j=1}^{p} fre(wco_j) × len(wco_j)    (5)
wherein fre(wco_j) represents the frequency of occurrence of wco_j, the co-occurring word of cw_i, and len(wco_j) represents the number of words wco_j contains. When candidate keyword ck_i comprises multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as shown in equation (6).
W(Coo(ck_i)) = (1/m) · Σ_{i=1}^{m} con(cw_i)    (6)
It can be seen that the more frequently the constituent words of candidate keyword ck_i occur in co-occurring words, the greater each constituent word's contribution to ck_i, and hence the greater the weight of ck_i in the word co-occurrence dimension.
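A minimal sketch of the word co-occurrence weight follows. Because equations (5) and (6) appear only as images, the combination rule assumed here — `con(cw)` sums fre(wco)·len(wco) over the co-occurring candidates, and the candidate weight averages its constituent-word contributions — is an illustrative reading of the description, and all names are hypothetical.

```python
def contribution(word, candidate_set, frequency):
    """con(cw): assumed sum of fre(wco) * len(wco) over the candidates wco
    that contain cw (a sketch of equation (5), which is an image in the
    original)."""
    total = 0
    for wco in candidate_set:
        words = wco.split()
        if word in words and wco != word:  # wco is a co-occurring word of cw
            total += frequency[wco] * len(words)
    return total

def coo_weight(candidate, candidate_set, frequency):
    """Assumed W(Coo(ck)): average the contributions of the m constituent
    words of the candidate (a sketch of equation (6))."""
    parts = candidate.split()
    return sum(contribution(cw, candidate_set, frequency)
               for cw in parts) / len(parts)

# The MCH example from the text: "MCH" has two co-occurring candidates.
candidates = {"MCH", "MCH transmission", "MCH subframe allocation"}
freq = {"MCH": 5, "MCH transmission": 2, "MCH subframe allocation": 1}
w = coo_weight("MCH", candidates, freq)
```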
Keywords generally condense the core content of a technical standard and share the property of concentrating its gist at different semantic levels. Therefore, the influence of the candidate keywords' contextual semantic features on the weights cannot be ignored. Considering that word vectors represent semantic characteristics well, Word2vec is introduced to calculate the weight of the candidate keywords in the semantic-feature dimension.
Word2vec is a tool, proposed by *** based on the deep-learning idea, that addresses problems such as the lack of generalization power and the curse of dimensionality in statistical language models. Word2vec comprises two training models, CBOW and Skip_gram; to reduce the complexity of model solving, two training optimization methods, Hierarchical Softmax (HS) and Negative Sampling (NS), are provided, and training frameworks are formed by combining the models with the optimization methods. As shown in fig. 3, the frameworks formed by the two models share the structure of an Input Layer, a Projection Layer and an Output Layer; the difference is that the CBOW-based framework predicts the current word w from the semantic context in which words appear, while the Skip_gram-based framework predicts the contextual semantic information from the current word w.
To predict Context(w) (with window size c) from the current word w, the Skip_gram model decomposes the task into independently maximizing the probability of predicting each word of Context(w) from w, with the objective function
L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w)
wherein c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed by Softmax normalization, as shown in equation (7);
P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c′} exp(v_{c′} · v_w)    (7)
wherein v_{c_i} and v_w are the vector representations of the words c_i and w respectively; c′ ranges over all non-repeating words in the corpus, whose number is large, so Hierarchical Softmax or Negative Sampling can be adopted for optimization, and v_{c′} is the vector representation of c′. Each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n. Assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i, the prediction probability is calculated with equation (8) if ck_i is of word type, and with equation (9) if it is of phrase type;
P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i)    (8)
[Equation (9) appears only as an image in the original (Figure GSB0000175921600000144); it gives the corresponding prediction probability for phrase-type candidate keywords.]
wherein P (w) j |ck i ) By using the calculation of equation (7) for the variable substitution, it can be seen that when the probability P (w) is predicted 1 …w i …w n |ck i ) The larger the candidate keyword ck is, the larger the candidate keyword ck is i The better the effect of predicting context information, the more likely it is a keyword that characterizes full-text information. In order to avoid as far as possible the occurrence of extremely small errors due to the excessively small conditional probability in the continuous multiplication calculation, the logP (w) on the left side is obtained by taking the logarithm of both sides of the above equation 1 …w i …w n |ck i ) As a measure of candidate keywords ck i The weight measure in the semantic dimension is denoted as W (Sem (ck) i ) And simultaneously considering that the relation is established for similar words when Word2vec training corpus is considered, logP (w) is used for simplifying calculation 1 …w i …w n |ck i ) Approximately as logP (c) 1 …c i …c n |ck i ) Wherein w is 1 …w i …w n As candidate keywords ck i Context within the scope of the model window, abbreviated as Context (ck) i ) Then W (seq (ck) i ) The calculation method is shown in formula (10);
W(Sem(ck_i)) = log P(Context(ck_i) | ck_i) = Σ_{c_j∈Context(ck_i)} log P(c_j | ck_i)    (10)
to technical standard d i Any one of the candidate keywords ck i Comprehensively considering the position feature, the word co-occurrence feature and the context semantic feature, and calculating the candidate keywords ck by adopting a formula (11) i Weight scores in three feature dimensions.
W(ck i )=W(Pos(ck i ))+W(Coo(ck i ))+W(Sem(ck i )) (11)。
Fusing the three different features avoids the impact that any single feature's inadequacy would have on keyword extraction. Record the scores corresponding to the candidate keywords ck_i of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, rank the scores in Score(d_i) from high to low, and set a dynamic threshold λ equal to the average of all scores, calculated as shown in equation (12);
λ = (1/n) · Σ_{i=1}^{n} W(ck_i)    (12)
if d is i The middle candidate keyword satisfies W (ck) i ) When the k is more than or equal to lambda, ck is i And adding the result into the result keyword set. The reason why the fixed threshold is not selected is that different technical standards have differences in length, and the candidate keyword score ranges calculated by the different technical standards are different, so that a dynamic threshold is set for the actual score range of a single technical standard.
Experiments were carried out with this method. The experimental data were selected from the technical standards (including technical specifications and technical reports) on the 3GPP website as of 2016; after denoising, 8000 items of experimental data were obtained in total. The valid series numbers of the technical standards range over 01-12, 21-38, 41-46, 48-52 and 55, 42 series in total; each series comprises multiple versions, 14 GB in size altogether, and each technical standard consists of Cover, Foreword, Scope, Reference, Definitions and Abbreviations, topic body and Annex parts.
In the experiments, three evaluation indexes commonly used in natural language processing tasks, precision (P), recall (R) and F-value (F-Score), are adopted to evaluate the keyword extraction effect; their calculation methods are shown in equations (13) to (15) respectively.
P = number of correctly extracted keywords / total number of extracted keywords    (13)
R = number of correctly extracted keywords / number of reference keywords    (14)
F = 2 × P × R / (P + R)    (15)
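Equations (13) to (15) can be sketched as follows; the toy keyword sets are illustrative.

```python
def prf(extracted, reference):
    """Precision, recall and F-score of an extracted keyword set against a
    manually labelled reference set (equations (13)-(15))."""
    correct = len(set(extracted) & set(reference))
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 2 of 3 extracted keywords appear in a 4-keyword reference set:
p, r, f = prf(["MCH", "subframe", "sync"],
              ["MCH", "subframe", "allocation", "QoS"])
```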
Technical standard common words were extracted from the preprocessed technical standards with the method based on word frequency-document distribution entropy. Multiple experiments gave an optimal word frequency-document distribution entropy threshold of 5.42; words above the threshold were selected as technical standard common words, 13566 common words in total. Part of the common-word extraction results are shown in table 2.
Table 2 Partial common word extraction results

Serial number | Common word   | H(W)    | Serial number | Common word | H(W)
1             | version       | 10.9665 | 11            | all         | 9.9539
2             | should        | 10.8165 | 12            | possible    | 9.8908
3             | latest        | 10.7022 | 13            | foreword    | 9.8543
4             | approve       | 10.6394 | 14            | through     | 9.8097
5             | specification | 10.5639 | 15            | modify      | 9.7739
6             | update        | 10.4934 | 16            | restriction | 9.6978
7             | present       | 10.2963 | 17            | this        | 9.6536
8             | within        | 10.1056 | 18            | available   | 9.6281
9             | be            | 10.0572 | 19            | release     | 9.5941
10            | further       | 10.0188 | 20            | when        | 9.5148
As can be seen from table 2, the algorithm based on word frequency-document distribution entropy extracts not only common stop words such as "all", "this" and "when", but also words common in the technical standards, for example "version", "specification" and "release". With this method, most technical standard common words can be obtained effectively.
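The word frequency-document distribution entropy behind table 2 (defined in claim 4) can be sketched as follows. The patent does not state the logarithm base; base 2 is assumed here, which is consistent with H(W) magnitudes around 10 over 8000 documents.

```python
import math

def distribution_entropy(word_counts):
    """H(w) = -Σ P(w, d_i) log2 P(w, d_i), where P(w, d_i) is the share of
    w's total frequency falling in document d_i (claim 4; log base 2 is an
    assumption). Words spread evenly over many documents get high entropy
    and are treated as common words."""
    total = sum(word_counts)
    probs = [c / total for c in word_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A word appearing once in each of 8 documents is maximally spread out,
# while one concentrated in a single document has low entropy:
h_even = distribution_entropy([1] * 8)      # log2(8) = 3.0
h_peaky = distribution_entropy([20, 1, 1])  # concentrated -> low entropy
```

Thresholding H(w) (5.42 in the experiments) then separates evenly-spread common words from topical words.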
After the candidate keyword set of each technical standard is filtered with the common-word list, the weights corresponding to the position feature, word co-occurrence feature and contextual semantic feature are calculated respectively. For the contextual semantic features, the experiments trained on the 14 GB of technical standards using the Skip-gram model in Word2vec with the Hierarchical Softmax optimization method, with the context window set to 10 and the vector dimension set to 200; a 965.1 MB model file was obtained after 10 iterations. To analyze the effect of different features on technical standard keyword extraction, the comparison feature combinations set in the experiments are shown in table 3.
TABLE 3 combination of features
[Table 3 appears only as an image in the original (Figure GSB0000175921600000171); it enumerates the feature combinations Feature1-Feature7 built from the position, word co-occurrence and context semantic features.]
Combining equations (3) to (11), the candidate keyword scores of each technical standard under the different feature combinations were calculated; the dynamic threshold was then calculated with equation (12), and the candidate keywords meeting the condition were screened out as the identified keywords. At the same time, 1000 technical standards covering different series and versions were randomly drawn from the 8000, and 2, 4, 6, 8 and 10 keywords were screened from each, by taking the intersection of three annotators' cross-labeling, to form the reference keyword set. The identified keywords and the manually labeled reference keywords were each lemmatized and then compared; an identified keyword was counted as correct if it had the same form as a labeled keyword or the two were abbreviation and full name of each other. The precision, recall and F-value of the different feature combinations at the different keyword counts were tallied, and the experimental results are shown in table 4.
TABLE 4 extraction results of key words under different feature combinations
[Table 4 appears only as images in the original (Figures GSB0000175921600000172 and GSB0000175921600000181); it lists the precision, recall and F-value of each feature combination when the number of keywords is 2, 4, 6, 8 and 10.]
As can be seen from table 4, when the number of keywords is 2, the recall rates of Feature1, Feature4, Feature5 and Feature7 are higher than those of the other feature combinations. This is because, when the number of keywords is small, candidate keywords appearing in special positions are more likely to be correctly recognized as keywords; at the same time, words in special positions provide less contextual semantic information, so the position feature is relatively dominant. As the number of keywords grows from 2, comparing Feature1 and Feature3 shows that the recall corresponding to Feature1 rises slowly and then gradually declines; Feature2 clearly increases precision and recall when the number of keywords is 4-8, after which precision falls somewhat; and when the number of keywords exceeds 6, Feature3 increases recall. This shows that as the number of keywords increases, the influence of position on keyword weight gradually decreases, while the influence of the word co-occurrence and contextual semantic features gradually increases. Meanwhile, comparing Feature5 with Feature7 shows that precision and recall both increase after the word co-occurrence feature is added. This is because word co-occurrence helps identify more phrase-type keywords, which are likely to correspond to abbreviated keywords that carry a certain general meaning but hold no positional advantage; as the number of keywords increases, the keywords identified through the word co-occurrence feature are more likely to fall within the reference keyword set. Comparing Feature4 with Feature7 shows that recall increases markedly from 4 keywords onward after the contextual semantic feature is added. The reason is that, as the number of keywords increases, candidate keywords characterized by rich contextual semantic information are more likely to be selected. At the same keyword count, comparing Feature1, Feature2, Feature3 and Feature7 shows that Feature7, by combining the different features, achieves a better recognition effect than any single feature.
The automatic keyword extraction method provided by the invention fuses the position feature, word co-occurrence feature and contextual semantic feature to extract keywords, comprehensively considering the influence of in-document position and contextual semantics on keyword weight. It thereby achieves higher precision and recall, improves the retrieval quality of the 3GPP technical standards, reduces labor cost, and can well meet the requirements of practical application.
The above-described embodiments only express implementations of the present invention, and while their description is specific and detailed, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (6)

1. An automatic keyword extraction method, characterized by comprising: extracting common words; extracting candidate keywords; filtering the common words from the candidate keywords; calculating a candidate keyword weight score by integrating a position feature, a word co-occurrence feature and a contextual semantic feature; calculating a dynamic threshold according to the range of candidate keyword weight scores; and determining the result keywords using the dynamic threshold;
the method for calculating the position feature weight comprises: for the text corresponding to each title level of the 3GPP technical standard, dividing sentence subsets using punctuation as boundaries and numbering the sentences in each subset sequentially from 1; recording the candidate keyword set of technical standard d_i as CK(d_i) = {ck_1, ck_2, …, ck_i, …, ck_n}, where ck_i is any candidate keyword in the set and n is the number of candidate keywords; and recording the special position set as
SP = {Title, Scope, Reference, Definitions, Abbreviations, NOTE},
locate(ck_i) represents the position at which candidate keyword ck_i appears; a characteristic function Pos(ck_i) is defined to represent the weight assigned to candidate keyword ck_i in the position dimension:
[Equation shown only as an image in the original (Figure FSB0000199005510000011); it assigns weight 1 when locate(ck_i) ∈ SP and otherwise a weight that decreases with the sentence number, with len(ck_i) added to the denominator.]
wherein Sno_{ck_i} represents the number of the sentence in which candidate keyword ck_i lies, Snu_{ck_i} represents the number of sentences in that text paragraph, and len(ck_i) represents the number of words ck_i contains; the weights at the different positions are averaged and denoted W(Pos(ck_i)), the average position weight; then
W(Pos(ck_i)) = (1/fre(ck_i)) · Σ_{k=1}^{fre(ck_i)} Pos_k(ck_i)
wherein fre(ck_i) represents the frequency of occurrence of candidate keyword ck_i in the same technical standard;
the word co-occurrence feature weight calculation method comprises the following steps:
recording the candidate keyword sets of all technical standards as CK = {CK(d_1), CK(d_2), …, CK(d_i), …, CK(d_n)}; for any candidate keyword ck_i of technical standard d_i, recording its constituent words as cw_1, cw_2, …, cw_i, …, cw_m, where m is the number of words ck_i contains, and recording the co-occurring word set of cw_i as cocur_i = {wco_1, wco_2, …, wco_i, …, wco_p}, where p is the size of the co-occurring word set, wco_j represents a co-occurring word of cw_i with wco_j ∈ CK(d_i), and wco_1 ∩ wco_2 ∩ … ∩ wco_j ∩ … ∩ wco_p = {cw_i}, 1 ≤ j ≤ p; then the contribution of cw_i to candidate keyword ck_i is expressed as
con(cw_i) = Σ_{j=1}^{p} fre(wco_j) × len(wco_j)
wherein fre(wco_j) represents the frequency of occurrence of wco_j, the co-occurring word of cw_i, and len(wco_j) represents the number of words wco_j contains; when candidate keyword ck_i comprises multiple words, the weight of ck_i in the word co-occurrence dimension is calculated as
W(Coo(ck_i)) = (1/m) · Σ_{i=1}^{m} con(cw_i)
The method for calculating the context semantic feature weight comprises the following steps:
the calculation task is decomposed into the probability maximum value of each word forming the context (w) which is respectively and independently predicted by the current word w, and the objective function is
L(θ) = Σ_{w∈D} Σ_{c_i∈Context(w)} log P(c_i | w)
wherein c_i ∈ Context(w), D is the technical standard corpus, θ is the model parameter, and the conditional probability P(c_i | w) is expressed as
P(c_i | w) = exp(v_{c_i} · v_w) / Σ_{c′} exp(v_{c′} · v_w)
wherein v_{c_i} and v_w are the vector representations of the words c_i and w respectively, c′ ranges over all non-repeating words in the corpus, and v_{c′} is the vector representation of c′; each technical standard d_i in the technical standard set D is viewed as composed of a series of words w_1 … w_i … w_n; assuming mutual independence between words, for each candidate keyword ck_i of technical standard d_i of word type, the prediction probability is calculated as
P(w_1 … w_i … w_n | ck_i) = Π_{j=1}^{n} P(w_j | ck_i)
for any candidate keyword ck_i of technical standard d_i, the position feature, word co-occurrence feature and contextual semantic feature are considered together, and the weight score of ck_i over the three feature dimensions is calculated as
W(ck i )=W(Pos(ck i ))+W(Coo(ck i ))+W(Sem(ck i ));
recording the scores corresponding to the candidate keywords ck_i of d_i as Score(d_i) = {W(ck_1), …, W(ck_i), …, W(ck_n)}, ranking the scores in Score(d_i) from high to low, and setting a dynamic threshold λ as the average of all the scores, calculated as
λ = (1/n) · Σ_{i=1}^{n} W(ck_i)
if a candidate keyword in d_i satisfies W(ck_i) ≥ λ, then ck_i is added to the result keyword set.
2. The method for automatically extracting keywords according to claim 1, wherein the method for automatically extracting keywords comprises:
step 1) removing text noise in the 3GPP technical standard;
step 2) extracting common words in the technical standard;
step 3) extracting candidate keywords and filtering common words based on the syntactic analysis tree;
and 4) comprehensively considering the position characteristics, word co-occurrence characteristics and context semantic characteristics of the candidate keywords in the document, calculating weight scores and sequencing, finally calculating a dynamic threshold according to the actual score range of the technical standard, and adding the candidate keywords with the scores exceeding the threshold into a result keyword set.
3. The method for automatically extracting keywords according to claim 2, wherein the step 1) is specifically: parsing the technical standards with Apache POI to remove the text noise in the 3GPP technical standards.
4. The method for automatically extracting keywords according to claim 2, wherein the step 2) comprises: extracting common words in the technical standards based on word frequency-document distribution entropy, where the word frequency-document distribution entropy is a measure of the uncertainty of the distribution state of a word w over the technical standard set; let the document set composed of n technical standards be D = {d_1, d_2, …, d_i, …, d_n}, and record the word frequency-document distribution entropy of word w as H(w); then H(w) is calculated as
H(w) = −Σ_{i=1}^{n} P(w, d_i) · log_2 P(w, d_i)
wherein P(w, d_i) is the probability that word w appears in technical standard d_i, 1 ≤ i ≤ n; according to the maximum likelihood estimation method, P(w, d_i) is calculated as
P(w, d_i) = f(w, d_i) / Σ_{i=1}^{n} f(w, d_i)
wherein f(w, d_i) is the number of occurrences of word w in technical standard d_i.
5. The method of any of claims 1-4, wherein extracting candidate keywords based on a dependency parsing tree comprises:
step 1: traversing the technical standard set D; dividing each technical standard d_i in D into sentences according to punctuation, and representing the divided sentence set as Sentences(d_i) = {s_1, s_2, …, s_{n_s}}, where n_s is the number of sentences in document d_i;
step 2: performing dependency syntax analysis on each sentence in the set Sentences(d_i) with the Stanford Parser to obtain the corresponding dependency parse tree set Trees(d_i), recorded as Trees(d_i) = {T_1, T_2, …, T_{n_s}},
wherein T_i represents the dependency parse tree corresponding to the i-th sentence of technical standard d_i;
step 3: cyclically reading the dependency parse tree set Trees(d_i); for any dependency parse tree T_i ∈ Trees(d_i), taking each word together with its part of speech in the tree as a whole as a leaf node, and traversing T_i in-order; if the current node is a leaf node, judging whether its part of speech is noun, verb or adjective, and if so, adding the node to the candidate keyword set, otherwise jumping to the next node; if the current node is not a leaf node, judging whether it is a noun phrase, and if so, continuing to traverse its right subtree recursively until the subtree has no non-leaf node with a noun phrase as parent, at which point the child nodes of the noun phrase are added as a whole to the candidate keyword set;
step 4: further filtering the candidate keyword set with the extracted common words: if an element of the candidate keyword set contains a common word, removing that element from the candidate keyword set.
6. The method of claim 2 or 3, wherein the text noise includes pictures, tables, formulas, special symbols, and illegal characters.
CN201810611476.7A 2018-06-13 2018-06-13 Automatic keyword extraction method Active CN108920456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810611476.7A CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Publications (2)

Publication Number Publication Date
CN108920456A CN108920456A (en) 2018-11-30
CN108920456B true CN108920456B (en) 2022-08-30

Family

ID=64419617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810611476.7A Active CN108920456B (en) 2018-06-13 2018-06-13 Automatic keyword extraction method

Country Status (1)

Country Link
CN (1) CN108920456B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614626A (en) * 2018-12-21 2019-04-12 北京信息科技大学 Automatic keyword extraction method based on gravitational model
CN111435405A (en) * 2019-01-15 2020-07-21 北京行数通科技有限公司 Method and device for automatically labeling key sentences of article
CN109960724B (en) * 2019-03-13 2021-06-04 北京工业大学 Text summarization method based on TF-IDF
CN110134767B (en) * 2019-05-10 2021-07-23 云知声(上海)智能科技有限公司 Screening method of vocabulary
CN110147425B (en) * 2019-05-22 2021-04-06 华泰期货有限公司 Keyword extraction method and device, computer equipment and storage medium
CN110377724A (en) * 2019-07-01 2019-10-25 厦门美域中央信息科技有限公司 A kind of corpus keyword Automatic algorithm based on data mining
CN111552786B (en) * 2020-04-16 2021-07-09 重庆大学 Question-answering working method based on keyword extraction
CN111597793B (en) * 2020-04-20 2023-06-16 中山大学 Paper innovation measuring method based on SAO-ADV structure
CN111680509A (en) * 2020-06-10 2020-09-18 四川九洲电器集团有限责任公司 Method and device for automatically extracting text keywords based on co-occurrence language network
CN111985217B (en) * 2020-09-09 2022-08-02 吉林大学 Keyword extraction method, computing device and readable storage medium
CN114626361A (en) * 2020-12-10 2022-06-14 广州视源电子科技股份有限公司 Sentence making method, sentence making model training method and device and computer equipment
CN112988951A (en) * 2021-03-16 2021-06-18 福州数据技术研究院有限公司 Scientific research project review expert accurate recommendation method and storage device
CN113191145B (en) * 2021-05-21 2023-08-11 百度在线网络技术(北京)有限公司 Keyword processing method and device, electronic equipment and medium
CN113657113A (en) * 2021-08-24 2021-11-16 北京字跳网络技术有限公司 Text processing method and device and electronic equipment
CN113743090B (en) * 2021-09-08 2024-04-12 度小满科技(北京)有限公司 Keyword extraction method and device
CN113971216B (en) * 2021-10-22 2023-02-03 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and memory
CN114492433A (en) * 2022-01-27 2022-05-13 南京烽火星空通信发展有限公司 Method for automatically selecting proper keyword combination to extract text

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110004465A1 (en) * 2009-07-02 2011-01-06 Battelle Memorial Institute Computation and Analysis of Significant Themes

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929937A (en) * 2012-09-28 2013-02-13 福州博远无线网络科技有限公司 Text-subject-model-based data processing method for commodity classification
CN104281645A (en) * 2014-08-27 2015-01-14 北京理工大学 Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency
CN105843795A (en) * 2016-03-21 2016-08-10 华南理工大学 Topic model based document keyword extraction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Automatic keyword extraction based on character co-occurrence frequency; Du Yuncheng et al.; Journal of Beijing Information Science and Technology University; 2011-12-31; Vol. 26, No. 6; pp. 1-3 *

Also Published As

Publication number Publication date
CN108920456A (en) 2018-11-30

Similar Documents

Publication Publication Date Title
CN108920456B (en) Automatic keyword extraction method
CN107229610B (en) A kind of analysis method and device of affection data
US9317498B2 (en) Systems and methods for generating summaries of documents
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
US9626358B2 (en) Creating ontologies by analyzing natural language texts
Beeferman et al. Statistical models for text segmentation
US9892111B2 (en) Method and device to estimate similarity between documents having multiple segments
US20170293687A1 (en) Evaluating text classifier parameters based on semantic features
US8645418B2 (en) Method and apparatus for word quality mining and evaluating
EP3086239A1 (en) Scenario generation device and computer program therefor
US20170293607A1 (en) Natural language text classification based on semantic features
CN107463548B (en) Phrase mining method and device
US9235573B2 (en) Universal difference measure
EP3086237A1 (en) Phrase pair gathering device and computer program therefor
CN114065758B (en) Document keyword extraction method based on hypergraph random walk
EP3086240A1 (en) Complex predicate template gathering device, and computer program therefor
CN113988053A (en) Hot word extraction method and device
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Kotenko et al. Evaluation of text classification techniques for inappropriate web content blocking
KR102376489B1 (en) Text document cluster and topic generation apparatus and method thereof
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN111899832B (en) Medical theme management system and method based on context semantic analysis
CN115455975A (en) Method and device for extracting topic keywords based on multi-model fusion decision
Mendels et al. Collecting code-switched data from social media
CN114548113A (en) Event-based reference resolution system, method, terminal and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant