CN110427626A

CN110427626A - Keyword extraction method and device

Info

Publication number: CN110427626A
Application number: CN201910703459.0A
Authority: CN
Inventors: 崔峭
Original assignee: Beijing Mininglamp Software System Co ltd
Current assignee: Beijing Mininglamp Software System Co ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2019-11-08
Anticipated expiration: 2039-07-31
Also published as: CN110427626B

Abstract

The present invention provides a keyword extraction method and device. Specifically, the method comprises: performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; performing semantic analysis on each word in the text to determine a second weight value corresponding to each word; adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word from the adjusted TF value; and extracting, according to the third weight value, specified words from the words as keywords for retrieval. The invention addresses two problems: TF-IDF calculation depends on multiple content-related documents and therefore cannot calculate word weights for a single text, and the TF-IDF method performs poorly on discrete data with a low degree of association. It thereby improves the precision of keyword extraction.

Description

Keyword extraction method and device

Technical field

The present invention relates to the field of communications, and in particular to a keyword extraction method and device.

Background technique

The most common retrieval systems today are keyword-based, and keyword extraction almost always relies on term frequency (TF) and inverse document frequency (IDF). However, TF-IDF calculation depends on multiple content-related documents and cannot calculate word weights for a single text, and the TF-IDF method performs poorly on discrete data with a low degree of association.

Summary of the invention

Embodiments of the present invention provide a keyword extraction method and device, so as to at least solve the problems in the related art that TF-IDF calculation depends on multiple content-related documents and cannot calculate word weights for a single text, and that the TF-IDF method performs poorly on discrete data with a low degree of association.

According to one embodiment of the present invention, a keyword extraction method is provided, comprising: performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; performing semantic analysis on each word in the text to determine a second weight value corresponding to each word; adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word from the adjusted TF value; and extracting, according to the third weight value, specified words from the words as keywords for retrieval.

Optionally, before performing semantic analysis on the words in the text, the method further comprises: performing word segmentation on the text according to preset rules, and determining the part of speech of each word according to the relationships between the segmented words.

Optionally, performing semantic analysis on each word in the text to determine the second weight value corresponding to a specified word comprises: sorting the words according to a preset part-of-speech priority rule; and assigning the corresponding second weight value to each word according to the part-of-speech priority order.

Optionally, adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value further comprises: obtaining the TF value of each word; and multiplying the TF value by the first weight value and the second weight value to determine the adjusted TF value.

Optionally, calculating the third weight value of each word from the adjusted TF value comprises: obtaining the IDF value of each word; and determining the third weight value according to the adjusted TF value of each word and the IDF value of each word.
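Taken together, the two optional steps above can be written compactly as follows. This is only one reading: the symbols are introduced here for illustration, with $w_1(t)$ the first weight value of the content in which word $t$ occurs and $w_2(t)$ the second weight value of $t$. The product form of the third weight value $w_3(t)$ is an assumption based on the TF-IDF calculation used in the scenario of Embodiment 1; the text itself only says that the third weight value is determined according to the adjusted TF value and the IDF value.

```latex
\mathrm{TF}_{\mathrm{adj}}(t) = \mathrm{TF}(t)\cdot w_1(t)\cdot w_2(t),
\qquad
w_3(t) = \mathrm{TF}_{\mathrm{adj}}(t)\cdot \mathrm{IDF}(t)
```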

Optionally, extracting, according to the third weight value, specified words from the words as keywords for retrieval further comprises: removing the words whose third weight value is less than a preset weight threshold, and taking the remaining words as the specified words.

Optionally, the content type includes at least one of the following: a text content type into which the text is divided according to a preset text format, the location type of a paragraph in the text, and the location type of a sentence in the text.

According to another embodiment of the present invention, a keyword extraction device is provided, comprising: a first determining module, configured to perform text structure analysis on an input text to determine a first weight value corresponding to each content type in the text; a second determining module, configured to perform semantic analysis on each word in the text to determine a second weight value corresponding to each word; an adjustment module, configured to adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and to calculate a third weight value of each word from the adjusted TF value; and an extraction module, configured to extract, according to the third weight value, specified words from the words as keywords for retrieval.

According to still another embodiment of the present invention, a storage medium is further provided. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.

According to still another embodiment of the present invention, an electronic device is further provided, comprising a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.

With the present invention, the TF of each word is adjusted using the structural analysis result of the text and the semantic analysis result of the word. This solves the problems in the related art that TF-IDF calculation depends on multiple content-related documents and cannot calculate word weights for a single text, and that the TF-IDF method performs poorly on discrete data with a low degree of association, and it improves the precision of keyword extraction.

Brief description of the drawings

The drawings described herein are provided for further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of keyword extraction according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a test text according to an embodiment of the present invention;

Fig. 3 is a diagram of an extraction result according to an embodiment of the present invention;

Fig. 4 is a structural block diagram of a keyword extraction device according to an embodiment of the present invention.

Specific embodiment

The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, provided there is no conflict, the embodiments of this application and the features in the embodiments may be combined with one another.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence.

Embodiment 1

This embodiment provides a keyword extraction method. Fig. 1 is a flowchart of keyword extraction according to an embodiment of the present invention. As shown in Fig. 1, the process includes the following steps:

Step S102: perform text structure analysis on the input text to determine the first weight value corresponding to each content type in the text;

Step S104: perform semantic analysis on each word in the text to determine the second weight value corresponding to each word;

Step S106: adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculate the third weight value of each word from the adjusted TF value;

Step S108: according to the third weight value, extract specified words from the words as keywords for retrieval.

Specifically, the text content types into which the text is divided according to a preset text format refer to the individual content parts of the text. For a patent application, for example, the abstract, the abstract drawing, the claims, the description, and the drawings of the description can serve as one way of dividing the text into content types. For the description itself, the content types can be divided into technical field, background, summary of the invention, description of the drawings, and specific embodiments.

Specifically, the location type of a paragraph in the text refers to the position of the paragraph within the text content: for example, whether it is in the technical field, in the summary of the invention, or in the description of the drawings, and also whether it is the first paragraph, the last paragraph, or a paragraph somewhere in between.

Specifically, the location type of a sentence in the text is similar to the location type of a paragraph. It refers to the position of the sentence within its paragraph or within the text content: for example, whether it is in the technical field, in the summary of the invention, or in the description of the drawings, and also whether it is the first sentence, the last sentence, or a middle sentence of a paragraph.

The above description is illustrative only; any representation of content types based on the above idea falls within the protection scope of this embodiment.

Specifically, when determining the first weight value corresponding to each content type in the text, take a patent application as an example. The content of the specific embodiments is clearly more important than that of the other four parts of the description, so the specific embodiments, and the paragraphs and sentences within them, can be assigned a higher weight than other content types. Within the specific embodiments, the content of the first paragraph or the first few paragraphs is often the most important, so the first paragraph or first few paragraphs are assigned a higher weight than other paragraphs. Within each paragraph, the first or last sentence usually states the conclusion, so the first and last sentences are assigned a higher weight than the other sentences in the paragraph.
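The positional weighting described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the source gives no numeric weights, so every constant below is hypothetical, as are the names `SentencePosition` and `first_weight`. Combining the three factors by multiplication, and applying the leading-paragraph bonus in every section, are also assumptions.

```python
from dataclasses import dataclass

# Hypothetical values: the source only says "higher than other content types".
SECTION_WEIGHT = {"specific_embodiments": 2.0}  # unlisted sections default to 1.0
LEADING_PARAGRAPHS = 2         # how many opening paragraphs count as "first few"
LEADING_PARAGRAPH_BONUS = 1.5  # multiplier for those opening paragraphs
EDGE_SENTENCE_BONUS = 1.2      # multiplier for first/last sentence of a paragraph


@dataclass(frozen=True)
class SentencePosition:
    section: str                 # text content type, e.g. "background"
    paragraph_index: int         # 0-based index of the paragraph in its section
    sentence_index: int          # 0-based index of the sentence in its paragraph
    sentences_in_paragraph: int  # total sentences in that paragraph


def first_weight(pos: SentencePosition) -> float:
    """Return an illustrative first weight value for one sentence."""
    weight = SECTION_WEIGHT.get(pos.section, 1.0)
    if pos.paragraph_index < LEADING_PARAGRAPHS:
        weight *= LEADING_PARAGRAPH_BONUS
    is_first = pos.sentence_index == 0
    is_last = pos.sentence_index == pos.sentences_in_paragraph - 1
    if is_first or is_last:
        weight *= EDGE_SENTENCE_BONUS
    return weight
```

With these placeholder constants, the opening sentence of the first paragraph of the specific embodiments receives the largest weight, while a middle sentence in a late paragraph of any other section keeps the default of 1.0.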

Specifically, semantic analysis can, on the one hand, pick out the principal entities that the article discusses and, on the other hand, remove the parts of a sentence that are not needed. For example, from "Xiao Ming is a Chinese person", the subject "Xiao Ming" and the predicative "Chinese person" can be extracted. Key entities such as core phrases and the subject and object of a sentence are then assigned a higher weight, and the weight of all other words is simply set to 1. As a further example, quantifiers and adjectives are sometimes also fairly important, so they can be assigned a weight that is lower than that of key entities such as core phrases, subjects, and objects but higher than that of other words.
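The role-based weighting above can be sketched as a simple lookup. The numeric values are hypothetical (the source fixes only the default of 1 and the relative ordering), and the function assumes that an upstream parser, not shown here, has already labeled each word with a role. Keeping the largest weight when a word occurs in several roles is likewise an assumption.

```python
# Hypothetical weights: key entities > quantifiers/adjectives > other words (1).
ROLE_WEIGHT = {
    "subject": 2.0,
    "object": 2.0,
    "core_phrase": 2.0,
    "quantifier": 1.5,
    "adjective": 1.5,
}
DEFAULT_WEIGHT = 1.0  # the source sets the weight of all other words to 1


def second_weights(tagged_words):
    """Map each word to an illustrative second weight value.

    tagged_words: iterable of (word, role) pairs from an upstream parser.
    """
    weights = {}
    for word, role in tagged_words:
        candidate = ROLE_WEIGHT.get(role, DEFAULT_WEIGHT)
        # Assumption: a word seen in several roles keeps its largest weight.
        weights[word] = max(candidate, weights.get(word, 0.0))
    return weights
```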

It should be pointed out that the purpose of the preset weight threshold is to keep an excessive number of output results from affecting the subsequent retrieval. If too many keywords are searched in a knowledge graph, the excess of primary keys may cause either too much information to be returned or, because the constraints are too restrictive, no information to be returned at all.
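A minimal sketch of the threshold filter follows. The function name `select_keywords` is hypothetical, and returning the surviving words in descending order of weight is a presentation choice not stated in the source.

```python
def select_keywords(third_weights, threshold):
    """Drop words whose third weight value is below the preset threshold.

    third_weights: dict mapping word -> third weight value.
    Returns the surviving words, highest weight first.
    """
    kept = {word: w for word, w in third_weights.items() if w >= threshold}
    return sorted(kept, key=lambda word: kept[word], reverse=True)
```

Raising the threshold shortens the keyword list passed on to retrieval, which is the effect the paragraph above is after.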

To aid understanding of the technical solution described in this embodiment, the following scenario is also provided.

Fig. 2 is a schematic diagram of a test text according to an embodiment of the present invention. As shown in Fig. 2:

Step 1: Perform text structure analysis on the input test text. Based on the analysis, line 1 and line 31 are assigned the highest weight value; line 7 and line 9 are assigned a weight value of 0 (equivalent to removing the content of lines 7 and 9); and the sentences in the other lines are assigned weight values lower than those of lines 1 and 31.

Step 2: Perform semantic analysis on each word in the text, including word segmentation, part-of-speech tagging, stop-word removal, and syntactic analysis. For example, from the sentence in line 2, "Many people really remember artificial intelligence because of the film 'Artificial Intelligence' directed by Steven Spielberg in 2001", the subjects "people" and "Steven Spielberg", the predicates "remember" and "direct", and the object "artificial intelligence" can be extracted. The core subjects and objects, "people", "Steven Spielberg", and "artificial intelligence", are assigned the highest weight value, which is greater than 1. The predicates "remember" and "direct" are assigned a weight value greater than 1 but less than that of the core subjects and objects. The weight of all other words is simply set to 1.

Step 3: Adjust the term frequency (TF) value of each word according to the first weight value and the second weight value. The TF value of each word in each sentence is calculated and then multiplied by the weight values obtained in step 1 and step 2 to obtain the adjusted TF value of each word.
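Step 3 can be sketched as follows, under stated assumptions: TF is taken as a word's count divided by the sentence length, and the per-sentence results for a word are summed into one adjusted TF value. The source specifies only the multiplication, so both choices, and the function name `adjusted_tf`, are illustrative.

```python
from collections import Counter, defaultdict


def adjusted_tf(sentences, sentence_weights, word_weights):
    """Compute an illustrative adjusted TF value for each word.

    sentences: list of token lists, one per sentence.
    sentence_weights: first weight value of each sentence (same order).
    word_weights: dict word -> second weight value (missing words count as 1).
    """
    totals = defaultdict(float)
    for tokens, w1 in zip(sentences, sentence_weights):
        if not tokens or w1 == 0:
            continue  # a first weight of 0 is equivalent to removing the sentence
        for word, count in Counter(tokens).items():
            tf = count / len(tokens)  # per-sentence term frequency
            totals[word] += tf * w1 * word_weights.get(word, 1.0)
    return dict(totals)
```

A sentence with first weight 0 contributes nothing, which matches the treatment of lines 7 and 9 in step 1.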

Step 4: Calculate the TF-IDF value of each word (where the IDF value is zero, the word weight is used directly as the TF-IDF value). Once the TF-IDF value of each word has been obtained, it can be used directly, or a sigmoid function can be applied to transform the TF-IDF values nonlinearly and the results can then be normalized. This yields the keyword weight value of each word.
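One possible reading of step 4 is sketched below. Two points are assumptions rather than statements of the source: "the word weight" used when the IDF value is zero is interpreted as the adjusted TF value from step 3, and normalization is done by dividing by the largest sigmoid output. The function name `third_weights` is hypothetical.

```python
import math


def third_weights(adj_tf, idf):
    """Compute illustrative keyword weights from adjusted TF and IDF.

    adj_tf: dict word -> adjusted TF value.
    idf: dict word -> IDF value; missing or zero means no usable IDF.
    """
    raw = {}
    for word, tf in adj_tf.items():
        word_idf = idf.get(word, 0.0)
        # Assumed fallback: with no IDF, use the adjusted TF value directly.
        raw[word] = tf if word_idf == 0.0 else tf * word_idf
    # Nonlinear transformation with the logistic sigmoid.
    squashed = {w: 1.0 / (1.0 + math.exp(-v)) for w, v in raw.items()}
    # Assumed normalization: scale so that the largest weight equals 1.
    top = max(squashed.values(), default=1.0)
    return {w: v / top for w, v in squashed.items()}
```

Because the fallback never multiplies by zero, a single text with no comparison corpus still yields usable weights, which is the single-text case the invention targets.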

Step 5: Compare the keyword weight value of each word with the preset threshold to filter out the result shown in Fig. 3. Fig. 3 is a diagram of an extraction result according to an embodiment of the present invention. As shown in Fig. 3, the final extraction result is "artificial intelligence", "mankind", "robot", "machine", and "law". Of these, "law" has the highest weight and "robot" has the lowest.

Step 6: Perform targeted retrieval according to the output result in Fig. 3.

From the above description of the embodiments, those skilled in the art can clearly understand that the method of the above embodiment can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. On this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.

Embodiment 2

This embodiment further provides a device for implementing the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 4 is a structural block diagram of a keyword extraction device according to an embodiment of the present invention. As shown in Fig. 4, the device includes:

a first determining module 42, configured to perform text structure analysis on the input text to determine the first weight value corresponding to each content type in the text;

a second determining module 44, configured to perform semantic analysis on each word in the text to determine the second weight value corresponding to each word;

an adjustment module 46, configured to adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and to calculate the third weight value of each word from the adjusted TF value;

an extraction module 48, configured to extract, according to the third weight value, specified words from the words as keywords for retrieval.

It should be noted that each of the above modules can be implemented in software or hardware. In the latter case, this can be done in the following ways, among others: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.

Embodiment 3

An embodiment of the present invention further provides a storage medium. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.

Optionally, in this embodiment, the storage medium may be configured to store a computer program for executing the following steps:

S1: perform text structure analysis on the input text to determine the first weight value corresponding to each content type in the text;

S2: perform semantic analysis on each word in the text to determine the second weight value corresponding to each word;

S3: adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculate the third weight value of each word from the adjusted TF value.

Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store a computer program.

An embodiment of the present invention further provides an electronic device, comprising a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.

Optionally, the electronic device may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by means of the computer program:

Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be executed in an order different from the one given here, or the modules or steps can each be made into an individual integrated circuit module, or several of them can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be modified and varied in many ways. Any modification, equivalent replacement, improvement, or the like made within the principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A keyword extraction method, characterized by comprising:

performing text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

performing semantic analysis on each word in the text to determine a second weight value corresponding to each word;

adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value, and calculating a third weight value of each word from the adjusted TF value;

extracting, according to the third weight value, specified words from the words as keywords for retrieval.

2. The method according to claim 1, characterized in that, before performing semantic analysis on the words in the text, the method further comprises:

performing word segmentation on the text according to preset rules, and

determining the part of speech of each word according to the relationships between the segmented words.

3. The method according to claim 2, characterized in that performing semantic analysis on each word in the text to determine the second weight value corresponding to a specified word comprises:

sorting the words according to a preset part-of-speech priority rule;

assigning the corresponding second weight value to each word according to the part-of-speech priority order.

4. The method according to claim 1, characterized in that adjusting the term frequency (TF) value of each word according to the first weight value and the second weight value further comprises:

obtaining the TF value of each word;

multiplying the TF value by the first weight value and the second weight value to determine the adjusted TF value.

5. The method according to claim 1, characterized in that calculating the third weight value of each word from the adjusted TF value comprises:

obtaining the IDF value of each word;

determining the third weight value according to the adjusted TF value of each word and the IDF value of each word.

6. The method according to claim 1, characterized in that extracting, according to the third weight value, specified words from the words as keywords for retrieval further comprises:

removing the words whose third weight value is less than a preset weight threshold, and taking the remaining words as the specified words.

7. The method according to claim 1, characterized in that the content type includes at least one of the following:

a text content type into which the text is divided according to a preset text format, the location type of a paragraph in the text, and the location type of a sentence in the text.

8. A keyword extraction device, characterized by comprising:

a first determining module, configured to perform text structure analysis on an input text to determine a first weight value corresponding to each content type in the text;

a second determining module, configured to perform semantic analysis on each word in the text to determine a second weight value corresponding to each word;

an adjustment module, configured to adjust the term frequency (TF) value of each word according to the first weight value and the second weight value, and to calculate a third weight value of each word from the adjusted TF value;

an extraction module, configured to extract, according to the third weight value, specified words from the words as keywords for retrieval.

9. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to execute, when run, the method according to any one of claims 1 to 7.

10. An electronic device, comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to execute the method according to any one of claims 1 to 7.