CN107357777A

CN107357777A - Method and apparatus for extracting label information

Info

Publication number: CN107357777A
Application number: CN201710459445.XA
Authority: CN
Inventors: 王萌萌; 晋耀红; 蒋宏飞; 杨凯程
Original assignee: China Science And Technology (beijing) Co Ltd; Beijing Shenzhou Taiyue Software Co Ltd
Current assignee: Dingfu Intelligent Technology Co., Ltd
Priority date: 2017-06-16
Filing date: 2017-06-16
Publication date: 2017-11-17
Anticipated expiration: 2037-06-16
Also published as: CN107357777B

Abstract

The invention discloses a method and apparatus for extracting label information, belonging to the field of Internet technology. The method includes: segmenting text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword; for each candidate phrase in the candidate phrase set, determining the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to these three scores; based on the score of each candidate phrase, selecting a preset number of highest-scoring candidate phrases from the candidate phrase set; and forming the label information of the text information from the preset number of candidate phrases. The invention improves the accuracy of the label information.

Description

Method and apparatus for extracting label information

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and apparatus for extracting label information.

Background art

With the development of Internet technology, the number of Internet users in China is growing rapidly, and the volume of text information on the network is growing explosively. However, text information on the network is redundant, and a user can grasp the gist of a piece of text information only after reading its full content. From the user's perspective, quickly selecting the text information of interest saves considerable reading time. Therefore, a server needs to extract label information from text information, so that the user can select and read the text information of interest according to the label information, thereby improving reading efficiency.

In the prior art, when a server extracts label information from text information, the server uses a regular expression to extract, from the text information, words of a specific type corresponding to the regular expression. The server performs word segmentation on the text information to obtain a first keyword set, and combines the first keywords in the first keyword set to obtain a second keyword set. The server screens each second keyword in the second keyword set based on preset rules to obtain a third keyword set, and adds the words of the specific type corresponding to the regular expression to the third keyword set as third keywords. The server calculates the feature value of each third keyword in the third keyword set and, based on these feature values, calculates the score of each third keyword. Based on the score of each third keyword in the third keyword set, the server extracts target keywords from the third keyword set and forms the label information from the extracted target keywords.

In the course of implementing the present invention, the inventors found that the prior art has at least the following problem:

The above method scores the third keywords based on their feature values, but some third keywords that are not semantically acceptable are very likely to obtain high scores. The label information extracted in this way cannot accurately express the gist of the text information, resulting in low accuracy.

Summary of the invention

To solve the problems of the prior art, the invention provides a method and apparatus for extracting label information. The technical solution is as follows:

In a first aspect, the invention provides a method for extracting label information, the method including:

Segment text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase, and each candidate phrase includes at least one keyword;

For each candidate phrase in the candidate phrase set, determine the score of the sentence in which the candidate phrase is located, determine the score of the position of that sentence, determine the first score of each keyword included in the candidate phrase, and determine the score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase;

Based on the score of each candidate phrase, select a preset number of highest-scoring candidate phrases from the candidate phrase set;

Form the label information of the text information from the preset number of candidate phrases.

In a possible implementation, determining the first score of each keyword included in the candidate phrase includes:

For each keyword, determine a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information;

Determine the word frequency of the keyword according to the first occurrence count and the second occurrence count;

Determine a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information included in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword;

Determine the inverse document frequency of the keyword according to the first quantity and the second quantity;

Determine the first score of the keyword according to the word frequency and the inverse document frequency.
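The first-score computation described above resembles TF-IDF. The following is a minimal Python sketch under stated assumptions: the text is already tokenized into keywords, the inverse document frequency takes the common smoothed form log(N / (1 + n)), and the first score is the product of the two quantities. The passage above does not give these formulas, so the function names, the smoothing term, and the product are illustrative only.

```python
import math
from collections import Counter


def term_frequency(keyword, keywords_in_text):
    """First occurrence count divided by the total occurrence count of all keywords."""
    counts = Counter(keywords_in_text)
    total = sum(counts.values())  # second occurrence count
    return counts[keyword] / total if total else 0.0


def inverse_document_frequency(keyword, sample_library):
    """Assumed form: log(first quantity / (1 + second quantity)).

    sample_library is a list of keyword sets, one per sample text.
    """
    first_quantity = len(sample_library)
    if first_quantity == 0:
        return 0.0
    second_quantity = sum(1 for doc in sample_library if keyword in doc)
    return math.log(first_quantity / (1 + second_quantity))


def first_score(keyword, keywords_in_text, sample_library):
    """Assumed combination: word frequency multiplied by inverse document frequency."""
    tf = term_frequency(keyword, keywords_in_text)
    idf = inverse_document_frequency(keyword, sample_library)
    return tf * idf
```

With this smoothing, a keyword that appears in nearly every sample text can receive a zero or negative inverse document frequency; whether that is desirable depends on the formula actually used.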

In a possible implementation, determining the score of the position of the sentence in which the candidate phrase is located includes:

Determine a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph;

Determine the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.

In a possible implementation, determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase includes:

Determine a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase;

For each keyword included in the candidate phrase, multiply the score of the sentence in which the candidate phrase is located by the first weight to obtain a first value, multiply the score of the position of that sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain the second score of the keyword;

Determine the score of the candidate phrase according to the second score of each keyword.

In a possible implementation, adding the first value, the second value, and the third value to obtain the second score of the keyword includes:

Determine the contribution degree of the keyword and, according to the contribution degree, determine a fourth weight corresponding to the keyword;

Add the first value, the second value, and the third value to obtain a fourth value;

Multiply the fourth value by the fourth weight to obtain the second score of the keyword.
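The weighted combination described above can be sketched in Python as follows. The parameter names are hypothetical, and the text does not state here how the per-keyword second scores are merged into a phrase score, so the arithmetic mean in `phrase_score` is purely an assumption.

```python
def keyword_second_score(sentence_score, position_score, keyword_first_score,
                         w_sentence, w_position, w_keyword, w_contribution=1.0):
    """Weighted sum of the three scores, scaled by the optional fourth weight."""
    first_value = sentence_score * w_sentence
    second_value = position_score * w_position
    third_value = keyword_first_score * w_keyword
    fourth_value = first_value + second_value + third_value
    return fourth_value * w_contribution


def phrase_score(second_scores):
    """Assumption: the phrase score is the mean of its keywords' second scores."""
    return sum(second_scores) / len(second_scores) if second_scores else 0.0
```

Leaving `w_contribution` at 1.0 reproduces the simpler variant in which the second score is just the sum of the three weighted values.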

In a possible implementation, forming the label information of the text information from the preset number of candidate phrases includes:

Select candidate phrases of the concept type from the preset number of candidate phrases to form concept label information; and/or

Select candidate phrases of the event type from the preset number of candidate phrases to form event label information.

In a possible implementation, the method further includes:

Move candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information;

Move candidate phrases in the event label information that do not include a keyword of the preset part of speech into the concept label information.
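The two moving rules above can be illustrated with a short Python sketch. It assumes each candidate phrase is a list of (keyword, part-of-speech) pairs and that the preset part of speech is the verb, tagged "v"; the text names neither the preset part of speech nor a tag set, so both are assumptions, as is the function name.

```python
PRESET_POS = "v"  # assumption: the preset part of speech is "verb"


def rebalance_labels(concept_labels, event_labels, preset_pos=PRESET_POS):
    """Each label is a list of (keyword, part_of_speech) pairs."""
    new_concepts, new_events = [], []
    for phrase in concept_labels:
        # A concept phrase ending in a keyword of the preset part of speech
        # is moved to the event labels.
        ends_with_preset = bool(phrase) and phrase[-1][1] == preset_pos
        (new_events if ends_with_preset else new_concepts).append(phrase)
    for phrase in event_labels:
        # An event phrase containing no keyword of the preset part of speech
        # is moved to the concept labels.
        has_preset = any(pos == preset_pos for _, pos in phrase)
        (new_events if has_preset else new_concepts).append(phrase)
    return new_concepts, new_events
```

Both rules are evaluated against the original lists, so a phrase is moved at most once.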

In a possible implementation, segmenting the text information to obtain the candidate phrase set includes:

Split the text information into sentences to obtain at least one candidate sentence, and form a candidate sentence set from the at least one candidate sentence;

Segment each candidate sentence in the candidate sentence set to obtain at least one keyword, and form a keyword set from the at least one keyword;

Generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;

Form the candidate phrase set from the at least one candidate phrase.

In a possible implementation, before segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:

Determine the sentence component of each candidate sentence in the candidate sentence set;

According to the sentence component of each candidate sentence, delete from the candidate sentence set the candidate sentences whose sentence component is a preset component.

In a second aspect, the invention provides an apparatus for extracting label information, the apparatus including:

A segmentation module, configured to segment text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase, and each candidate phrase includes at least one keyword;

A scoring module, configured to, for each candidate phrase in the candidate phrase set, determine the score of the sentence in which the candidate phrase is located, determine the score of the position of that sentence, determine the first score of each keyword included in the candidate phrase, and determine the score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase;

A selection module, configured to select, based on the score of each candidate phrase, a preset number of highest-scoring candidate phrases from the candidate phrase set;

A composition module, configured to form the label information of the text information from the preset number of candidate phrases.

In a possible implementation, the scoring module is further configured to: for each keyword, determine a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information; determine the word frequency of the keyword according to the first occurrence count and the second occurrence count; determine a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information included in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword; determine the inverse document frequency of the keyword according to the first quantity and the second quantity; and determine the first score of the keyword according to the word frequency and the inverse document frequency.

In a possible implementation, the scoring module is further configured to determine a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph; and to determine the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.

In a possible implementation, the scoring module is further configured to: determine a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiply the score of the sentence in which the candidate phrase is located by the first weight to obtain a first value, multiply the score of the position of that sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain the second score of the keyword; and determine the score of the candidate phrase according to the second score of each keyword.

In a possible implementation, the scoring module is further configured to: determine the contribution degree of the keyword and, according to the contribution degree, determine a fourth weight corresponding to the keyword; add the first value, the second value, and the third value to obtain a fourth value; and multiply the fourth value by the fourth weight to obtain the second score of the keyword.

In a possible implementation, the composition module is further configured to select candidate phrases of the concept type from the preset number of candidate phrases to form concept label information, and to select candidate phrases of the event type from the preset number of candidate phrases to form event label information.

In a possible implementation, the apparatus further includes:

A moving module, configured to move candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information; and/or

The moving module is further configured to move candidate phrases in the event label information that do not include a keyword of the preset part of speech into the concept label information.

In a possible implementation, the segmentation module is further configured to: split the text information into sentences to obtain at least one candidate sentence, and form a candidate sentence set from the at least one candidate sentence; segment each candidate sentence in the candidate sentence set to obtain at least one keyword, and form a keyword set from the at least one keyword; generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and form the candidate phrase set from the at least one candidate phrase.

In a possible implementation, the segmentation module is further configured to determine the sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, delete from the candidate sentence set the candidate sentences whose sentence component is a preset component.

In the embodiments of the present invention, text information is segmented to obtain a candidate phrase set that includes at least one candidate phrase, and label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-word label information can be extracted. In addition, for each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined according to these three scores. Because the sentence score, the position score, and the keyword score are combined, the score of each candidate phrase is determined more accurately, which in turn improves the accuracy of the extracted label information.

Brief description of the drawings

Fig. 1 is a flowchart of a method for extracting label information according to an embodiment of the present invention;

Fig. 2 is a flowchart of a method for extracting label information according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an apparatus for extracting label information according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a server for extracting label information according to an embodiment of the present invention.

Detailed description

To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

An embodiment of the present invention provides a method for extracting label information. Referring to Fig. 1, the method includes:

Step 101: Segment text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase, and each candidate phrase includes at least one keyword.

Step 102: For each candidate phrase in the candidate phrase set, determine the score of the sentence in which the candidate phrase is located, determine the score of the position of that sentence, determine the first score of each keyword included in the candidate phrase, and determine the score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase.

Step 103: Based on the score of each candidate phrase, select a preset number of highest-scoring candidate phrases from the candidate phrase set.

Step 104: Form the label information of the text information from the preset number of candidate phrases.
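Steps 103 and 104 amount to a top-N selection over scored candidate phrases. A minimal Python sketch, assuming the phrase scores are already available in a dictionary; the function name and the data layout are hypothetical.

```python
import heapq


def select_labels(phrase_scores, preset_number):
    """Return the preset number of highest-scoring candidate phrases, best first."""
    top = heapq.nlargest(preset_number, phrase_scores.items(), key=lambda item: item[1])
    return [phrase for phrase, _ in top]
```

For example, `select_labels({"a": 0.2, "b": 0.9, "c": 0.5}, 2)` keeps the phrases "b" and "c", in that order.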

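In a possible implementation, determining the first score of each keyword included in the candidate phrase includes:

For each keyword, determine a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information;

Determine the word frequency of the keyword according to the first occurrence count and the second occurrence count;

Determine a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information included in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword;

Determine the inverse document frequency of the keyword according to the first quantity and the second quantity;

Determine the first score of the keyword according to the word frequency and the inverse document frequency.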

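In a possible implementation, determining the score of the position of the sentence in which the candidate phrase is located includes:

Determine a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in the paragraph;

Determine the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position.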

In a possible implementation, determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase includes:

Determine a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase;

For each keyword included in the candidate phrase, multiply the score of the sentence in which the candidate phrase is located by the first weight to obtain a first value, multiply the score of the position of that sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain the second score of the keyword;

Determine the score of the candidate phrase according to the second score of each keyword.

In a possible implementation, adding the first value, the second value, and the third value to obtain the second score of the keyword includes:

Determine the contribution degree of the keyword and, according to the contribution degree, determine a fourth weight corresponding to the keyword;

Add the first value, the second value, and the third value to obtain a fourth value;

Multiply the fourth value by the fourth weight to obtain the second score of the keyword.

In a possible implementation, forming the label information of the text information from the preset number of candidate phrases includes:

Select candidate phrases of the concept type from the preset number of candidate phrases to form concept label information; and/or

Select candidate phrases of the event type from the preset number of candidate phrases to form event label information.

In a possible implementation, the method further includes:

Move candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information;

Move candidate phrases in the event label information that do not include a keyword of the preset part of speech into the concept label information.

In a possible implementation, segmenting the text information to obtain the candidate phrase set includes:

Split the text information into sentences to obtain at least one candidate sentence, and form a candidate sentence set from the at least one candidate sentence;

Segment each candidate sentence in the candidate sentence set to obtain at least one keyword, and form a keyword set from the at least one keyword;

Generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;

Form the candidate phrase set from the at least one candidate phrase.

In a possible implementation, before segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:

Determine the sentence component of each candidate sentence in the candidate sentence set;

According to the sentence component of each candidate sentence, delete from the candidate sentence set the candidate sentences whose sentence component is a preset component.

An embodiment of the present invention provides a method for extracting label information. The method is applied to a server. Referring to Fig. 2, the method includes:

Step 201: The server segments text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase, and each candidate phrase includes at least one keyword.

To improve the user's reading efficiency, before the user obtains text information from the server through a terminal, the server extracts label information from the text information; the label information indicates the gist of the text information. When the user obtains the text information from the server through the terminal, the server sends the label information of the text information to the terminal. The terminal receives the label information sent by the server and displays it, so that the user can quickly grasp the gist of the text information from the label information. The text information may be any information that includes text, for example, electronic news, social network posts, product reviews, web pages, or e-mail. The embodiments of the present invention place no specific limitation on the text information.

When segmenting the text information to obtain the candidate phrase set, the server may segment the text information to obtain a keyword set and then generate the candidate phrase set based on the keyword set. Accordingly, this step can be implemented through the following steps (1) to (4):

(1): The server splits the text information into sentences to obtain at least one candidate sentence, and forms a candidate sentence set from the at least one candidate sentence.

For understanding an article, main clauses contribute far more than subordinate clauses. In addition, when candidate phrases are subsequently extracted by the syntax tree algorithm, the extraction speed depends largely on sentence length. To speed up the extraction of candidate phrases, the server filters out adverbial clauses. Accordingly, after forming the candidate sentence set from the at least one candidate sentence, the server further performs the following:

The server determines the sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, deletes from the candidate sentence set the candidate sentences whose sentence component is a preset component.

The preset component may be an adverbial, an attributive, or the like. The embodiments of the present invention place no specific limitation on the preset component.

(2): The server segments each candidate sentence in the candidate sentence set to obtain at least one keyword, and forms a keyword set from the at least one keyword.

Keywords such as " ", " ", " ", " ", and " " contribute little to the text information. Therefore, to reduce the amount of computation and improve accuracy, in this step the server may also remove keywords of this kind (" ", " ", " ", " ", " "). Accordingly, after forming the keyword set from the at least one keyword, the server further performs the following:

The server tags the part of speech of each keyword in the keyword set; according to the part of speech of each keyword, the server searches the keyword set for keywords of a first preset part of speech and removes them from the keyword set. The first preset part of speech may be auxiliary word, preposition, modal particle, numeral, or the like. The embodiments of the present invention place no specific limitation on the first preset part of speech.
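The part-of-speech filter described above can be sketched in a few lines of Python. The sketch assumes the keywords have already been tagged by some external tagger; the tag names in `FIRST_PRESET_POS` are hypothetical placeholders rather than the tag set of any particular tool.

```python
# Hypothetical tag names for the first preset parts of speech.
FIRST_PRESET_POS = {"auxiliary", "preposition", "modal_particle", "numeral"}


def remove_preset_pos(tagged_keywords, preset_pos=FIRST_PRESET_POS):
    """Drop every (keyword, part_of_speech) pair whose tag is in the preset set."""
    return [(word, pos) for word, pos in tagged_keywords if pos not in preset_pos]
```

Passing a different `preset_pos` set changes which parts of speech are removed without touching the function itself.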

(3): The server generates at least one candidate phrase from the keywords in the keyword set based on the syntax tree algorithm.

The server inputs each keyword in the keyword set into a syntax tree model that includes the syntax tree algorithm. Using the syntax tree algorithm, the keywords in the keyword set are built into a keyword tree. The keyword tree includes multiple nodes and the relations between them; each node in the keyword tree is a keyword. The server generates at least one candidate phrase based on the keyword tree.

The keyword tree contains keywords and the relations between keywords, and only one or more related keywords can form a candidate phrase. Accordingly, the step in which the server generates at least one candidate phrase based on the keyword tree may be:

For each leaf node in the keyword tree, the server selects from the keyword tree the keyword of the leaf node and the keyword of the leaf node's parent node, and forms a candidate phrase from them.

In this step, the server may also include the keyword of the parent node of the leaf node's parent node when generating a candidate phrase. Accordingly, the step in which the server generates at least one candidate phrase based on the keyword tree may be:

For each leaf node in the keyword tree, the server selects the keyword of the leaf node from the keyword tree, obtains the keyword of the leaf node's parent node, then the keyword of that parent node's parent node, and so on until the keyword of the root node is obtained; the server forms a candidate phrase from the obtained keywords.
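The leaf-to-root traversal described above can be sketched in Python as follows. The sketch assumes the keyword tree is given as a mapping from each keyword to its parent keyword (with `None` for the root) and that keywords are unique within the tree; how the syntax tree model builds the tree, and the order in which the collected keywords are arranged in the final phrase, are not specified in the text and are left open.

```python
def leaf_to_root_phrases(parent_of):
    """parent_of maps each keyword node to its parent keyword (None for the root).

    Returns one keyword list per leaf, ordered from the leaf up to the root.
    """
    parents = {p for p in parent_of.values() if p is not None}
    leaves = [node for node in parent_of if node not in parents]
    phrases = []
    for leaf in leaves:
        path, node = [], leaf
        while node is not None:
            path.append(node)
            node = parent_of[node]  # climb one level toward the root
        phrases.append(path)
    return phrases
```

Truncating each path after its second element gives the simpler variant in which a candidate phrase consists only of the leaf keyword and its parent keyword.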

In the embodiments of the present invention, candidate phrases are extracted by the syntax tree algorithm, which ensures that the extracted candidate phrases are multi-word and that the keywords within a candidate phrase are strongly coherent semantically, thereby improving the accuracy of the label information generated from the candidate phrases.

(4): The server forms the candidate phrase set from the at least one candidate phrase.

After obtaining the candidate phrase set, the server determines the score of each candidate phrase in the candidate phrase set through the following step 202.

Step 202: For each candidate phrase in the candidate phrase set, the server determines the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase, and determines the score of the candidate phrase according to these three scores.

For each candidate phrase in the candidate phrase set, the server can determine the score of the candidate phrase through the following steps (1) to (4).

(1): The server determines the score of the sentence in which the candidate phrase is located.

The server obtains the sentence in which the candidate phrase is located and determines the score of the sentence using the BM25 algorithm.
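The text names BM25 but does not say what serves as the query and what serves as the document collection. The Python sketch below is therefore only one possible reading: it scores a sentence by its mean Okapi BM25 relevance to the other sentences of the same text, using the non-negative IDF variant and the customary defaults k1 = 1.5 and b = 0.75. The function names, the choice of query and corpus, and the parameter values are all assumptions.

```python
import math


def bm25(query, doc, docs, k1=1.5, b=0.75):
    """Okapi BM25 relevance of tokenized doc to tokenized query within corpus docs."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    score = 0.0
    for term in set(query):
        freq = doc.count(term)
        if freq == 0:
            continue
        n_containing = sum(1 for d in docs if term in d)
        idf = math.log(1 + (n_docs - n_containing + 0.5) / (n_containing + 0.5))
        norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * freq * (k1 + 1) / norm
    return score


def sentence_score(index, sentences):
    """Assumption: a sentence's score is its mean BM25 relevance to the other sentences."""
    others = [s for i, s in enumerate(sentences) if i != index]
    if not others:
        return 0.0
    return sum(bm25(sentences[index], s, sentences) for s in others) / len(others)
```

Under this reading, a sentence that shares many informative keywords with the rest of the text scores high, while a sentence with no overlapping keywords scores zero.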

(2)：The scoring of the position of sentence where server determines the candidate phrase.

Because the scoring of the sentence of diverse location is different, for example, the sentence in summary can more embody the purport of text message, The corresponding higher scoring of sentence in summary；Sentence in text corresponds to relatively low scoring.Therefore, before this step, service Storage location and the corresponding relation of scoring in device.The position can be summary or text.Accordingly, this step can be：

Position of the sentence in text message where server determines the candidate phrase, according to the position, from position and comment The scoring of the position of sentence where the candidate phrase is obtained in the corresponding relation divided.

In this step, server can also calculate the candidate phrase institute according to the position of sentence where the candidate phrase Scoring in the position of sentence.Accordingly.This step can be：

Server determine where sentence where the candidate phrase it is short fall first position in text message, and, the time Select the second place of the phrase in the paragraph；According to first position and the second place, the position of sentence where determining the candidate phrase The scoring put.

The first position may be the title, the abstract, or the body. The second position may be the first sentence of the first paragraph, a non-first sentence of the first paragraph, the first sentence of a non-first paragraph, or a non-first sentence of a non-first paragraph. Accordingly, the step in which the server determines the score of the position of the sentence where the candidate phrase is located according to the first position and the second position may be:

According to the first position, the server obtains the first sub-score corresponding to the first position from a correspondence between first positions and sub-scores; according to the second position, the server obtains the second sub-score corresponding to the second position from a correspondence between second positions and sub-scores; the server then multiplies the first sub-score by the second sub-score to obtain the score of the position of the sentence where the candidate phrase is located.

For example, when the first position is the title, the first sub-score is 0.9; when the first position is the abstract, the first sub-score is 0.6; when the first position is the body, the first sub-score is 0.3. When the second position is the first sentence of the first paragraph, the second sub-score is 0.3; when the second position is a non-first sentence of the first paragraph, the second sub-score is 0.3*0.4=0.12; when the second position is the first sentence of a non-first paragraph, the second sub-score is 0.1; when the second position is a non-first sentence of a non-first paragraph, the second sub-score is 0.1*0.4=0.04.
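The position score described above can be sketched as a pair of table lookups followed by a multiplication. This is a minimal illustration, not the patented implementation: the table names, the position labels, and the function name are hypothetical, and the four second positions are read as first or non-first paragraph combined with first or non-first sentence. The numeric values are the example values from the text.

```python
# Hypothetical lookup tables filled with the example values from the text.
FIRST_SUB_SCORE = {"title": 0.9, "abstract": 0.6, "body": 0.3}

# Keys are (is_first_paragraph, is_first_sentence).
SECOND_SUB_SCORE = {
    (True, True): 0.3,          # first sentence of the first paragraph
    (True, False): 0.3 * 0.4,   # non-first sentence of the first paragraph
    (False, True): 0.1,         # first sentence of a non-first paragraph
    (False, False): 0.1 * 0.4,  # non-first sentence of a non-first paragraph
}


def position_score(first_position, is_first_paragraph, is_first_sentence):
    """Multiply the first sub-score by the second sub-score."""
    first_sub = FIRST_SUB_SCORE[first_position]
    second_sub = SECOND_SUB_SCORE[(is_first_paragraph, is_first_sentence)]
    return first_sub * second_sub
```

With these values, a sentence that opens the first paragraph of the abstract scores 0.6 * 0.3, while a mid-paragraph sentence deep in the body scores 0.3 * 0.04.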

(3): The server determines the first score of each keyword included in the candidate phrase.

For each keyword included in the candidate phrase, the server determines the first score of the keyword through the following steps (3-1) to (3-5):

(3-1): The server determines a first occurrence count and a second occurrence count. The first occurrence count is the number of times the keyword occurs in the text message, and the second occurrence count is the total number of occurrences of all keywords included in the text message.

(3-2): The server determines the term frequency of the keyword according to the first occurrence count and the second occurrence count.

The server takes the ratio of the first occurrence count to the second occurrence count as the term frequency of the keyword.

(3-3): The server determines a first quantity and a second quantity. The first quantity is the number of sample text messages included in a sample text message library, and the second quantity is the number of text messages in the sample text message library that include the keyword.

The sample text message library is stored in the server in advance and includes at least one sample text message. In this step, the server counts the sample text messages included in the sample text message library; for ease of description, this count is called the first quantity. The server also counts the text messages in the sample text message library that include the keyword; this count is called the second quantity.

(3-4): The server determines the inverse document frequency of the keyword according to the first quantity and the second quantity.

According to the first quantity and the second quantity, the server determines the inverse document frequency of the keyword by the following Formula 1:

In Formula 1, idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.

Because the sample text message library may contain no text message that includes the keyword, the second quantity may be zero. Therefore, this step may instead be:

According to the first quantity and the second quantity, the server determines the inverse document frequency of the keyword by the following Formula 2:

In Formula 2, idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.

(3-5): The server determines the first score of the keyword according to the term frequency and the inverse document frequency of the keyword.

According to the term frequency and the inverse document frequency of the keyword, the server determines the first score of the keyword by a first preset algorithm.

The first preset algorithm may be set and changed as needed and is not specifically limited in the embodiments of the present invention. For example, the first preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.

When the first preset algorithm is multiplication, this step may be:

The server multiplies the term frequency of the keyword by its inverse document frequency to obtain the first score of the keyword.
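Steps (3-1) to (3-5) can be sketched as follows for the case where the first preset algorithm is multiplication. The function names are hypothetical. The formulas for the inverse document frequency are not reproduced in this text, so the sketch assumes the conventional smoothed form log(D / (J + 1)) with a natural logarithm, which stays defined when the second quantity is zero; the source may use a different expression or base.

```python
import math


def term_frequency(first_count, second_count):
    """Ratio of the keyword's occurrences to the total keyword occurrences."""
    return first_count / second_count


def inverse_document_frequency(first_quantity, second_quantity):
    """Assumed smoothed IDF, log(D / (J + 1)); not taken from the source."""
    return math.log(first_quantity / (second_quantity + 1))


def first_score(first_count, second_count, first_quantity, second_quantity):
    """First score of a keyword when the first preset algorithm is multiplication."""
    tf = term_frequency(first_count, second_count)
    idf = inverse_document_frequency(first_quantity, second_quantity)
    return tf * idf
```

A keyword that occurs 3 times among 100 keyword occurrences, in a library of 1000 sample texts of which 9 contain it, would score 0.03 * log(100) under these assumptions.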

(4): The server determines the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase.

In this step, the server may determine the score of the candidate phrase directly from the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase; this is the first implementation below. Alternatively, the server may set a first weight, a second weight, and a third weight for the sentence where the candidate phrase is located, the position of that sentence, and each keyword included in the candidate phrase, respectively, and determine the score of the candidate phrase based on these three scores together with the first weight, the second weight, and the third weight; this is the second implementation below.

For the first implementation, this step may be carried out through the following steps (4-1) to (4-2):

(4-1): For each keyword included in the candidate phrase, the server determines a second score of the keyword by a second preset algorithm, according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword.

The second preset algorithm may be set and changed as needed and is not specifically limited in the embodiments of the present invention. For example, the second preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.

When the second preset algorithm is addition, this step may be:

The server adds the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword to obtain the second score of the keyword.

It should be noted that single-character words contribute little to understanding an article. Therefore, when determining the second score of a keyword that is a single-character word, the server reduces the weight of the keyword. Words in quotation marks that occur multiple times in the text message contribute more to understanding the article. Therefore, when determining the second score of a keyword that occurs multiple times in quotation marks, the server increases the weight of the keyword. Accordingly, this step may be:

For each keyword included in the candidate phrase, the server determines the contribution degree of the keyword; determines, according to the contribution degree, a fourth weight corresponding to the keyword; obtains a fifth value by the second preset algorithm, according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword; and multiplies the fifth value by the fourth weight to obtain the second score of the keyword.

The server determines the contribution degree of the keyword according to the number of characters in the keyword and/or the importance of the keyword. The specific process may be:

The server takes the number of characters in the keyword as the contribution degree of the keyword; the more characters, the higher the contribution degree. Or,

The server takes the importance of the keyword as the contribution degree of the keyword; the higher the importance, the higher the contribution degree. Or,

The server weights the number of characters and the importance of the keyword to obtain the contribution degree of the keyword.

The importance of the keyword may be represented by the number of occurrences of the keyword and/or by whether the keyword is highlighted, among other indicators.

The server stores a correspondence between contribution degrees and weights. Accordingly, the step in which the server determines the fourth weight corresponding to the keyword according to the contribution degree may be:

According to the contribution degree, the server obtains the fourth weight corresponding to the keyword from the correspondence between contribution degrees and weights.
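The weighted variant of step (4-1) can be sketched as below, assuming the second preset algorithm is addition. The function names and the concrete weight values (0.5, 1.0, 1.5) are hypothetical stand-ins for the stored correspondence between contribution degrees and weights; the source gives no numbers for the fourth weight.

```python
def fourth_weight(keyword, occurrences=1, quoted=False):
    """Hypothetical mapping from contribution degree to the fourth weight."""
    if len(keyword) == 1:
        return 0.5  # single-character word: reduce its weight
    if quoted and occurrences > 1:
        return 1.5  # quoted word that occurs repeatedly: increase its weight
    return 1.0


def second_score(sentence_score, position_score, first_score, weight=1.0):
    """Add the three scores (the fifth value), then apply the fourth weight."""
    fifth_value = sentence_score + position_score + first_score
    return fifth_value * weight
```

For instance, `second_score(0.2, 0.18, 0.1, fourth_weight("云"))` halves the sum 0.48 because the keyword is a single character.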

(4-2): The server determines the score of the candidate phrase according to the second score of each keyword included in the candidate phrase.

According to the second score of each keyword included in the candidate phrase, the server determines the score of the candidate phrase by a third preset algorithm.

The third preset algorithm may be set and changed as needed and is not specifically limited in the embodiments of the present invention. For example, the third preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, weighted division, or taking the maximum.

When the third preset algorithm is addition, this step may be:

The server adds the second scores of the keywords included in the candidate phrase to obtain the score of the candidate phrase.

When the third preset algorithm is taking the maximum, this step may be:

The server selects the maximum among the second scores of the keywords included in the candidate phrase and takes that maximum as the score of the candidate phrase.
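The two example choices for the third preset algorithm, addition and taking the maximum, can be sketched as a single function. The function name and the `method` parameter are hypothetical.

```python
def phrase_score(second_scores, method="sum"):
    """Combine the second scores of a phrase's keywords into a phrase score."""
    if method == "sum":
        return sum(second_scores)
    if method == "max":
        return max(second_scores)
    raise ValueError(f"unsupported method: {method}")
```

Addition favors longer phrases, since every keyword contributes, whereas taking the maximum scores a phrase by its single strongest keyword.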

For the second implementation, this step may be carried out through the following steps (4-a) to (4-c):

(4-a): The server determines the first weight corresponding to the sentence where the candidate phrase is located, the second weight corresponding to the position of that sentence, and the third weight corresponding to each keyword included in the candidate phrase.

The server stores in advance the first weight corresponding to the sentence where the candidate phrase is located and the second weight corresponding to the position of that sentence; in this step, the server obtains the stored first weight and second weight.

The server stores a correspondence between keywords and third weights. Accordingly, the step in which the server obtains the third weight corresponding to each keyword included in the candidate phrase may be:

According to each keyword included in the candidate phrase, the server obtains the third weight corresponding to that keyword from the correspondence between keywords and third weights.

It should be noted that the third weights corresponding to the keywords may be the same or different. For example, the first weight corresponding to the sentence where the candidate phrase is located is a1=0.1, the second weight corresponding to the position of that sentence is a2=0.55, and the third weight corresponding to each keyword included in the candidate phrase is a3=0.35.

(4-b): For each keyword included in the candidate phrase, the server multiplies the score of the sentence where the candidate phrase is located by the first weight to obtain a first value, multiplies the score of the position of that sentence by the second weight to obtain a second value, multiplies the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and adds the first value, the second value, and the third value to obtain the second score of the keyword.

Similarly, single-character words contribute little to understanding an article. Therefore, when determining the second score of a keyword that is a single-character word, the server reduces the weight of the keyword. Words in quotation marks that occur multiple times in the text message contribute more to understanding the article. Therefore, when determining the second score of a keyword that occurs multiple times in quotation marks, the server increases the weight of the keyword. Accordingly, the step in which the server adds the first value, the second value, and the third value to obtain the second score of the keyword may be:

The server determines the contribution degree of the keyword; determines, according to the contribution degree, the fourth weight corresponding to the keyword; adds the first value, the second value, and the third value to obtain a fourth value; and multiplies the fourth value by the fourth weight to obtain the second score of the keyword.
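Step (4-b), together with the optional fourth weight, amounts to a weighted sum. The sketch below uses the example weights a1=0.1, a2=0.55, and a3=0.35 from the text as defaults; the function and parameter names are hypothetical, and the fourth weight defaults to 1.0, meaning no adjustment.

```python
def weighted_second_score(sentence_score, position_score, first_score,
                          a1=0.1, a2=0.55, a3=0.35, fourth_weight=1.0):
    """Weighted sum of the three scores, scaled by the fourth weight."""
    first_value = sentence_score * a1    # sentence score times first weight
    second_value = position_score * a2   # position score times second weight
    third_value = first_score * a3       # first score times third weight
    fourth_value = first_value + second_value + third_value
    return fourth_value * fourth_weight
```

With the example weights, the position of the sentence dominates the result, which is consistent with the text's emphasis on titles and abstracts.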

(4-c): The server determines the score of the candidate phrase according to the second score of each keyword.

This step is the same as step (4-2) and is not repeated here.

Step 203: Based on the score of each candidate phrase, the server selects the preset number of highest-scoring candidate phrases from the candidate phrase set.

Based on the score of each candidate phrase, the server sorts the candidate phrases in descending order of score and outputs the preset number of candidate phrases ranked first.

The preset number may be set and changed as needed and is not specifically limited in the embodiments of the present invention. For example, the preset number may be 8 or 10.
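Step 203 is a sort followed by a truncation. A minimal sketch, assuming the scores are held in a dictionary keyed by phrase (the data structure and names are hypothetical):

```python
def select_top_phrases(phrase_scores, preset_number=8):
    """Return the preset number of phrases with the highest scores."""
    ranked = sorted(phrase_scores.items(), key=lambda item: item[1], reverse=True)
    return [phrase for phrase, _ in ranked[:preset_number]]
```

Because Python's sort is stable, phrases with equal scores keep their original relative order.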

In the embodiments of the present invention, candidate phrases may include meaningless keywords, for example auxiliary words, prepositions, modal particles, and numerals. Therefore, after selecting the preset number of candidate phrases, the server filters out the keywords of a second preset part of speech from those candidate phrases.

The second preset part of speech may be the same as or different from the first preset part of speech; this is not specifically limited in the embodiments of the present invention. For example, keywords of the second preset part of speech may be auxiliary words, prepositions, modal particles, or numerals.
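The filtering of keywords by part of speech can be sketched as follows. The representation of a phrase as a list of (word, tag) pairs and the tag letters themselves are assumptions made for illustration; the source does not name a tag set.

```python
# Hypothetical part-of-speech tags: u = auxiliary word, p = preposition,
# y = modal particle, m = numeral.
SECOND_PRESET_POS = {"u", "p", "y", "m"}


def filter_keywords(phrase, excluded_pos=SECOND_PRESET_POS):
    """Drop keywords whose part of speech is in the excluded set."""
    return [(word, tag) for word, tag in phrase if tag not in excluded_pos]
```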

In the embodiments of the present invention, the server may provide concept label information and event label information. The concept label information includes the most central concept phrases of the text message, and the event label information includes the core events in the text message. After performing step 203, the server generates the concept label information through the following step 204 and generates the event label information through the following step 205.

Step 204: The server selects candidate phrases of the concept type from the preset number of candidate phrases to form the concept label information.

A candidate phrase of the concept type is a candidate phrase that includes a noun.

Step 205: The server selects candidate phrases of the event type from the preset number of candidate phrases to form the event label information.

A candidate phrase of the event type is a candidate phrase that includes a verb.

It should be noted that step 204 and step 205 have no required order: step 204 may be performed before step 205, or step 205 may be performed before step 204.

Because the server may make errors when classifying concept label information and event label information, the server may also perform phrase correction. The specific process may be:

The server moves candidate phrases in the concept label information that end with a keyword of a third preset part of speech into the event label information; and/or moves candidate phrases in the event label information that include no keyword of the third preset part of speech into the concept label information.

The third preset part of speech may be set and changed as needed and is not specifically limited in the embodiments of the present invention. For example, the third preset part of speech may be the verb.
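Steps 204 and 205 and the subsequent phrase correction can be sketched as below. Phrases are assumed to be non-empty lists of (word, tag) pairs, and the tags "n" for nouns and "v" for verbs are hypothetical; all function names are illustrative only.

```python
def contains_pos(phrase, pos):
    """True if any keyword in the phrase carries the given part of speech."""
    return any(tag == pos for _, tag in phrase)


def split_labels(phrases, noun="n", verb="v"):
    """Steps 204 and 205: concept phrases contain a noun, event phrases a verb."""
    concept = [p for p in phrases if contains_pos(p, noun)]
    event = [p for p in phrases if contains_pos(p, verb)]
    return concept, event


def correct_labels(concept, event, third_pos="v"):
    """Move misfiled phrases between the concept and event label lists."""
    to_event = [p for p in concept if p[-1][1] == third_pos]
    to_concept = [p for p in event if not contains_pos(p, third_pos)]
    new_concept = [p for p in concept if p not in to_event]
    new_concept += [p for p in to_concept if p not in new_concept]
    new_event = [p for p in event if p not in to_concept]
    new_event += [p for p in to_event if p not in new_event]
    return new_concept, new_event
```

A phrase that contains both a noun and a verb initially lands in both lists; the correction removes it from the concept labels when it ends with a verb.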

After the server extracts the label information, the server stores a correspondence between the identifier of the text message and the label information of the text message. A terminal can obtain the label information from the server. The specific process may be:

The terminal sends an acquisition request to the server, and the acquisition request carries the identifier of the text message whose label information is to be obtained. The server receives the acquisition request sent by the terminal, obtains the label information of the text message from the correspondence between identifiers and label information according to the identifier of the text message, and sends the label information of the text message to the terminal. The terminal receives the label information of the text message sent by the server and displays it, so that the user can quickly understand the gist of the text message based on its label information. The identifier of the text message may be the title, URL, storage path, or number of the text message.

In existing methods, the label information extracted by LDA (Latent Dirichlet Allocation, a document topic generation model) consists of single-word (unigram) labels, whereas the label information extracted on the basis of a syntax tree in the embodiments of the present invention consists of multi-word labels, and both concept label information and event label information are extracted.

The present invention provides an apparatus for extracting label information. The apparatus is applied in a server and is configured to perform the operations of the server in the above method for extracting label information. Referring to Fig. 3, the apparatus includes:

A word segmentation module 301, configured to segment a text message to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword;

A scoring module 302, configured to, for each candidate phrase in the candidate phrase set, determine the score of the sentence where the candidate phrase is located, determine the score of the position of that sentence, and determine the first score of each keyword included in the candidate phrase, and to determine the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase;

A selection module 303, configured to select, based on the score of each candidate phrase, the preset number of highest-scoring candidate phrases from the candidate phrase set;

A composition module 304, configured to form the label information of the text message from the preset number of candidate phrases.

In a possible implementation, the scoring module 302 is further configured to: for each keyword, determine a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text message and the second occurrence count is the total number of occurrences of all keywords included in the text message; determine the term frequency of the keyword according to the first occurrence count and the second occurrence count; determine a first quantity and a second quantity, where the first quantity is the number of sample text messages included in a sample text message library and the second quantity is the number of text messages in the sample text message library that include the keyword; determine the inverse document frequency of the keyword according to the first quantity and the second quantity; and determine the first score of the keyword according to the term frequency and the inverse document frequency.

In a possible implementation, the scoring module 302 is further configured to determine a first position, in the text message, of the paragraph containing the sentence where the candidate phrase is located, and a second position of the candidate phrase within that paragraph; and to determine the score of the position of the sentence where the candidate phrase is located according to the first position and the second position.

In a possible implementation, the scoring module 302 is further configured to: determine the first weight corresponding to the sentence where the candidate phrase is located, the second weight corresponding to the position of that sentence, and the third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiply the score of the sentence where the candidate phrase is located by the first weight to obtain a first value, multiply the score of the position of that sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain the second score of the keyword; and determine the score of the candidate phrase according to the second score of each keyword.

In a possible implementation, the scoring module 302 is further configured to: determine the contribution degree of the keyword and, according to the contribution degree of the keyword, determine the fourth weight corresponding to the keyword; add the first value, the second value, and the third value to obtain a fourth value; and multiply the fourth value by the fourth weight to obtain the second score of the keyword.

In a possible implementation, the composition module 304 is further configured to select candidate phrases of the concept type from the preset number of candidate phrases to form concept label information, and to select candidate phrases of the event type from the preset number of candidate phrases to form event label information.

In a possible implementation, the apparatus further includes:

In a possible implementation, the word segmentation module 301 is further configured to: split the text message into sentences to obtain at least one candidate sentence, and form the at least one candidate sentence into a candidate sentence set; segment each candidate sentence in the candidate sentence set to obtain at least one keyword, and form the at least one keyword into a keyword set; generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and form the at least one candidate phrase into the candidate phrase set.

In a possible implementation, the word segmentation module 301 is further configured to determine the sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, delete from the candidate sentence set the candidate sentences whose sentence component is a preset component.

It should be noted that when the apparatus for extracting label information provided by the above embodiment extracts label information, the division into the above functional modules is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for extracting label information provided by the above embodiment and the method embodiment for extracting label information belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.

Fig. 4 shows a server for extracting label information according to an exemplary embodiment. Referring to Fig. 4, the server 400 includes a processing component 422, which further includes one or more processors, and memory resources represented by a memory 432 for storing instructions executable by the processing component 422, such as application programs. The application programs stored in the memory 432 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 422 is configured to execute the instructions so as to perform the functions performed by the server in the above method for extracting label information.

The server 400 may also include a power supply component 426 configured to perform power management of the server 400, a wired or wireless network interface 450 configured to connect the server 400 to a network, and an input/output (I/O) interface 458. The server 400 may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

An embodiment of the present invention further provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the memory in the above embodiment, or it may be a standalone computer-readable storage medium that is not assembled into the server. The computer-readable storage medium stores one or more programs, and the one or more programs are used by one or more processors to perform the method for extracting label information.

A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for extracting label information, characterized in that the method comprises:

segmenting a text message to obtain a candidate phrase set, wherein the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword;

for each candidate phrase in the candidate phrase set, determining the score of the sentence where the candidate phrase is located, determining the score of the position of that sentence, and determining the first score of each keyword included in the candidate phrase, and determining the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase;

selecting, based on the score of each candidate phrase, the preset number of highest-scoring candidate phrases from the candidate phrase set;

forming the label information of the text message from the preset number of candidate phrases.
2. The method according to claim 1, characterized in that the determining the first score of each keyword included in the candidate phrase comprises:

for each keyword, determining a first occurrence count and a second occurrence count, wherein the first occurrence count is the number of times the keyword occurs in the text message and the second occurrence count is the total number of occurrences of all keywords included in the text message;

determining the term frequency of the keyword according to the first occurrence count and the second occurrence count;

determining a first quantity and a second quantity, wherein the first quantity is the number of sample text messages included in a sample text message library and the second quantity is the number of text messages in the sample text message library that include the keyword;

determining the inverse document frequency of the keyword according to the first quantity and the second quantity;

determining the first score of the keyword according to the term frequency and the inverse document frequency.
3. The method according to claim 1, characterized in that the determining the score of the position of the sentence where the candidate phrase is located comprises:

determining a first position, in the text message, of the paragraph containing the sentence where the candidate phrase is located, and a second position of the candidate phrase within that paragraph;

determining the score of the position of the sentence where the candidate phrase is located according to the first position and the second position.
4. The method according to claim 1, characterized in that the determining the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase comprises:

determining the first weight corresponding to the sentence where the candidate phrase is located, the second weight corresponding to the position of that sentence, and the third weight corresponding to each keyword included in the candidate phrase;

for each keyword included in the candidate phrase, multiplying the score of the sentence where the candidate phrase is located by the first weight to obtain a first value, multiplying the score of the position of that sentence by the second weight to obtain a second value, multiplying the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and adding the first value, the second value, and the third value to obtain the second score of the keyword;

determining the score of the candidate phrase according to the second score of each keyword.
5. The method according to claim 4, characterized in that the adding the first value, the second value, and the third value to obtain the second score of the keyword comprises:

determining the contribution degree of the keyword and, according to the contribution degree of the keyword, determining the fourth weight corresponding to the keyword;

adding the first value, the second value, and the third value to obtain a fourth value;

multiplying the fourth value by the fourth weight to obtain the second score of the keyword.
6. The method according to claim 1, characterized in that the forming the label information of the text message from the preset number of candidate phrases comprises:

selecting candidate phrases of the concept type from the preset number of candidate phrases to form concept label information; and/or

selecting candidate phrases of the event type from the preset number of candidate phrases to form event label information.
7. The method according to claim 6, characterized in that the method further comprises:

moving candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information;

moving candidate phrases in the event label information that include no keyword of the preset part of speech into the concept label information.
8. The method according to any one of claims 1-7, characterized in that the segmenting the text message to obtain the candidate phrase set comprises:

splitting the text message into sentences to obtain at least one candidate sentence, and forming the at least one candidate sentence into a candidate sentence set;

segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming the at least one keyword into a keyword set;

generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;

forming the at least one candidate phrase into the candidate phrase set.
9. The method according to claim 8, characterized in that before the segmenting each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further comprises:

determining the sentence component of each candidate sentence in the candidate sentence set;

deleting from the candidate sentence set, according to the sentence component of each candidate sentence, the candidate sentences whose sentence component is a preset component.
10. An apparatus for extracting label information, characterized in that the apparatus comprises:

a word segmentation module, configured to segment a text message to obtain a candidate phrase set, wherein the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword;

a scoring module, configured to, for each candidate phrase in the candidate phrase set, determine the score of the sentence where the candidate phrase is located, determine the score of the position of that sentence, and determine the first score of each keyword included in the candidate phrase, and to determine the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase;

a selection module, configured to select, based on the score of each candidate phrase, the preset number of highest-scoring candidate phrases from the candidate phrase set;

a composition module, configured to form the label information of the text message from the preset number of candidate phrases.