Summary of the Invention
To solve the problems of the prior art, the invention provides a method and an apparatus for extracting label information. The technical solution is as follows:
In a first aspect, the invention provides a method for extracting label information, the method comprising:
segmenting text information to obtain a candidate phrase set, the candidate phrase set including at least one candidate phrase, each candidate phrase including at least one keyword;
for each candidate phrase in the candidate phrase set, determining a score of the sentence in which the candidate phrase is located, determining a score of the position of that sentence, and determining a first score of each keyword included in the candidate phrase; and determining a score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase;
selecting, based on the score of each candidate phrase, a preset number of candidate phrases with the highest scores from the candidate phrase set;
forming the label information of the text information from the preset number of candidate phrases.
In a possible implementation, determining the first score of each keyword included in the candidate phrase includes:
for each keyword, determining a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information;
determining the term frequency of the keyword according to the first occurrence count and the second occurrence count;
determining a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword;
determining the inverse document frequency of the keyword according to the first quantity and the second quantity;
determining the first score of the keyword according to the term frequency and the inverse document frequency.
In a possible implementation, determining the score of the position of the sentence in which the candidate phrase is located includes:
determining a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in that paragraph;
determining the score of the position of the sentence according to the first position and the second position.
In a possible implementation, determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase includes:
determining a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase;
for each keyword included in the candidate phrase, multiplying the score of the sentence by the first weight to obtain a first value, multiplying the score of the position of the sentence by the second weight to obtain a second value, multiplying the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and adding the first value, the second value, and the third value to obtain a second score of the keyword;
determining the score of the candidate phrase according to the second score of each keyword.
In a possible implementation, adding the first value, the second value, and the third value to obtain the second score of the keyword includes:
determining a contribution degree of the keyword, and determining a fourth weight corresponding to the keyword according to the contribution degree;
adding the first value, the second value, and the third value to obtain a fourth value;
multiplying the fourth value by the fourth weight to obtain the second score of the keyword.
In a possible implementation, forming the label information of the text information from the preset number of candidate phrases includes:
selecting candidate phrases of a concept type from the preset number of candidate phrases to form concept label information; and/or
selecting candidate phrases of an event type from the preset number of candidate phrases to form event label information.
In a possible implementation, the method further includes:
moving candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information;
moving candidate phrases in the event label information that do not include any keyword of the preset part of speech into the concept label information.
In a possible implementation, segmenting the text information to obtain the candidate phrase set includes:
splitting the text information into sentences to obtain at least one candidate sentence, and forming a candidate sentence set from the at least one candidate sentence;
performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming a keyword set from the at least one keyword;
generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;
forming the candidate phrase set from the at least one candidate phrase.
In a possible implementation, before performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:
determining the sentence component of each candidate sentence in the candidate sentence set;
deleting, according to the sentence component of each candidate sentence, the candidate sentences in the candidate sentence set whose sentence component is a preset component.
In a second aspect, the invention provides an apparatus for extracting label information, the apparatus comprising:
a word segmentation module, configured to segment text information to obtain a candidate phrase set, the candidate phrase set including at least one candidate phrase, each candidate phrase including at least one keyword;
a scoring module, configured to, for each candidate phrase in the candidate phrase set, determine a score of the sentence in which the candidate phrase is located, determine a score of the position of that sentence, and determine a first score of each keyword included in the candidate phrase, and to determine a score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase;
a selection module, configured to select, based on the score of each candidate phrase, a preset number of candidate phrases with the highest scores from the candidate phrase set;
a composition module, configured to form the label information of the text information from the preset number of candidate phrases.
In a possible implementation, the scoring module is further configured to: for each keyword, determine a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information; determine the term frequency of the keyword according to the first occurrence count and the second occurrence count; determine a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword; determine the inverse document frequency of the keyword according to the first quantity and the second quantity; and determine the first score of the keyword according to the term frequency and the inverse document frequency.
In a possible implementation, the scoring module is further configured to determine a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in that paragraph; and to determine the score of the position of the sentence according to the first position and the second position.
In a possible implementation, the scoring module is further configured to: determine a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiply the score of the sentence by the first weight to obtain a first value, multiply the score of the position of the sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain a second score of the keyword; and determine the score of the candidate phrase according to the second score of each keyword.
In a possible implementation, the scoring module is further configured to: determine a contribution degree of the keyword, and determine a fourth weight corresponding to the keyword according to the contribution degree; add the first value, the second value, and the third value to obtain a fourth value; and multiply the fourth value by the fourth weight to obtain the second score of the keyword.
In a possible implementation, the composition module is further configured to select candidate phrases of a concept type from the preset number of candidate phrases to form concept label information; and/or to select candidate phrases of an event type from the preset number of candidate phrases to form event label information.
In a possible implementation, the apparatus further includes:
a moving module, configured to move candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information; and/or
the moving module is further configured to move candidate phrases in the event label information that do not include any keyword of the preset part of speech into the concept label information.
In a possible implementation, the word segmentation module is further configured to: split the text information into sentences to obtain at least one candidate sentence, and form a candidate sentence set from the at least one candidate sentence; perform word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and form a keyword set from the at least one keyword; generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and form the candidate phrase set from the at least one candidate phrase.
In a possible implementation, the word segmentation module is further configured to determine the sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, delete the candidate sentences in the candidate sentence set whose sentence component is a preset component.
In the embodiments of the invention, text information is segmented to obtain a candidate phrase set that includes at least one candidate phrase, and label information is extracted based on the candidate phrases in the set, so multi-word label information can be extracted. In addition, for each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined from these three. Because the sentence score, the position score, and the keyword scores are combined, the score of the candidate phrase is determined more accurately, which in turn improves the accuracy of the extracted label information.
Detailed Description of the Embodiments
To make the objectives, technical solutions, and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the accompanying drawings.
An embodiment of the invention provides a method for extracting label information. Referring to Fig. 1, the method includes:
Step 101: Segment text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword.
Step 102: For each candidate phrase in the candidate phrase set, determine the score of the sentence in which the candidate phrase is located, determine the score of the position of that sentence, and determine the first score of each keyword included in the candidate phrase; then determine the score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase.
Step 103: Based on the score of each candidate phrase, select a preset number of candidate phrases with the highest scores from the candidate phrase set.
Step 104: Form the label information of the text information from the preset number of candidate phrases.
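Steps 103 and 104 amount to a top-N selection over scored candidate phrases. The following is a minimal sketch under stated assumptions: the function name `select_top_phrases`, the dictionary layout, and the sample scores are hypothetical and do not come from the embodiment.

```python
import heapq


def select_top_phrases(phrase_scores, preset_number):
    """Return the preset_number candidate phrases with the highest scores.

    phrase_scores maps each candidate phrase to its score (hypothetical layout).
    """
    # heapq.nlargest returns the items in descending score order.
    top = heapq.nlargest(preset_number, phrase_scores.items(), key=lambda kv: kv[1])
    return [phrase for phrase, _ in top]


# Illustrative scores only; real scores would come from step 102.
scores = {"label extraction": 0.82, "candidate phrase": 0.67, "server": 0.21}
labels = select_top_phrases(scores, 2)
```

Here `labels` holds the two highest-scoring phrases, which together would form the label information of the text.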
In a possible implementation, determining the first score of each keyword included in the candidate phrase includes:
for each keyword, determining a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information;
determining the term frequency of the keyword according to the first occurrence count and the second occurrence count;
determining a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information in a sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword;
determining the inverse document frequency of the keyword according to the first quantity and the second quantity;
determining the first score of the keyword according to the term frequency and the inverse document frequency.
In a possible implementation, determining the score of the position of the sentence in which the candidate phrase is located includes:
determining a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in that paragraph;
determining the score of the position of the sentence according to the first position and the second position.
In a possible implementation, determining the score of the candidate phrase according to the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase includes:
determining a first weight corresponding to the sentence in which the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase;
for each keyword included in the candidate phrase, multiplying the score of the sentence by the first weight to obtain a first value, multiplying the score of the position of the sentence by the second weight to obtain a second value, multiplying the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and adding the first value, the second value, and the third value to obtain a second score of the keyword;
determining the score of the candidate phrase according to the second score of each keyword.
In a possible implementation, adding the first value, the second value, and the third value to obtain the second score of the keyword includes:
determining a contribution degree of the keyword, and determining a fourth weight corresponding to the keyword according to the contribution degree;
adding the first value, the second value, and the third value to obtain a fourth value;
multiplying the fourth value by the fourth weight to obtain the second score of the keyword.
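The weighted combination described above can be sketched as follows. This is an illustration under assumptions, not the embodiment's implementation: the function names are hypothetical, the fourth weight is taken as an input because the mapping from contribution degree to weight is not given here, and summing the second scores to obtain the phrase score is an assumed aggregation.

```python
def keyword_second_score(sentence_score, position_score, first_score,
                         w1, w2, w3, w4=1.0):
    """Second score of one keyword from the three weighted values."""
    first_value = sentence_score * w1    # sentence score x first weight
    second_value = position_score * w2   # position score x second weight
    third_value = first_score * w3       # keyword first score x third weight
    fourth_value = first_value + second_value + third_value
    # w4 is the contribution-based fourth weight; 1.0 leaves the sum unchanged.
    return fourth_value * w4


def phrase_score(sentence_score, position_score, keywords, w1, w2):
    """Score of a candidate phrase.

    keywords is a list of (first_score, w3, w4) tuples, one per keyword.
    Assumption: the phrase score is the sum of the keywords' second scores.
    """
    return sum(
        keyword_second_score(sentence_score, position_score, fs, w1, w2, w3, w4)
        for fs, w3, w4 in keywords
    )


# Illustrative numbers only.
example = phrase_score(0.5, 0.2, [(0.1, 0.4, 1.0), (0.3, 0.4, 0.5)], w1=0.4, w2=0.2)
```

Other aggregations, such as the mean or the maximum of the second scores, would fit the same description; the choice affects whether longer phrases are favored.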
In a possible implementation, forming the label information of the text information from the preset number of candidate phrases includes:
selecting candidate phrases of a concept type from the preset number of candidate phrases to form concept label information; and/or
selecting candidate phrases of an event type from the preset number of candidate phrases to form event label information.
In a possible implementation, the method further includes:
moving candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information;
moving candidate phrases in the event label information that do not include any keyword of the preset part of speech into the concept label information.
In a possible implementation, segmenting the text information to obtain the candidate phrase set includes:
splitting the text information into sentences to obtain at least one candidate sentence, and forming a candidate sentence set from the at least one candidate sentence;
performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and forming a keyword set from the at least one keyword;
generating at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm;
forming the candidate phrase set from the at least one candidate phrase.
In a possible implementation, before performing word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, the method further includes:
determining the sentence component of each candidate sentence in the candidate sentence set;
deleting, according to the sentence component of each candidate sentence, the candidate sentences in the candidate sentence set whose sentence component is a preset component.
In the embodiments of the invention, text information is segmented to obtain a candidate phrase set that includes at least one candidate phrase, and label information is extracted based on the candidate phrases in the set, so multi-word label information can be extracted. In addition, for each candidate phrase in the candidate phrase set, the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined from these three. Because the sentence score, the position score, and the keyword scores are combined, the score of the candidate phrase is determined more accurately, which in turn improves the accuracy of the extracted label information.
An embodiment of the invention provides a method for extracting label information, applied in a server. Referring to Fig. 2, the method includes:
Step 201: The server segments text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword.
To improve the user's reading efficiency, before a user obtains text information from the server through a terminal, the server extracts label information from the text information; the label information indicates the gist of the text information. When the user obtains the text information from the server through the terminal, the server sends the label information of the text information to the terminal. The terminal receives the label information sent by the server and displays it, so that the user can quickly grasp the gist of the text information from the label information. The text information may be any information that includes text, for example, electronic news, social network information, product reviews, web page information, or e-mail. In the embodiments of the invention, the text information is not specifically limited.
When segmenting the text information to obtain the candidate phrase set, the server may perform word segmentation on the text information to obtain a keyword set and then generate the candidate phrase set based on the keyword set. Accordingly, this step may be implemented by the following steps (1) to (4):
(1): The server splits the text information into sentences to obtain at least one candidate sentence, and forms a candidate sentence set from the at least one candidate sentence.
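Sentence splitting of this kind can be sketched with a regular expression. The set of sentence-ending punctuation marks and the function name `split_sentences` are assumptions for illustration; the embodiment does not specify how sentence boundaries are detected.

```python
import re

# Assumed sentence terminators: Chinese and Western full stops,
# question marks, and exclamation marks.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]?")


def split_sentences(text):
    """Split text into candidate sentences, keeping the terminator."""
    pieces = (m.strip() for m in _SENTENCE.findall(text))
    # Drop fragments that were whitespace only.
    return [p for p in pieces if p]


candidate_sentences = split_sentences(
    "Servers extract labels. Users read them! Done?"
)
```

A pattern this simple also splits at decimal points and abbreviations, so a production splitter would need extra rules.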
For understanding an article, the main clause of a sentence contributes far more than its subordinate clauses. Moreover, when candidate phrases are later extracted by the syntax tree algorithm, the extraction speed depends largely on sentence length. To speed up the extraction of candidate phrases, the server filters out adverbial clauses. Accordingly, after the server forms the candidate sentence set from the at least one candidate sentence, the method further includes:
The server determines the sentence component of each candidate sentence in the candidate sentence set and, according to the sentence component of each candidate sentence, deletes the candidate sentences in the candidate sentence set whose sentence component is a preset component.
The preset component may be an adverbial, an attributive, or the like. In the embodiments of the invention, the preset component is not specifically limited.
(2): The server performs word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and forms a keyword set from the at least one keyword.
Some keywords, such as auxiliary words and modal particles, contribute little to the text information. Therefore, to reduce the amount of computation and improve accuracy, the server may also remove such keywords in this step. Accordingly, after the server forms the keyword set from the at least one keyword, the method further includes:
The server tags the part of speech of each keyword in the keyword set, searches the keyword set for keywords of a first preset part of speech according to the part of speech of each keyword, and removes the keywords of the first preset part of speech from the keyword set. The first preset part of speech may be auxiliary word, preposition, modal particle, numeral, or the like. In the embodiments of the invention, the first preset part of speech is not specifically limited.
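Removing keywords by part of speech can be sketched as a simple filter. The tag names and the assumption that keywords arrive already tagged as (word, part-of-speech) pairs are hypothetical; a real system would obtain the tags from a part-of-speech tagger.

```python
# Hypothetical tag names for the first preset parts of speech.
FIRST_PRESET_POS = {"auxiliary", "preposition", "modal_particle", "numeral"}


def remove_preset_pos(tagged_keywords, preset_pos=FIRST_PRESET_POS):
    """Drop keywords whose part of speech is in the preset set.

    tagged_keywords is a list of (word, pos) pairs (assumed layout).
    """
    return [(word, pos) for word, pos in tagged_keywords if pos not in preset_pos]


tagged = [
    ("server", "noun"),
    ("of", "preposition"),
    ("extracts", "verb"),
    ("three", "numeral"),
    ("labels", "noun"),
]
filtered = remove_preset_pos(tagged)
```

Filtering before the syntax tree step keeps the tree smaller, which matches the stated goal of reducing the amount of computation.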
(3): The server generates at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm.
The server inputs each keyword in the keyword set into a syntax tree model, which includes the syntax tree algorithm, and the syntax tree algorithm builds a keyword tree from the keywords in the keyword set. The keyword tree includes multiple nodes and the relations between them; each node in the keyword tree is a keyword. The server generates at least one candidate phrase based on the keyword tree.
The keyword tree contains keywords and the relations between them, and only one or more keywords that are related to each other can form a candidate phrase. Accordingly, the step in which the server generates at least one candidate phrase based on the keyword tree may be:
For each leaf node in the keyword tree, the server selects the keyword of the leaf node and the keyword of the parent node of the leaf node from the keyword tree to form a candidate phrase.
In this step, the server may also include the keyword of the parent node of that parent node when generating a candidate phrase. Accordingly, the step in which the server generates at least one candidate phrase based on the keyword tree may be:
For each leaf node in the keyword tree, the server selects the keyword of the leaf node from the keyword tree, obtains the keyword of the parent node of the leaf node, then the keyword of the parent node of that parent node, and so on until the keyword of the root node is obtained; the obtained keywords form a candidate phrase.
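Both variants of phrase generation walk from a leaf toward the root of the keyword tree. The sketch below assumes the tree is given as a child-to-parent dictionary with unique keywords, and that the keywords of a phrase are listed from the topmost ancestor down to the leaf; neither the data layout nor the ordering is specified in the embodiment, and the names are hypothetical.

```python
def leaf_phrases(parent, max_ancestors=None):
    """Build one candidate phrase per leaf of a keyword tree.

    parent maps each keyword to its parent keyword (None for the root).
    max_ancestors=1 gives leaf plus parent; None walks up to the root.
    """
    internal = {p for p in parent.values() if p is not None}
    leaves = sorted(set(parent) - internal)  # sorted for a stable order
    phrases = []
    for leaf in leaves:
        path = []
        node = leaf
        # path already contains the leaf, so len(path) - 1 ancestors so far.
        while node is not None and (max_ancestors is None
                                    or len(path) <= max_ancestors):
            path.append(node)
            node = parent[node]
        phrases.append(path[::-1])  # topmost ancestor first (assumed order)
    return phrases


# Hypothetical keyword tree rooted at "extracts".
tree = {"extracts": None, "server": "extracts",
        "information": "extracts", "label": "information"}
full_paths = leaf_phrases(tree)
parent_only = leaf_phrases(tree, max_ancestors=1)
```

With this tree, the full-path variant yields one phrase per leaf that reaches the root, while `max_ancestors=1` corresponds to the leaf-and-parent variant.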
In the embodiments of the invention, extracting candidate phrases by the syntax tree algorithm ensures that the extracted candidate phrases are multi-word phrases and that the keywords within a candidate phrase are semantically coherent, which improves the accuracy of the label information generated from the candidate phrases.
(4): The server forms the candidate phrase set from the at least one candidate phrase.
After obtaining the candidate phrase set, the server determines the score of each candidate phrase in the candidate phrase set through the following step 202.
Step 202: For each candidate phrase in the candidate phrase set, the server determines the score of the sentence in which the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase, and determines the score of the candidate phrase according to the score of the sentence, the score of the position of the sentence, and the first score of each keyword included in the candidate phrase.
For each candidate phrase in the candidate phrase set, the server may determine the score of the candidate phrase through the following steps (1) to (4).
(1): The server determines the score of the sentence in which the candidate phrase is located.
The server obtains the sentence in which the candidate phrase is located and determines the score of the sentence by the BM25 algorithm.
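The embodiment names BM25 but does not say what plays the role of the query or how the parameters are set. The sketch below is therefore only one possible reading: each sentence is treated as a document, the query is an assumed list of keywords, `k1` and `b` use commonly seen default values, and the IDF term uses a non-negative variant. All names are hypothetical.

```python
import math


def bm25_sentence_scores(sentences, query, k1=1.5, b=0.75):
    """Okapi BM25 score of each sentence against a keyword query.

    sentences is a non-empty list of token lists; query is a list of tokens.
    Assumes at least one sentence is non-empty.
    """
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    scores = []
    for sentence in sentences:
        score = 0.0
        for term in set(query):
            # Number of sentences that contain the term.
            df = sum(1 for other in sentences if term in other)
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            tf = sentence.count(term)
            norm = tf + k1 * (1 - b + b * len(sentence) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


sents = [["server", "extracts", "label"],
         ["user", "reads", "text"],
         ["label", "label", "server"]]
sentence_scores = bm25_sentence_scores(sents, ["label"])
```

A sentence that does not contain any query term scores zero, and repeated occurrences of a term raise the score with diminishing returns.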
(2): The server determines the score of the position of the sentence in which the candidate phrase is located.
Sentences at different positions have different scores. For example, a sentence in the abstract better reflects the gist of the text information, so a sentence in the abstract corresponds to a higher score and a sentence in the body corresponds to a lower score. Therefore, before this step, the server stores a correspondence between positions and scores. The position may be the abstract or the body. Accordingly, this step may be:
The server determines the position, in the text information, of the sentence in which the candidate phrase is located and, according to that position, obtains the score of the position of the sentence from the correspondence between positions and scores.
In this step, the server may also calculate the score of the position of the sentence from the position of the sentence in which the candidate phrase is located. Accordingly, this step may be:
The server determines a first position, in the text information, of the paragraph containing the sentence in which the candidate phrase is located, and a second position of the candidate phrase in that paragraph, and determines the score of the position of the sentence according to the first position and the second position.
The first position may be the title, the abstract, or the body. The second position may be the first sentence of the first paragraph, a non-first sentence of the first paragraph, the first sentence of a non-first paragraph, or a non-first sentence of a non-first paragraph. Accordingly, the step in which the server determines the score of the position of the sentence in which the candidate phrase is located according to the first position and the second position may be:
The server obtains, according to the first position, a first sub-score corresponding to the first position from a correspondence between first positions and sub-scores; obtains, according to the second position, a second sub-score corresponding to the second position from a correspondence between second positions and sub-scores; and multiplies the first sub-score by the second sub-score to obtain the score of the position of the sentence in which the candidate phrase is located.
For example, when the first position is the title, the first sub-score is 0.9; when the first position is the abstract, the first sub-score is 0.6; when the first position is the body, the first sub-score is 0.3. When the second position is the first sentence of the first paragraph, the second sub-score is 0.3; when the second position is a non-first sentence of the first paragraph, the second sub-score is 0.3*0.4=0.12; when the second position is the first sentence of a non-first paragraph, the second sub-score is 0.1; when the second position is a non-first sentence of a non-first paragraph, the second sub-score is 0.1*0.4=0.04.
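The position score in this example is a product of two table lookups, which can be sketched as follows. The sub-score values are the ones given in the example; the key names and the function name are hypothetical.

```python
# First sub-scores by where the paragraph sits in the text.
FIRST_SUB_SCORES = {"title": 0.9, "abstract": 0.6, "body": 0.3}

# Second sub-scores by (paragraph kind, sentence kind).
SECOND_SUB_SCORES = {
    ("first_paragraph", "first_sentence"): 0.3,
    ("first_paragraph", "other_sentence"): 0.3 * 0.4,
    ("other_paragraph", "first_sentence"): 0.1,
    ("other_paragraph", "other_sentence"): 0.1 * 0.4,
}


def position_score(first_position, second_position):
    """Multiply the first sub-score by the second sub-score."""
    return FIRST_SUB_SCORES[first_position] * SECOND_SUB_SCORES[second_position]


abstract_lead = position_score("abstract", ("first_paragraph", "first_sentence"))
body_tail = position_score("body", ("other_paragraph", "other_sentence"))
```

Keeping the two correspondences as separate tables mirrors the description and lets either table be tuned without touching the other.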
(3): The server determines the first score of each keyword included in the candidate phrase.
For each keyword included in the candidate phrase, the server determines the first score of the keyword through the following steps (3-1) to (3-5):
(3-1): The server determines a first occurrence count and a second occurrence count, where the first occurrence count is the number of times the keyword occurs in the text information, and the second occurrence count is the total number of occurrences of all keywords included in the text information.
(3-2): The server determines the term frequency of the keyword according to the first occurrence count and the second occurrence count.
The server takes the ratio of the first occurrence count to the second occurrence count as the term frequency of the keyword.
(3-3): The server determines a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information in the sample text information library, and the second quantity is the number of pieces of text information in the sample text information library that include the keyword.
A sample text information library is stored in the server in advance and includes at least one piece of sample text information. In this step, the server counts the pieces of sample text information in the sample text information library; for ease of description, this count is called the first quantity. The server also counts the pieces of text information in the sample text information library that include the keyword; this count is called the second quantity.
(3-4): The server determines the inverse document frequency of the keyword according to the first quantity and the second quantity.
The server determines the inverse document frequency of the keyword from the first quantity and the second quantity by the following formula one:
Formula one: $\mathrm{idf} = \log\frac{D}{J}$, where idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.
Because the sample text information library may contain no text information that includes the keyword, the second quantity may be zero. Therefore, this step may instead be:
The server determines the inverse document frequency of the keyword from the first quantity and the second quantity by the following formula two:
Formula two: $\mathrm{idf} = \log\frac{D}{J + 1}$, where idf is the inverse document frequency of the keyword, D is the first quantity, and J is the second quantity.
(3-5): The server determines the first score of the keyword according to the word frequency and the inverse document frequency of the keyword.
The server determines the first score of the keyword from its word frequency and inverse document frequency by a first preset algorithm.
The first preset algorithm may be set and changed as needed, and is not specifically limited in the embodiments of the present invention. For example, the first preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.
When the first preset algorithm is multiplication, this step may be:
The server multiplies the word frequency of the keyword by its inverse document frequency to obtain the first score of the keyword.
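As a non-limiting illustration, steps (3-1) to (3-5) can be sketched as follows for the case where the first preset algorithm is multiplication. The function name `first_scores`, the representation of the sample text information library as a list of keyword sets, the natural logarithm, and the add-one smoothing of the second quantity are all assumptions made for this sketch.

```python
import math
from collections import Counter


def first_scores(doc_keywords, sample_library, smooth=True):
    """Return {keyword: word frequency * inverse document frequency}.

    doc_keywords: keywords of one text, with repeats (hypothetical input form).
    sample_library: list of keyword sets, one per sample text (hypothetical).
    """
    counts = Counter(doc_keywords)      # first occurrence number per keyword
    total = sum(counts.values())        # second occurrence number
    d = len(sample_library)             # first quantity
    scores = {}
    for keyword, n in counts.items():
        tf = n / total                  # word frequency, step (3-2)
        # second quantity: sample texts that include the keyword
        j = sum(1 for sample in sample_library if keyword in sample)
        # Assumed smoothing: add 1 so a keyword absent from the library
        # (j == 0) does not cause a division by zero.
        idf = math.log(d / (j + 1)) if smooth else math.log(d / j)
        scores[keyword] = tf * idf      # multiplication as first preset algorithm
    return scores
```

With `smooth=False` the sketch raises `ZeroDivisionError` for a keyword that no sample text contains, which is the situation the smoothed variant is meant to avoid.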
(4): The server determines the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase.
In this step, the server may determine the score of the candidate phrase directly from the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase; this is the first implementation below. Alternatively, the server may set a first weight, a second weight, and a third weight for the sentence where the candidate phrase is located, the position of that sentence, and each keyword included in the candidate phrase, respectively, and determine the score of the candidate phrase from the three scores together with the first weight, the second weight, and the third weight; this is the second implementation below.
For the first implementation, this step may be realized through the following steps (4-1) to (4-2):
(4-1): For each keyword included in the candidate phrase, the server determines the second score of the keyword by a second preset algorithm, according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword.
The second preset algorithm may be set and changed as needed, and is not specifically limited in the embodiments of the present invention. For example, the second preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, or weighted division.
When the second preset algorithm is addition, this step may be:
The server adds the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword to obtain the second score of the keyword.
It should be noted that a single-character word contributes little to understanding an article. Therefore, when determining the second score of a keyword that is a single-character word, the server down-weights the keyword. A word in quotation marks that occurs multiple times in the text information contributes more to understanding the article. Therefore, when determining the second score of a keyword that occurs multiple times in quotation marks, the server up-weights the keyword. Accordingly, this step may be:
For each keyword included in the candidate phrase, the server determines the contribution degree of the keyword; determines, according to the contribution degree, a fourth weight corresponding to the keyword; obtains a fifth value by the second preset algorithm from the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of the keyword; and multiplies the fifth value by the fourth weight to obtain the second score of the keyword.
The server determines the contribution degree of the keyword according to the number of characters in the keyword and/or the importance of the keyword. The specific process may be:
The server uses the number of characters in the keyword as the contribution degree of the keyword; the more characters, the higher the contribution degree. Or,
the server uses the importance of the keyword as the contribution degree of the keyword; the higher the importance, the higher the contribution degree. Or,
the server weights the number of characters and the importance of the keyword to obtain the contribution degree of the keyword.
The importance of the keyword may be represented by the number of occurrences of the keyword and/or by whether the keyword is highlighted, and so on.
A correspondence between contribution degrees and weights is stored in the server. Accordingly, the step in which the server determines the fourth weight corresponding to the keyword according to the contribution degree may be:
The server obtains, according to the contribution degree, the fourth weight corresponding to the keyword from the correspondence between contribution degrees and weights.
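As a non-limiting illustration, step (4-1) with addition as the second preset algorithm and a contribution-based fourth weight can be sketched as follows. The function names and the concrete weight values 0.5, 1.0, and 1.5 are hypothetical; the description only states that single-character keywords are down-weighted and that keywords occurring multiple times in quotation marks are up-weighted.

```python
def fourth_weight(keyword, quoted_occurrences=0):
    """Look up a fourth weight for a keyword (hypothetical correspondence)."""
    if len(keyword) == 1:
        return 0.5          # assumed down-weighting of single-character words
    if quoted_occurrences > 1:
        return 1.5          # assumed up-weighting of repeatedly quoted words
    return 1.0              # assumed neutral weight otherwise


def second_score(sentence_score, position_score, first_score, weight=1.0):
    """Addition as the second preset algorithm, then the fourth weight."""
    fifth_value = sentence_score + position_score + first_score
    return fifth_value * weight
```

For example, `second_score(0.2, 0.04, 0.1, fourth_weight("a"))` halves the sum of the three scores because the keyword has a single character.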
(4-2): The server determines the score of the candidate phrase according to the second score of each keyword included in the candidate phrase.
The server determines the score of the candidate phrase from the second score of each keyword included in the candidate phrase by a third preset algorithm.
The third preset algorithm may be set and changed as needed, and is not specifically limited in the embodiments of the present invention. For example, the third preset algorithm may be multiplication, addition, subtraction, division, weighted multiplication, weighted division, or taking the maximum.
When the third preset algorithm is addition, this step may be:
The server adds the second scores of the keywords included in the candidate phrase to obtain the score of the candidate phrase.
When the third preset algorithm is taking the maximum, this step may be:
The server selects the maximum among the second scores of the keywords included in the candidate phrase and uses it as the score of the candidate phrase.
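As a non-limiting illustration, the two variants of the third preset algorithm named above can be sketched as follows. The function name `phrase_score` and the `mode` parameter are hypothetical.

```python
def phrase_score(second_scores, mode="sum"):
    """Combine the second scores of a phrase's keywords into one phrase score."""
    if not second_scores:
        raise ValueError("a candidate phrase includes at least one keyword")
    if mode == "sum":       # third preset algorithm is addition
        return sum(second_scores)
    if mode == "max":       # third preset algorithm is taking the maximum
        return max(second_scores)
    raise ValueError(f"unsupported mode: {mode}")
```

Addition favours phrases with more keywords, while taking the maximum makes the phrase score independent of phrase length; which one is preferable depends on the application.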
For the second implementation, this step may be realized through the following steps (4-a) to (4-c):
(4-a): The server determines the first weight corresponding to the sentence where the candidate phrase is located, the second weight corresponding to the position of that sentence, and the third weight corresponding to each keyword included in the candidate phrase.
The first weight corresponding to the sentence where the candidate phrase is located and the second weight corresponding to the position of that sentence are stored in the server in advance. In this step, the server obtains the stored first weight and the stored second weight.
A correspondence between keywords and third weights is stored in the server. Accordingly, the step in which the server obtains the third weight corresponding to each keyword included in the candidate phrase may be:
The server obtains, according to each keyword included in the candidate phrase, the third weight corresponding to that keyword from the correspondence between keywords and third weights.
It should be noted that the third weights corresponding to different keywords may be the same or different. For example, the first weight corresponding to the sentence where the candidate phrase is located is a1=0.1, the second weight corresponding to the position of that sentence is a2=0.55, and the third weight corresponding to each keyword included in the candidate phrase is a3=0.35.
(4-b): For each keyword included in the candidate phrase, the server multiplies the score of the sentence where the candidate phrase is located by the first weight to obtain a first value, multiplies the score of the position of that sentence by the second weight to obtain a second value, multiplies the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and adds the first value, the second value, and the third value to obtain the second score of the keyword.
Similarly, a single-character word contributes little to understanding an article. Therefore, when determining the second score of a keyword that is a single-character word, the server down-weights the keyword. A word in quotation marks that occurs multiple times in the text information contributes more to understanding the article. Therefore, when determining the second score of a keyword that occurs multiple times in quotation marks, the server up-weights the keyword. Accordingly, the step in which the server adds the first value, the second value, and the third value to obtain the second score of the keyword may be:
The server determines the contribution degree of the keyword; determines, according to the contribution degree, the fourth weight corresponding to the keyword; adds the first value, the second value, and the third value to obtain a fourth value; and multiplies the fourth value by the fourth weight to obtain the second score of the keyword.
(4-c): The server determines the score of the candidate phrase according to the second score of each keyword.
This step is the same as step (4-2) and is not repeated here.
Step 203: Based on the score of each candidate phrase, the server selects a preset number of candidate phrases with the highest scores from the candidate phrase set.
Based on the score of each candidate phrase, the server sorts the candidate phrases in descending order of score and outputs the preset number of candidate phrases ranked first.
The preset number may be set and changed as needed, and is not specifically limited in the embodiments of the present invention. For example, the preset number may be 8 or 10.
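As a non-limiting illustration, step 203 can be sketched as follows. The function name `select_top` and the representation of the scored candidate phrase set as a dictionary from phrase to score are assumptions made for this sketch.

```python
def select_top(scored_phrases, preset_number=8):
    """Return the preset number of highest-scoring candidate phrases."""
    ranked = sorted(scored_phrases.items(),
                    key=lambda item: item[1],   # sort by score
                    reverse=True)               # descending order
    return [phrase for phrase, _ in ranked[:preset_number]]
```

If the set holds fewer phrases than the preset number, the sketch simply returns all of them in descending order of score.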
In the embodiments of the present invention, a candidate phrase may include keywords that carry no meaning, such as auxiliary words, prepositions, modal particles, and numerals. Therefore, after selecting the preset number of candidate phrases, the server filters out the keywords of a second preset part of speech from those candidate phrases.
The second preset part of speech may be the same as or different from the first preset part of speech, and is not specifically limited in the embodiments of the present invention. For example, a keyword of the second preset part of speech may be an auxiliary word, a preposition, a modal particle, or a numeral.
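As a non-limiting illustration, the filtering described above can be sketched as follows. The representation of a candidate phrase as a list of (keyword, part-of-speech tag) pairs, the tag codes "u", "p", "y", and "m", and the choice to discard a phrase that becomes empty are all assumptions made for this sketch.

```python
# Hypothetical tag codes: auxiliary word, preposition, modal particle, numeral.
SECOND_PRESET_POS = {"u", "p", "y", "m"}


def filter_keywords(phrases, drop_pos=SECOND_PRESET_POS):
    """Remove keywords of the second preset part of speech from each phrase."""
    cleaned = []
    for phrase in phrases:
        kept = [(word, pos) for word, pos in phrase if pos not in drop_pos]
        if kept:                        # assumed: drop phrases left empty
            cleaned.append(kept)
    return cleaned
```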
In the embodiments of the present invention, the server may provide concept label information and event label information. The concept label information includes the core concept phrases of the text information, and the event label information includes the core events in the text information. After performing step 203, the server generates the concept label information through the following step 204 and generates the event label information through the following step 205.
Step 204: The server selects candidate phrases of the concept type from the preset number of candidate phrases to form the concept label information.
A candidate phrase of the concept type is a candidate phrase that includes a noun.
Step 205: The server selects candidate phrases of the event type from the preset number of candidate phrases to form the event label information.
A candidate phrase of the event type is a candidate phrase that includes a verb.
It should be noted that steps 204 and 205 have no fixed order: step 204 may be performed before step 205, or step 205 may be performed before step 204.
Because the server may make mistakes when classifying phrases into concept label information and event label information, the server may also perform phrase correction. The specific process may be:
The server moves candidate phrases in the concept label information that end with a keyword of a third preset part of speech into the event label information; and/or moves candidate phrases in the event label information that do not include a keyword of the third preset part of speech into the concept label information.
The third preset part of speech may be set and changed as needed, and is not specifically limited in the embodiments of the present invention. For example, the third preset part of speech may be a verb.
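As a non-limiting illustration, the phrase correction described above can be sketched as follows, with a verb as the third preset part of speech. The function name, the representation of a candidate phrase as a non-empty list of (keyword, part-of-speech tag) pairs, and the tag codes "n" and "v" are assumptions made for this sketch.

```python
def correct_labels(concept, event, third_pos="v"):
    """Move misclassified phrases between concept and event label lists."""
    def ends_with_pos(phrase):
        return phrase[-1][1] == third_pos

    def contains_pos(phrase):
        return any(pos == third_pos for _, pos in phrase)

    # Phrases that stay where they are.
    new_concept = [p for p in concept if not ends_with_pos(p)]
    new_event = [p for p in event if contains_pos(p)]
    # Concept phrases ending with the preset part of speech go to events.
    new_event += [p for p in concept
                  if ends_with_pos(p) and p not in new_event]
    # Event phrases lacking the preset part of speech go to concepts.
    new_concept += [p for p in event
                    if not contains_pos(p) and p not in new_concept]
    return new_concept, new_event
```

The membership checks avoid listing the same phrase twice when it already appears in the target list.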
After the server extracts the label information, the server stores a correspondence between the identifier of the text information and the label information of the text information. A terminal may obtain the label information from the server. The specific process may be:
The terminal sends an acquisition request to the server, and the acquisition request carries the identifier of the text information whose label information is to be obtained. The server receives the acquisition request sent by the terminal, obtains the label information of the text information from the correspondence between identifiers and label information according to the identifier of the text information, and sends the label information of the text information to the terminal. The terminal receives the label information of the text information sent by the server and displays it, so that the user can quickly understand the main idea of the text information based on its label information. The identifier of the text information may be the title, URL, storage path, or number of the text information, or the like.
The label information extracted by LDA (Latent Dirichlet Allocation, a document topic generation model) in existing methods consists of single-word labels, whereas the label information extracted based on a syntax tree in the embodiments of the present invention is multi-word label information, and both concept label information and event label information are extracted.
In the embodiments of the present invention, word segmentation is performed on the text information to obtain a candidate phrase set that includes at least one candidate phrase, and the label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-word label information can be extracted. In addition, for each candidate phrase in the candidate phrase set, the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined according to these three kinds of scores. Because the sentence score, the position score, and the keyword score are combined, the accuracy of the candidate phrase score is improved, which in turn improves the accuracy of the extracted label information.
The present invention provides a device for extracting label information. The device is applied in a server and is used to perform the steps performed by the server in the above method for extracting label information. Referring to Fig. 3, the device includes:
a word segmentation module 301, configured to perform word segmentation on text information to obtain a candidate phrase set, where the candidate phrase set includes at least one candidate phrase and each candidate phrase includes at least one keyword;
a scoring module 302, configured to, for each candidate phrase in the candidate phrase set, determine the score of the sentence where the candidate phrase is located, determine the score of the position of that sentence, determine the first score of each keyword included in the candidate phrase, and determine the score of the candidate phrase according to the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase;
a selection module 303, configured to select, based on the score of each candidate phrase, a preset number of candidate phrases with the highest scores from the candidate phrase set;
a composition module 304, configured to form the label information of the text information from the preset number of candidate phrases.
In a possible implementation, the scoring module 302 is further configured to: for each keyword, determine a first occurrence number and a second occurrence number, where the first occurrence number is the number of times the keyword occurs in the text information and the second occurrence number is the total number of occurrences of all keywords included in the text information; determine the word frequency of the keyword according to the first occurrence number and the second occurrence number; determine a first quantity and a second quantity, where the first quantity is the number of pieces of sample text information included in a sample text information library and the second quantity is the number of pieces of text information in the sample text information library that include the keyword; determine the inverse document frequency of the keyword according to the first quantity and the second quantity; and determine the first score of the keyword according to the word frequency and the inverse document frequency.
In a possible implementation, the scoring module 302 is further configured to: determine a first position, in the text information, of the paragraph containing the sentence where the candidate phrase is located, and a second position of the candidate phrase in that paragraph; and determine the score of the position of the sentence where the candidate phrase is located according to the first position and the second position.
In a possible implementation, the scoring module 302 is further configured to: determine a first weight corresponding to the sentence where the candidate phrase is located, a second weight corresponding to the position of that sentence, and a third weight corresponding to each keyword included in the candidate phrase; for each keyword included in the candidate phrase, multiply the score of the sentence where the candidate phrase is located by the first weight to obtain a first value, multiply the score of the position of that sentence by the second weight to obtain a second value, multiply the first score of the keyword by the third weight corresponding to the keyword to obtain a third value, and add the first value, the second value, and the third value to obtain the second score of the keyword; and determine the score of the candidate phrase according to the second score of each keyword.
In a possible implementation, the scoring module 302 is further configured to: determine the contribution degree of the keyword; determine, according to the contribution degree of the keyword, a fourth weight corresponding to the keyword; add the first value, the second value, and the third value to obtain a fourth value; and multiply the fourth value by the fourth weight to obtain the second score of the keyword.
In a possible implementation, the composition module 304 is further configured to: select candidate phrases of the concept type from the preset number of candidate phrases to form concept label information; and select candidate phrases of the event type from the preset number of candidate phrases to form event label information.
In a possible implementation, the device further includes:
a moving module, configured to move candidate phrases in the concept label information that end with a keyword of a preset part of speech into the event label information; and/or
the moving module is further configured to move candidate phrases in the event label information that do not include a keyword of the preset part of speech into the concept label information.
In a possible implementation, the word segmentation module 301 is further configured to: split the text information into sentences to obtain at least one candidate sentence, and form a candidate sentence set from the at least one candidate sentence; perform word segmentation on each candidate sentence in the candidate sentence set to obtain at least one keyword, and form a keyword set from the at least one keyword; generate at least one candidate phrase from the keywords in the keyword set based on a syntax tree algorithm; and form the candidate phrase set from the at least one candidate phrase.
In a possible implementation, the word segmentation module 301 is further configured to: determine the sentence constituent of each candidate sentence in the candidate sentence set; and delete, according to the sentence constituent of each candidate sentence, the candidate sentences in the candidate sentence set whose sentence constituent is a preset constituent.
In the embodiments of the present invention, word segmentation is performed on the text information to obtain a candidate phrase set that includes at least one candidate phrase, and the label information is extracted based on the candidate phrases in the candidate phrase set, so that multi-word label information can be extracted. In addition, for each candidate phrase in the candidate phrase set, the score of the sentence where the candidate phrase is located, the score of the position of that sentence, and the first score of each keyword included in the candidate phrase are determined, and the score of the candidate phrase is determined according to these three kinds of scores. Because the sentence score, the position score, and the keyword score are combined, the accuracy of the candidate phrase score is improved, which in turn improves the accuracy of the extracted label information.
It should be noted that when the device for extracting label information provided in the above embodiment extracts label information, the division into the functional modules described above is used only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device for extracting label information provided in the above embodiment and the method embodiment for extracting label information belong to the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
Fig. 4 shows a server for extracting label information according to an exemplary embodiment. Referring to Fig. 4, the server 400 includes a processing component 422, which further includes one or more processors, and memory resources represented by a memory 432 for storing instructions executable by the processing component 422, such as application programs. The application programs stored in the memory 432 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 422 is configured to execute the instructions so as to perform the functions performed by the server in the above method for extracting label information.
The server 400 may further include a power supply component 426 configured to perform power management of the server 400, a wired or wireless network interface 450 configured to connect the server 400 to a network, and an input/output (I/O) interface 458. The server 400 may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The embodiments of the present invention further provide a computer-readable storage medium. The computer-readable storage medium may be the computer-readable storage medium included in the memory in the above embodiment, or it may exist separately without being assembled into a server. The computer-readable storage medium stores one or more programs, and the one or more programs are used by one or more processors to perform the method for extracting label information.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the related hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.