Multi-label text classification method and system
Technical field
The present invention relates to the field of text classification, and in particular to a multi-label text classification method and system.
Background art
In recent years, with the rapid development of the Internet and in particular the arrival of the mobile Internet era, mankind has entered the age of big data: massive volumes of data are produced every day, so analyzing such mass data to extract valuable information has become a common focus of academia and industry. As the main outward form of mass data, text and its related processing technologies have attracted great attention, and text classification technology has entered a new stage of development.
Traditional text classification methods mainly address single-label classification, i.e., one class label per text. In real life, however, a text often has more than one class label. For example, an article introducing the economy of a country will probably also touch on its politics and culture, and should then carry at least the three labels of economy, politics and culture; a report on a sports match may devote considerable space to a particular sports star, and should then carry at least the two labels of match report and sports star. Multi-label classification thus provides richer classification information and is of great help to subsequent applications such as text classification management, monitoring and filtering. The multi-label classification problem, namely assigning multiple labels to one document, is therefore of practical significance.
In principle, the optimal multi-label classification method models every label subset separately, determining a probability model for each subset; for a document to be classified, the posterior probability of the document with respect to every label subset is then computed in turn, and the subset with the maximum posterior probability is selected as the classification result. However, the number of label subsets grows exponentially with the number of labels: if a multi-label classification problem involves K labels, the number of label subsets is in theory 2^K − 1. When K is large the number of subsets is enormous, and building a separate model for every subset is impractical.
For these reasons, traditional multi-label classification mainly adopts the naive Bayes classifier, which assumes that the labels are independently distributed and that each label occurs with equal probability. Each label corresponds to a word distribution model, which can be regarded as a probability model. On this basis, the existing multi-label classification method proceeds as follows:
Step one: train two probability models for each label, namely a model for documents that carry the label and a model for documents that do not. Taking the model for documents carrying the label as an example, the training procedure is:
Step 1: collect a large amount of text data carrying the label as training data.
Step 2: count the occurrence probability of each word in the training data.
Step 3: take the set of word occurrence probabilities as the probability model of the label, to be used when subsequently computing the document generation probability of the label.
Accordingly, the model for documents not carrying the label is trained on collected text data that does not carry the label.
Step two: obtain a document C to be classified.
Step three: judge in turn whether document C carries each label in the label set. Whether document C carries a label X is judged as follows:
Step 1: segment document C into words, obtaining a word sequence Cx.
Step 2: compute the likelihood of the word sequence Cx under the model carrying label X. Under the naive Bayes assumption, this likelihood equals the product over the words in Cx of the word generation probabilities of the model carrying label X.
Step 3: compute the likelihood of Cx under the model not carrying label X.
Step 4: compute the likelihood ratio between the likelihood of Cx under the model carrying label X and its likelihood under the model not carrying label X.
Step 5: if the likelihood ratio is greater than 1, the document is considered to carry label X; otherwise it is considered not to carry it.
Step four: collect the judgments to obtain the multi-label classification result of the document.
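The per-label likelihood-ratio decision described above can be sketched as follows. This is a minimal illustration with toy data: the function names, the add-one smoothing, and the comparison in log space (equivalent to comparing the likelihood ratio with 1) are assumptions of the sketch, not part of the original description.

```python
import math
from collections import Counter

def train_label_model(docs):
    """Estimate word occurrence probabilities from tokenized training docs
    (add-one smoothing so unseen words do not zero out the product)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def has_label(tokens, p_with, p_without):
    """Assign label X when P(tokens | X) / P(tokens | not-X) > 1,
    computed in log space for numerical stability."""
    log_ratio = sum(math.log(p_with(w)) - math.log(p_without(w)) for w in tokens)
    return log_ratio > 0

# toy data: documents carrying vs. not carrying a 'sports' label
with_docs = [["match", "score", "team"], ["team", "score", "win"]]
without_docs = [["market", "price", "stock"], ["price", "economy"]]
p_with = train_label_model(with_docs)
p_without = train_label_model(without_docs)
print(has_label(["team", "score"], p_with, p_without))  # True
```

The same judgment is repeated for every label in the label set, which is exactly where the independence assumption criticized below comes in: each label is decided in isolation.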
In the above naive-Bayes-based classification method, the document probability distributions of the labels are artificially assumed to be mutually independent. This independence assumption makes the method simple, easy to implement and efficient when solving the multi-label classification problem. In practical applications, however, the assumption does not hold: an article introducing the economy of a country will probably also involve its politics and culture, so there is clearly probabilistic correlation among the three labels of economy, politics and culture. Simply assuming that the labels are mutually independent is therefore unreasonable, and for this very reason the above multi-label classification method usually fails to achieve the desired classification performance.
Another method that can classify multi-label text accurately builds a separate probability model for every determined label subset, then computes, for the document to be classified, its posterior probability with respect to every subset in turn, and finally selects the subset with the maximum posterior probability as the classification result. Because the number of label subsets grows exponentially with the number of labels, building a probability model for every subset is impractical, so this classification method has not been widely applied.
Summary of the invention
In view of the problems of existing multi-label text classification methods, embodiments of the present invention provide a practicable multi-label text classification method and system that take the correlation among the labels within a label subset into account before classification.
To achieve the above object, the technical solution adopted by the present invention is a multi-label text classification method, comprising:
determining candidate label subsets for a received document to be classified;
performing word segmentation on the document to be classified to obtain its words;
extracting label subsets one by one from the candidate label subsets as the current label subset;
extracting words one by one from the obtained words as the current word;
obtaining the likelihood of the current word with respect to each label in the current label subset;
linearly weighting the likelihoods of the current word with respect to the labels in the current label subset to obtain the weighted likelihood of the current word with respect to the current label subset;
determining the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized, and taking the maximum product as the likelihood of the document with respect to the current label subset, wherein the weighting coefficients, shared by all words, correspond one-to-one to the labels in the current label subset and sum to 1;
calculating, from the likelihood of the document with respect to the current label subset, the posterior probability of the document with respect to the current label subset; and
selecting, among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document.
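The overall classification flow — score every candidate label subset by its prior times the document likelihood under linearly weighted per-label word models, then keep the argmax — can be sketched as below. The toy word models, the fixed per-subset weighting coefficients, and the small floor value for unseen words are illustrative assumptions; in the actual method the coefficients are optimized per document rather than given.

```python
import math

def doc_log_likelihood(tokens, subset, word_lik, weights):
    """Log of the product over words of the weighted likelihood
    sum_k weights[k] * P(w | C_k), per the linear-weighting scheme."""
    total = 0.0
    for w in tokens:
        mix = sum(weights[c] * word_lik[c].get(w, 1e-6) for c in subset)
        total += math.log(mix)
    return total

def classify(tokens, subsets, word_lik, priors, subset_weights):
    """Pick the candidate label subset with the highest log posterior:
    log prior + log likelihood of the document under that subset."""
    best, best_score = None, float("-inf")
    for subset in subsets:
        score = math.log(priors[subset]) + doc_log_likelihood(
            tokens, subset, word_lik, subset_weights[subset])
        if score > best_score:
            best, best_score = subset, score
    return best

# toy models (hypothetical numbers, for illustration only)
word_lik = {"econ": {"market": 0.4, "growth": 0.3},
            "sport": {"team": 0.5, "score": 0.3}}
subsets = [("econ",), ("sport",), ("econ", "sport")]
priors = {("econ",): 0.4, ("sport",): 0.4, ("econ", "sport"): 0.2}
subset_weights = {("econ",): {"econ": 1.0},
                  ("sport",): {"sport": 1.0},
                  ("econ", "sport"): {"econ": 0.5, "sport": 0.5}}
print(classify(["market", "growth"], subsets, word_lik, priors, subset_weights))  # ('econ',)
```

Working in log space avoids underflow when the product runs over many words; the argmax is unchanged because the logarithm is monotonic.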
Preferably, the determining candidate label subsets for the received document to be classified comprises:
obtaining the label set containing all labels;
taking the label set itself as the current label subset, and from the weighting coefficients determined for it, selecting the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new label set; and
combining the labels in the new label set to obtain the candidate label subsets.
Preferably, the determining candidate label subsets for the received document to be classified further comprises:
when all the weighting coefficients determined with the label set as the current label subset are less than the predetermined threshold, selecting a predetermined number of labels with the largest weighting coefficients to form the new label set.
Preferably, the determining the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized comprises:
using the EM (expectation-maximization) algorithm to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized.
Preferably, the calculating the posterior probability of the document with respect to the current label subset from its likelihood comprises:
calculating the prior probability of the current label subset; and
calculating the product of the prior probability of the current label subset and the likelihood of the document with respect to the current label subset as the posterior probability of the document with respect to the current label subset.
Preferably, the calculating the prior probability of the current label subset comprises:
obtaining all training documents;
obtaining the labels involved in all training documents to compose a training label set;
ordering the labels in the training label set;
reordering the multi-label annotations of all training documents so that the order of the labels within each annotation is consistent with the order of the corresponding labels in the training label set;
training a discrete Markov chain on the reordered multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the above order, to the labels in the training label set; and
calculating the prior probability of the current label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
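Under the stated assumptions, the subset-prior computation can be sketched as follows. The chain's probabilities here are made-up numbers, and treating the first label's probability as a start probability is an assumption of the sketch; the method itself only specifies that the prior equals the product of the transition probabilities between the corresponding states of the trained chain.

```python
def subset_prior(subset, order, start, trans):
    """Prior of a label subset: probability of its first label (in the fixed
    label order) times the transition probabilities along the Markov chain."""
    labels = sorted(subset, key=order.index)   # impose the training order
    p = start[labels[0]]
    for prev, cur in zip(labels, labels[1:]):
        p *= trans[(prev, cur)]
    return p

# hypothetical chain over four ordered labels
order = ["economy", "politics", "culture", "sports"]
start = {"economy": 0.4, "politics": 0.3, "culture": 0.2, "sports": 0.1}
trans = {("economy", "politics"): 0.3, ("politics", "culture"): 0.5,
         ("economy", "culture"): 0.2}
print(subset_prior({"economy", "politics"}, order, start, trans))  # 0.4 * 0.3, i.e. about 0.12
```

Reordering every annotation to the fixed label order (the step above) is what makes the chain well defined: each subset then maps to exactly one state path.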
Preferably, the obtaining the likelihood of the current word with respect to each label in the current label subset comprises:
obtaining the training documents whose multi-label annotation is the current label subset;
linearly weighting the likelihoods of the words in the training documents with respect to the labels in the current label subset to obtain the weighted likelihood of each word in the training documents with respect to the current label subset;
training, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, the likelihoods of the words in the training documents with respect to each label in the current label subset; and
obtaining, from the likelihoods of the words in the training documents with respect to the labels in the current label subset, the likelihood of the current word with respect to each label in the current label subset.
Preferably, the training of the likelihoods of the words in the training documents with respect to each label in the current label subset, with the objective of maximizing the product over all words of their weighted likelihoods with respect to the current label subset, comprises:
determining initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset;
determining initial values of the weighting coefficients of the linear weighting applied to those likelihoods; and
training, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, and starting from the initial likelihoods and initial weighting coefficients, the likelihoods of the words in the training documents with respect to each label in the current label subset by means of the EM algorithm.
Preferably, the determining initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset comprises:
obtaining the word distribution model of each label in the current label subset; and
calculating, according to the word distribution model of each label in the current label subset, the likelihoods of the words in the training documents with respect to the labels as the initial values of the corresponding likelihoods.
Preferably, the determining initial values of the weighting coefficients comprises:
setting each initial weighting coefficient to the reciprocal of the number of labels in the current label subset.
To achieve the above object, the present invention also adopts the following technical solution: a multi-label text classification system, comprising:
a candidate label subset determination module, configured to determine candidate label subsets for a received document to be classified;
a word segmentation module, configured to perform word segmentation on the document to be classified to obtain its words;
a current label subset extraction module, configured to extract label subsets one by one from the candidate label subsets as the current label subset;
a current word extraction module, configured to extract words one by one from the obtained words as the current word;
a word likelihood acquisition module, configured to obtain the likelihood of the current word with respect to each label in the current label subset;
a weighted likelihood computation module, configured to linearly weight the likelihoods of the current word with respect to the labels in the current label subset to obtain the weighted likelihood of the current word with respect to the current label subset;
a document likelihood computation module, configured to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized, and to take the maximum product as the likelihood of the document with respect to the current label subset, wherein the weighting coefficients, shared by all words, correspond one-to-one to the labels in the current label subset and sum to 1;
a posterior probability computation module, configured to calculate, from the likelihood of the document with respect to the current label subset, the posterior probability of the document with respect to the current label subset; and
a classification result output module, configured to select, among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document.
Preferably, the candidate label subset determination module comprises:
a label set acquisition unit, configured to obtain the label set containing all labels as the current label subset;
a label selection unit, configured to select, from the weighting coefficients determined with the label set as the current label subset, the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new label set; and
a candidate label subset output unit, configured to combine the labels in the new label set to obtain the candidate label subsets.
Preferably, the candidate label subset output unit is further configured to, when all the weighting coefficients determined with the label set as the current label subset are less than the predetermined threshold, select a predetermined number of labels with the largest weighting coefficients to form the new label set.
Preferably, the document likelihood computation module is further configured to use the EM algorithm to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized.
Preferably, the posterior probability computation module comprises:
a prior probability computation unit, configured to calculate the prior probability of the current label subset; and
a posterior probability computation unit, configured to calculate the product of the prior probability of the current label subset and the likelihood of the document with respect to the current label subset as the posterior probability of the document with respect to the current label subset.
Preferably, the prior probability computation unit comprises:
a training document acquisition subunit, configured to obtain all training documents;
a training label set acquisition subunit, configured to obtain the labels involved in all training documents and compose a training label set;
an ordering subunit, configured to order the labels in the training label set;
a reordering subunit, configured to reorder the multi-label annotations of all training documents so that the order of the labels within each annotation is consistent with the order of the corresponding labels in the training label set;
a Markov chain training subunit, configured to train a discrete Markov chain on the reordered multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the above order, to the labels in the training label set; and
a prior probability computation subunit, configured to calculate the prior probability of the current label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
Preferably, the word likelihood acquisition module comprises:
a training document acquisition subunit, configured to obtain the training documents whose multi-label annotation is the current label subset;
a training weighted likelihood acquisition unit, configured to linearly weight the likelihoods of the words in the training documents with respect to the labels in the current label subset to obtain the weighted likelihoods of the words in the training documents with respect to the current label subset;
a training parameter determination unit, configured to train, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, the likelihoods of the words in the training documents with respect to each label in the current label subset; and
a word likelihood acquisition unit, configured to obtain, from the likelihoods of the words in the training documents with respect to the labels in the current label subset, the likelihood of the current word with respect to each label in the current label subset.
Preferably, the training parameter determination unit comprises:
a likelihood initial value determination subunit, configured to determine initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset;
a weighting coefficient initial value determination subunit, configured to determine initial values of the weighting coefficients of the linear weighting applied to those likelihoods; and
a training parameter determination subunit, configured to train, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, and starting from the initial likelihoods and initial weighting coefficients, the likelihoods of the words in the training documents with respect to each label in the current label subset by means of the EM algorithm.
Preferably, the likelihood initial value determination subunit is further configured to:
obtain the word distribution model of each label in the current label subset; and
calculate, according to the word distribution model of each label in the current label subset, the likelihoods of the words in the training documents with respect to the labels as the initial values of the corresponding likelihoods.
Preferably, the weighting coefficient initial value determination subunit is further configured to set each initial weighting coefficient to the reciprocal of the number of labels in the current label subset.
The beneficial effects of the present invention are as follows. The multi-label text classification method and system of the present invention take the correlation among the labels within a label subset into account by linearly weighting the likelihoods of each word in the document to be classified with respect to the labels in the subset, and reasonably optimize the correlation embodied by the weighting coefficients by maximizing the product over all words of their weighted likelihoods. Compared with existing multi-label classification methods and systems, the method and system of the present invention therefore combine high overall performance with a small amount of computation.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the multi-label text classification method of the present invention;
Fig. 2 is a flowchart of an embodiment of calculating the prior probability of the current label subset in the multi-label classification method of the present invention;
Fig. 3 shows the Markov chain determined by the method of Fig. 2;
Fig. 4 shows the Markov state transition probabilities obtained by correcting the Markov chain of Fig. 2;
Fig. 5 is a block diagram of an embodiment of the multi-label classification system of the present invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, with examples shown in the drawings, where throughout the same or similar reference numerals denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
As shown in Fig. 1, an embodiment of the multi-label text classification method of the present invention can comprise the following steps:
Step S1: determine candidate label subsets for the received document d_ob to be classified. Here, if the label set contains 20 labels, the most basic way of determining the candidate label subsets is to take every combination of the 20 labels.
Step S2: perform word segmentation on the document d_ob to be classified, obtaining its words.
Step S3: set i = 1.
Step S4: extract label subset CS_i from the candidate label subsets as the current label subset.
Step S5: set j = 1.
Step S6: extract word W_j from the words as the current word.
Step S7: obtain the likelihood P(W_j | C_k) of the current word W_j with respect to label C_k in the current label subset CS_i, where k ranges over the integers from 1 to the total number of labels in CS_i, and the value of P(W_j | C_k) is obtained in advance through the model training process.
Step S8: linearly weight the likelihoods of the current word W_j with respect to the labels in the current label subset CS_i to obtain the weighted likelihood of W_j with respect to CS_i, namely Σ_k P(C_k | d_ob) · P(W_j | C_k), where P(C_k | d_ob) is the weighting coefficient corresponding to label C_k for the document d_ob to be classified. It should be understood that the weighting coefficients, shared by all words, correspond one-to-one to the labels in CS_i — that is, different words use the same weighting coefficient for the likelihood of the same label in CS_i — and that the linear weighting must satisfy the constraint that the weighting coefficients sum to 1.
Step S9: judge whether j equals the total number jmax of words obtained by word segmentation; if so, go to step S11, otherwise go to step S10.
Step S10: set j = j + 1, then return to step S6.
Step S11: determine the weighting coefficients P(C_k | d_ob) of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset CS_i is maximized, and take the maximum product as the likelihood P(d_ob | CS_i) of the document d_ob with respect to CS_i, namely P(d_ob | CS_i) = max over the coefficients P(C_k | d_ob) of Π_j Σ_k P(C_k | d_ob) · P(W_j | C_k).
Step S12: calculate, from the likelihood P(d_ob | CS_i) of the document d_ob with respect to the current label subset CS_i, the posterior probability P(CS_i | d_ob) of the document with respect to CS_i. By Bayes' theorem, P(CS_i | d_ob) = P(d_ob | CS_i) · P(CS_i) / P(d_ob), where P(d_ob) and P(CS_i) are respectively the prior probability of the document to be classified and the prior probability of the current label subset CS_i.
Step S13: judge whether i equals the total number imax of candidate label subsets; if so, go to step S15, otherwise go to step S14.
Step S14: set i = i + 1, then return to step S4.
Step S15: select, among the candidate label subsets, the subset maximizing the posterior probability P(CS_i | d_ob) as the classification result of the document d_ob.
By adopting the above method, the multi-label text classification method of the present invention takes the correlation among the labels within a candidate label subset into account by linearly weighting the likelihoods of each word W_j in the document d_ob with respect to the labels in the subset, and reasonably optimizes the correlation embodied by the weighting coefficients by maximizing the product over all words of the weighted likelihoods, so that the method and system of the present invention achieve higher overall performance than existing multi-label classification methods and systems. Compared with the multi-label classification method that builds a separate probability model for every label subset, computes the posterior probability of the document d_ob with respect to every subset in turn, and finally selects the subset with the maximum posterior probability, the present method also greatly reduces the amount of computation: although that method can obtain the most accurate classification result, the number of label subsets grows exponentially with the number of labels, so building a probability model for every subset is impractical, and that method has not been widely applied.
In order to further reduce the amount of computation of the multi-label classification method of the present invention, determining the candidate label subsets for the document to be classified d_ob received in step S1 can further comprise the following steps:
Step S101: obtain the tag set containing all labels.
Step S102: taking the whole tag set as the current examination label subset (e.g. as CS_imax), determine all the weighting coefficients according to steps S4 to S11, and select from among them the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new tag set. The predetermined threshold can be chosen according to the specific multi-label classification task and the required classification accuracy; the smaller the threshold, the higher the classification accuracy.
Step S103: combine the labels of the new tag set to obtain the candidate label subsets; if the new tag set has m labels, 2^m - 1 candidate label subsets are obtained.
If all the weighting coefficients determined with the whole tag set as the current examination subset are below the predetermined threshold, the predetermined number of labels with the largest weighting coefficients can be selected to form the new tag set; for example, if the predetermined number is five, the weighting coefficients are arranged in descending order and the labels corresponding to the first five coefficients form the new tag set.
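Steps S101 through S103, together with the fallback just described, can be sketched as follows. This is an illustrative Python sketch: the `weights` mapping stands for the weighting coefficients obtained by running steps S4 to S11 with the full tag set as the current examination subset, and the function name is hypothetical.

```python
from itertools import combinations

def candidate_subsets(labels, weights, threshold, fallback_n=5):
    """Steps S101-S103: keep the labels whose weighting coefficient
    reaches the threshold (falling back to the fallback_n largest when
    none does), then enumerate all 2^m - 1 non-empty subsets."""
    kept = [l for l in labels if weights[l] >= threshold]
    if not kept:  # all coefficients below the threshold: take the largest ones
        kept = sorted(labels, key=lambda l: weights[l], reverse=True)[:fallback_n]
    subsets = []
    for r in range(1, len(kept) + 1):
        subsets.extend(frozenset(c) for c in combinations(kept, r))
    return subsets
```

Note the trade-off the specification describes: a smaller threshold keeps more labels, so m grows and so does the 2^m - 1 enumeration, in exchange for higher classification accuracy.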
In step S11, the expectation-maximization (EM) algorithm can be used to determine the weighting coefficients P(C_k | d_ob) of the linear weighting, so that the continued product of the weighted likelihoods of all words with respect to the current examination label subset is maximized. The calculation steps of the EM algorithm are as follows:
(1) First traverse all words of the document to be classified d_ob and compute in turn, for each word, the combined conditional probability P(C_k | d_ob, w_j) of label C_k in the current examination label subset CS_i:

P(C_k | d_ob, w_j) = P(C_k | d_ob) · P(w_j | C_k) / Σ_k' P(C_k' | d_ob) · P(w_j | C_k')

wherein C_k' is a label in the current examination label subset CS_i, and k', like k, is an integer ranging from 1 to the total number of labels in CS_i.
(2) Update the weighting coefficients P(C_k | d_ob) according to the following formula:

P(C_k | d_ob) = Σ_j n(d_ob, w_j) · P(C_k | d_ob, w_j) / Σ_j n(d_ob, w_j)

wherein n(d_ob, w_j) represents the number of occurrences of word w_j in the document to be classified d_ob.
(3) Iterate the formulas of (1) and (2) to update the weighting coefficients P(C_k | d_ob) until the iteration stopping condition is met. Before iterating, initial values must be set for the weighting coefficients P(C_k | d_ob); any initial values satisfying the constraint that the weighting coefficients sum to 1 will do, for example equal initial values of 1/|CS_i|. Under normal circumstances, the stopping condition can be that the number of iterations has reached a maximum, or that the growth rate of the likelihood P(d_ob | CS_i) relative to the previous iteration has fallen below a set ratio of, for example, 2% to 5%. According to the EM algorithm, as the iteration proceeds, the likelihood P(d_ob | CS_i) of the document to be classified with respect to the current examination label subset CS_i increases gradually; beyond a certain point its growth slows down until it essentially stops changing, at which stage further iterations serve no practical purpose. A person skilled in the art can therefore preset a suitable maximum number of iterations, or set the stopping condition in terms of the growth rate, according to the specific multi-label classification task.
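The EM iteration (1)-(3) above can be sketched in Python as follows, assuming the per-label likelihoods P(w_j | C_k) are supplied as a dict keyed by (word, label); the function name and the tolerance-based stopping test are illustrative choices, not from the specification.

```python
import math

def em_weights(word_counts, likelihood, labels, max_iter=100, tol=1e-6):
    """EM iteration for the weighting coefficients P(C_k | d_ob).

    word_counts: dict word -> n(d_ob, w_j); likelihood: dict
    (word, label) -> P(w_j | C_k), assumed given by the training stage."""
    K = len(labels)
    w = {c: 1.0 / K for c in labels}           # initial values sum to 1
    total = sum(word_counts.values())
    prev_ll = None
    for _ in range(max_iter):
        new = {c: 0.0 for c in labels}
        ll = 0.0                               # log of P(d_ob | CS_i)
        for word, n in word_counts.items():
            # (1) E-step: P(C_k | d_ob, w_j) proportional to
            #     P(C_k | d_ob) * P(w_j | C_k)
            mix = {c: w[c] * likelihood[(word, c)] for c in labels}
            z = sum(mix.values())
            ll += n * math.log(z)
            for c in labels:
                new[c] += n * mix[c] / z
        # (2) M-step: renormalize by the total word count of d_ob
        w = {c: new[c] / total for c in labels}
        # (3) stop once the likelihood growth falls below the tolerance
        if prev_ll is not None and ll - prev_ll < tol:
            break
        prev_ll = ll
    return w
```

Each M-step keeps the coefficients summing to 1, satisfying the constraint on the linear weighting.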
In step S12, since the prior probability P(d_ob) of the document to be classified is identical for every candidate label subset, computing the posterior probability of d_ob with respect to the current examination label subset CS_i can further comprise:
Step S121: compute the prior probability P(CS_i) of the current examination label subset CS_i.
Step S122: compute the product of the prior probability P(CS_i) and the likelihood P(d_ob | CS_i) of the document to be classified with respect to CS_i, as the posterior probability P(CS_i | d_ob) of the document to be classified with respect to the current examination label subset CS_i.
The multi-label classification method of the present invention further provides a prior-probability construction method based on a Markov chain, which takes full account of the correlation between the labels of the current examination label subset CS_i and thereby improves the accuracy of the posterior-probability computation. That is, in step S121, computing the prior probability P(CS_i) of the current examination label subset CS_i can further comprise the following steps:
Step S1211: obtain all training documents.
Step S1212: obtain all the labels involved in the training documents, forming the training tag set.
Step S1213: sort the labels of the training tag set. The labels can be sorted according to some relation, such as a hierarchical, temporal, or spatial relation; for example, the three labels "team sports", "basketball", and "sports" can be sorted according to their hierarchical relation as "sports", "team sports", "basketball". If the labels of the training tag set bear no obvious relation to one another, an arbitrary order, e.g. a random one, can be chosen; this does not affect the final effect of the method.
Step S1214: adjust the order of the multi-label annotations of all training documents so that the relative order of the labels within each annotation is consistent with the order of the corresponding labels in the training tag set. For example, if the training tag set has five labels sorted by step S1213 as A, B, C, D, E, and a multi-label annotation involves the three labels A, E, C, then that annotation is adjusted to A, C, E.
Step S1215: train a discrete Markov chain on the order-adjusted multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the order of step S1213, to the labels of the training tag set; concretely, count for each state of the Markov chain the transitions observed in the training documents. For example, suppose the training tag set has the five labels above, and the training documents comprise 50 documents annotated A, C, E, 100 documents annotated B, C, E, and 200 documents annotated B, D, E; the resulting transition frequency statistics between the states of the Markov chain are as shown in Figure 3.
Step S1216: compute the prior probability of the current examination label subset as the product of the transition probabilities between the corresponding states of the Markov chain. Here, as shown in Figures 3 and 4, the transition probability from one state of the Markov chain to another is the frequency on the corresponding outgoing arc divided by the sum of the frequencies on all outgoing arcs of the former state. For example, the prior probability of the candidate label subset A, C, E is the product of the transition probabilities from the start state to A, from A to C, from C to E, and from E to the end state, which is 1/7; likewise, the prior probability of the candidate label subset A, B, C is 0.
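Steps S1215 and S1216 can be sketched as follows; the START and END sentinel states and the function names are illustrative additions. For the worked example above (50 documents annotated A, C, E; 100 annotated B, C, E; 200 annotated B, D, E), these functions reproduce the stated priors of 1/7 for A, C, E and 0 for A, B, C.

```python
from collections import Counter

START, END = "<start>", "<end>"

def train_chain(ordered_labelings):
    """Step S1215: count transitions between consecutive labels (plus the
    start and end states) over the order-adjusted multi-label annotations."""
    counts = Counter()
    for labels in ordered_labelings:
        path = [START] + list(labels) + [END]
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    return counts

def chain_prior(counts, subset_in_order):
    """Step S1216: the prior of a candidate subset is the product of the
    transition probabilities along its ordered label sequence, each being
    an arc frequency divided by the total outgoing frequency of its state."""
    path = [START] + list(subset_in_order) + [END]
    prob = 1.0
    for a, b in zip(path, path[1:]):
        out = sum(n for (s, _), n in counts.items() if s == a)
        prob *= counts[(a, b)] / out if out else 0.0
    return prob
```

A label pair never observed in training contributes a zero factor, which is exactly how A, B, C receives prior 0 in the example.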
In step S7, obtaining the likelihood P(W_j | C_k) of the current word W_j with respect to label C_k in the current examination label subset CS_i can further comprise the following steps:
Step S71: obtain the training document d_i whose multi-label annotation is the current examination label subset CS_i.
Step S72: linearly weight the likelihoods of each word of the training document d_i with respect to each label in CS_i, obtaining the weighted likelihood of word W_jx of d_i with respect to CS_i, wherein jx is an integer ranging from 1 to jxmax, the total number of words of d_i. Here, this weighted likelihood is the same concept as the weighted likelihood of the words of the document to be classified d_ob with respect to CS_i.
Step S73: with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document d_i with respect to CS_i, train to obtain the likelihood of each word W_jx of d_i with respect to each label in CS_i.
Step S74: from the likelihoods of the words W_jx of d_i with respect to each label in CS_i, obtain the likelihood of the current word W_j with respect to label C_k in the current examination label subset CS_i.
In step S73, training to obtain the likelihood of each word W_jx of the training document d_i with respect to each label in the current examination label subset CS_i, with the objective of maximizing the continued product of the weighted likelihoods of all words of d_i with respect to CS_i, can further comprise: using the expectation-maximization (EM) algorithm to train these likelihoods under that same objective.
According to the EM algorithm, the training of step S73 can further comprise the following steps:
Step S731: determine the initial values of the likelihoods P(W_jx | C_k) of the words W_jx of the training document d_i with respect to the labels C_k of the current examination label subset CS_i.
Step S732: determine the initial values of the weighting coefficients P(C_k | d_i) with which the likelihoods of the words of d_i with respect to the labels of CS_i are linearly weighted.
Step S733: with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, and starting from the initial likelihoods and weighting coefficients above, train with the EM algorithm to obtain the likelihood of each word W_jx of d_i with respect to each label C_k of CS_i. The concrete EM computation is as follows:
(1) First traverse all words of the training document d_i and compute in turn, for each word, the combined conditional probability of label C_k in the current examination label subset CS_i:

P(C_k | d_i, w_jx) = P(C_k | d_i) · P(w_jx | C_k) / Σ_k' P(C_k' | d_i) · P(w_jx | C_k')

(2) Update the weighting coefficients P(C_k | d_i) according to the following formula:

P(C_k | d_i) = Σ_jx n(d_i, w_jx) · P(C_k | d_i, w_jx) / Σ_jx n(d_i, w_jx)

wherein n(d_i, w_jx) represents the number of occurrences of word w_jx in the training document d_i.
(3) Update the likelihoods P(w_jx | C_k) of the words of d_i with respect to label C_k of CS_i according to the following formula:

P(w_jx | C_k) = n(d_i, w_jx) · P(C_k | d_i, w_jx) / Σ_jx' n(d_i, w_jx') · P(C_k | d_i, w_jx')

wherein W_jx' is a word of the training document d_i, and jx', like jx, is an integer ranging from 1 to jxmax, the total number of words of d_i.
(4) Iterate the formulas of (1), (2) and (3) to obtain the likelihoods P(W_jx | C_k) until the iteration stopping condition is met; for the stopping condition, refer to the explanation given above for the iteration of the weighting coefficients P(C_k | d_ob).
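The full training iteration of steps S731-S733, jointly updating the weighting coefficients via formulas (1) and (2) and the per-label likelihoods via formula (3), can be sketched as follows for a single training document; the initial likelihoods are passed in (cf. step S731), the fixed iteration count stands in for the stopping condition, and all names are illustrative.

```python
def em_train(word_counts, labels, init_likelihood, max_iter=50):
    """Joint EM over P(C_k | d_i) and P(w_jx | C_k) for one training
    document d_i annotated with the current examination label subset.
    word_counts: dict word -> n(d_i, w_jx);
    init_likelihood: dict (word, label) -> initial P(w_jx | C_k)."""
    K = len(labels)
    w = {c: 1.0 / K for c in labels}            # step S732: uniform 1/|CS_i|
    lik = dict(init_likelihood)                 # step S731: model-based init
    total = sum(word_counts.values())
    for _ in range(max_iter):
        resp = {}
        for word, n in word_counts.items():
            # formula (1): combined conditional probability P(C_k | d_i, w_jx)
            mix = {c: w[c] * lik[(word, c)] for c in labels}
            z = sum(mix.values())
            for c in labels:
                resp[(word, c)] = mix[c] / z
        # formula (2): update the weighting coefficients P(C_k | d_i)
        w = {c: sum(n * resp[(word, c)] for word, n in word_counts.items()) / total
             for c in labels}
        # formula (3): update the likelihoods P(w_jx | C_k)
        for c in labels:
            denom = sum(n * resp[(word, c)] for word, n in word_counts.items())
            for word, n in word_counts.items():
                lik[(word, c)] = n * resp[(word, c)] / denom
    return w, lik
```

After each pass the weighting coefficients sum to 1 and each label's likelihoods over the document's words sum to 1, matching the normalizations implied by formulas (2) and (3).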
In step S731, determining the initial values of the likelihoods P(W_jx | C_k) of the words W_jx of the training document d_i with respect to the labels C_k of the current examination label subset CS_i can comprise: obtaining the word distribution model of each label in CS_i; and, according to these word distribution models, computing the likelihood of each word W_jx of d_i with respect to each label C_k of CS_i as the initial value of the corresponding likelihood. Since the word distribution model of a label is independent of the label subset, only one word distribution model per label involved in the candidate label subsets needs to be built; compared with building a separate model for every candidate label subset, this greatly reduces the number of word distribution models required.
In step S732, determining the initial values of the weighting coefficients P(C_k | d_i) with which the likelihoods of the words of the training document d_i with respect to the labels of the current examination label subset CS_i are linearly weighted can comprise: setting each initial weighting coefficient equal to the reciprocal of the number of labels in CS_i, i.e. 1/|CS_i|.
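The two initializations of steps S731 and S732 can be sketched together as follows; `word_model[c][w]`, standing for the word distribution model P(w | C_k) of label c, is a hypothetical structure, as is the function name.

```python
def init_parameters(word_counts, labels, word_model):
    """Steps S731-S732: likelihood initial values come from the per-label
    word distribution models (one model per label, shared across all
    candidate subsets), and the initial weighting coefficients are
    uniform, 1/|CS_i|. word_model[c][w] stands in for P(w | C_k)."""
    lik = {(w, c): word_model[c][w] for w in word_counts for c in labels}
    weights = {c: 1.0 / len(labels) for c in labels}
    return lik, weights
```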
Corresponding to the multi-label text classification method of the present invention, as shown in Figure 5, the multi-label text classification system of the present invention comprises a receiving module A, a candidate-label-subset determination module 1, a word segmentation module 2, a current-examination-label-subset extraction module 3, a current-word extraction module 4, a word likelihood acquisition module 5, a weighted-likelihood computation module 6, a document likelihood computation module 7, a posterior-probability computation module 8, and a classification-result output module 9. The receiving module A receives the document to be classified input by the user. The candidate-label-subset determination module 1 determines the candidate label subsets for the received document. The word segmentation module 2 performs word segmentation on the document to be classified, obtaining its words. The current-examination-label-subset extraction module 3 extracts, in turn, one label subset from the candidate label subsets as the current examination label subset. The current-word extraction module 4 extracts, in turn, one word from the words as the current word. The word likelihood acquisition module 5 obtains the likelihood of the current word with respect to each label in the current examination label subset. The weighted-likelihood computation module 6 linearly weights the likelihoods of the current word with respect to the labels of the current examination label subset, obtaining the weighted likelihood of the current word with respect to that subset. The document likelihood computation module 7 determines the weighting coefficients of the linear weighting so as to maximize the continued product of the weighted likelihoods of all words with respect to the current examination label subset, and takes the maximized continued product as the likelihood of the document to be classified with respect to that subset; the weighting coefficients, which are shared by all words, correspond one-to-one to the labels of the current examination label subset, and their sum equals 1. The posterior-probability computation module 8 computes, from the likelihood of the document to be classified with respect to the current examination label subset, the posterior probability of the document with respect to that subset. The classification-result output module 9 chooses, from among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document to be classified.
The candidate-label-subset determination module 1 can further comprise a tag-set acquisition unit, a label selection unit, and a candidate-label-subset output unit. The tag-set acquisition unit obtains the tag set containing all labels as the current examination label subset. The label selection unit selects, from among all the weighting coefficients determined with the whole tag set as the current examination label subset, the labels whose weighting coefficients are greater than or equal to the predetermined threshold, forming a new tag set. The candidate-label-subset output unit combines the labels of the new tag set to obtain the candidate label subsets; when all the weighting coefficients determined with the whole tag set as the current examination label subset are below the predetermined threshold, it instead selects the predetermined number of labels with the largest weighting coefficients to form the new tag set.
The document likelihood computation module 7 can also use the EM algorithm to determine the weighting coefficients of the linear weighting, so that the continued product of the weighted likelihoods of all words with respect to the current examination label subset is maximized.
The posterior-probability computation module 8 can further comprise a prior-probability computation unit and a posterior-probability computation unit. The prior-probability computation unit computes the prior probability of the current examination label subset; the posterior-probability computation unit computes the product of that prior probability and the likelihood of the document to be classified with respect to the current examination label subset, as the posterior probability of the document with respect to that subset.
The prior-probability computation unit can further comprise a training-document acquisition subunit, a training-tag-set acquisition subunit, a sorting subunit, an order-adjustment subunit, a Markov-chain training subunit, and a prior-probability computation subunit. The training-document acquisition subunit obtains all training documents. The training-tag-set acquisition subunit obtains all the labels involved in the training documents, forming the training tag set. The sorting subunit sorts the labels of the training tag set. The order-adjustment subunit adjusts the order of the multi-label annotations of all training documents so that the relative order of the labels within each annotation is consistent with the order of the corresponding labels in the training tag set. The Markov-chain training subunit trains a discrete Markov chain on the order-adjusted multi-label annotations, so that the states of the Markov chain correspond one-to-one, in the stated order, to the labels of the training tag set. The prior-probability computation subunit computes the prior probability of the current examination label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
The word likelihood acquisition module 5 can further comprise a training-document acquisition subunit, a training weighted-likelihood acquisition unit, a training-parameter determination unit, and a word likelihood acquisition unit. The training-document acquisition subunit obtains the training document whose multi-label annotation is the current examination label subset. The training weighted-likelihood acquisition unit linearly weights the likelihoods of the words of the training document with respect to the labels of the current examination label subset, obtaining the weighted likelihoods of the words of the training document with respect to that subset. The training-parameter determination unit trains, with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, to obtain the likelihood of each word of the training document with respect to each label of that subset. The word likelihood acquisition unit obtains, from those likelihoods, the likelihood of the current word with respect to each label in the current examination label subset.
The training-parameter determination unit can also use the EM algorithm to train to obtain the likelihoods of the words of the training document with respect to the labels of the current examination label subset. Accordingly, the training-parameter determination unit can further comprise a likelihood initial-value determination subunit, a weighting-coefficient initial-value determination subunit, and a training-parameter determination subunit. The likelihood initial-value determination subunit determines the initial values of the likelihoods of the words of the training document with respect to the labels of the current examination label subset. The weighting-coefficient initial-value determination subunit determines the initial values of the weighting coefficients with which those likelihoods are linearly weighted. The training-parameter determination subunit trains with the EM algorithm, with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, and starting from the initial likelihoods and weighting coefficients, to obtain the likelihood of each word of the training document with respect to each label of that subset.
The likelihood initial-value determination subunit can also: obtain the word distribution model of each label in the current examination label subset; and, according to those word distribution models, compute the likelihood of each word of the training document with respect to each label of the current examination label subset as the initial value of the corresponding likelihood.
The weighting-coefficient initial-value determination subunit can also set the initial value of each weighting coefficient of the linear weighting equal to the reciprocal of the number of labels in the current examination label subset.
Here, the explanations of the respective parts of the multi-label text classification system of the present invention are consistent with the explanations of the corresponding parts of the multi-label text classification method of the present invention.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The foregoing are merely preferred embodiments of the present invention, and the scope of the invention is not limited to what is shown in the drawings; any change made according to the conception of the present invention, or any equivalent embodiment of equivalent variation, that does not depart from the spirit covered by the specification and drawings shall fall within the protection scope of the present invention.