Multi-label text classification method and system
Technical field
The present invention relates to the field of text classification, and in particular to a multi-label text classification method and system.
Background art
In recent years, with the rapid development of the Internet and in particular the arrival of the mobile Internet era, mankind has entered the age of big data: massive volumes of data are produced every day, so analyzing such mass data to extract valuable information has become a common focus of academia and industry. As the main outward form of mass data, text and its related processing technologies have attracted great attention, and text classification technology has entered a new stage of development.
Traditional text classification methods mainly address single-label classification, i.e., one class label per text. In real life, however, a text often has more than one class label. For example, an article introducing the economy of a country will probably also touch on its politics and culture, and should then carry at least the three labels of economy, politics and culture; a report on a sports match may devote considerable space to a particular sports star, and should then carry at least the two labels of match report and sports star. Multi-label classification thus provides richer classification information and is of great help to subsequent applications such as text classification management, monitoring and filtering. The multi-label classification problem, namely assigning multiple labels to one document, is therefore of practical significance.
In principle, the optimal multi-label classification method models every label subset separately, determining a probability model for each subset; for a document to be classified, the posterior probability of the document with respect to every label subset is then computed in turn, and the subset with the maximum posterior probability is selected as the classification result. However, the number of label subsets grows exponentially with the number of labels: if a multi-label classification problem involves K labels, the number of label subsets is in theory 2^K − 1. When K is large the number of subsets is enormous, and building a separate model for every subset is impractical.
For these reasons, traditional multi-label classification mainly adopts the naive Bayes classifier, which assumes that the labels are independently distributed and that each label occurs with equal probability. Each label corresponds to a word distribution model, which can be regarded as a probability model. On this basis, the existing multi-label classification method proceeds as follows:
Step one: train two probability models for each label, namely a model for documents that carry the label and a model for documents that do not. Taking the model for documents carrying the label as an example, the training procedure is:
Step 1: collect a large amount of text data carrying the label as training data.
Step 2: count the occurrence probability of each word in the training data.
Step 3: take the set of word occurrence probabilities as the probability model of the label, to be used when subsequently computing the document generation probability of the label.
Accordingly, the model for documents not carrying the label is trained on collected text data that does not carry the label.
Step two: obtain a document C to be classified.
Step three: judge in turn whether document C carries each label in the label set. Whether document C carries a label X is judged as follows:
Step 1: segment document C into words, obtaining a word sequence Cx.
Step 2: compute the likelihood of the word sequence Cx under the model carrying label X. Under the naive Bayes assumption, this likelihood equals the product over the words in Cx of the word generation probabilities of the model carrying label X.
Step 3: compute the likelihood of Cx under the model not carrying label X.
Step 4: compute the likelihood ratio between the likelihood of Cx under the model carrying label X and its likelihood under the model not carrying label X.
Step 5: if the likelihood ratio is greater than 1, the document is considered to carry label X; otherwise it is considered not to carry it.
Step four: collect the judgments to obtain the multi-label classification result of the document.
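The per-label likelihood-ratio decision described above can be sketched as follows. This is a minimal illustration with toy data: the function names, the add-one smoothing, and the comparison in log space (equivalent to comparing the likelihood ratio with 1) are assumptions of the sketch, not part of the original description.

```python
import math
from collections import Counter

def train_label_model(docs):
    """Estimate word occurrence probabilities from tokenized training docs
    (add-one smoothing so unseen words do not zero out the product)."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    vocab = len(counts)
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def has_label(tokens, p_with, p_without):
    """Assign label X when P(tokens | X) / P(tokens | not-X) > 1,
    computed in log space for numerical stability."""
    log_ratio = sum(math.log(p_with(w)) - math.log(p_without(w)) for w in tokens)
    return log_ratio > 0

# toy data: documents carrying vs. not carrying a 'sports' label
with_docs = [["match", "score", "team"], ["team", "score", "win"]]
without_docs = [["market", "price", "stock"], ["price", "economy"]]
p_with = train_label_model(with_docs)
p_without = train_label_model(without_docs)
print(has_label(["team", "score"], p_with, p_without))  # True
```

The same judgment is repeated for every label in the label set, which is exactly where the independence assumption criticized below comes in: each label is decided in isolation.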
In the above naive-Bayes-based classification method, the document probability distributions of the labels are artificially assumed to be mutually independent. This independence assumption makes the method simple, easy to implement and efficient when solving the multi-label classification problem. In practical applications, however, the assumption does not hold: an article introducing the economy of a country will probably also involve its politics and culture, so there is clearly probabilistic correlation among the three labels of economy, politics and culture. Simply assuming that the labels are mutually independent is therefore unreasonable, and for this very reason the above multi-label classification method usually fails to achieve the desired classification performance.
Another method that can classify multi-label text accurately builds a separate probability model for every determined label subset, then computes, for the document to be classified, its posterior probability with respect to every subset in turn, and finally selects the subset with the maximum posterior probability as the classification result. Because the number of label subsets grows exponentially with the number of labels, building a probability model for every subset is impractical, so this classification method has not been widely applied.
Summary of the invention
In view of the problems of existing multi-label text classification methods, embodiments of the present invention provide a practicable multi-label text classification method and system that take the correlation among the labels within a label subset into account before classification.
To achieve the above object, the technical solution adopted by the present invention is a multi-label text classification method, comprising:
determining candidate label subsets for a received document to be classified;
performing word segmentation on the document to be classified to obtain its words;
extracting label subsets one by one from the candidate label subsets as the current label subset;
extracting words one by one from the obtained words as the current word;
obtaining the likelihood of the current word with respect to each label in the current label subset;
linearly weighting the likelihoods of the current word with respect to the labels in the current label subset to obtain the weighted likelihood of the current word with respect to the current label subset;
determining the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized, and taking the maximum product as the likelihood of the document with respect to the current label subset, wherein the weighting coefficients, shared by all words, correspond one-to-one to the labels in the current label subset and sum to 1;
calculating, from the likelihood of the document with respect to the current label subset, the posterior probability of the document with respect to the current label subset; and
selecting, among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document.
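The overall classification flow — score every candidate label subset by its prior times the document likelihood under linearly weighted per-label word models, then keep the argmax — can be sketched as below. The toy word models, the fixed per-subset weighting coefficients, and the small floor value for unseen words are illustrative assumptions; in the actual method the coefficients are optimized per document rather than given.

```python
import math

def doc_log_likelihood(tokens, subset, word_lik, weights):
    """Log of the product over words of the weighted likelihood
    sum_k weights[k] * P(w | C_k), per the linear-weighting scheme."""
    total = 0.0
    for w in tokens:
        mix = sum(weights[c] * word_lik[c].get(w, 1e-6) for c in subset)
        total += math.log(mix)
    return total

def classify(tokens, subsets, word_lik, priors, subset_weights):
    """Pick the candidate label subset with the highest log posterior:
    log prior + log likelihood of the document under that subset."""
    best, best_score = None, float("-inf")
    for subset in subsets:
        score = math.log(priors[subset]) + doc_log_likelihood(
            tokens, subset, word_lik, subset_weights[subset])
        if score > best_score:
            best, best_score = subset, score
    return best

# toy models (hypothetical numbers, for illustration only)
word_lik = {"econ": {"market": 0.4, "growth": 0.3},
            "sport": {"team": 0.5, "score": 0.3}}
subsets = [("econ",), ("sport",), ("econ", "sport")]
priors = {("econ",): 0.4, ("sport",): 0.4, ("econ", "sport"): 0.2}
subset_weights = {("econ",): {"econ": 1.0},
                  ("sport",): {"sport": 1.0},
                  ("econ", "sport"): {"econ": 0.5, "sport": 0.5}}
print(classify(["market", "growth"], subsets, word_lik, priors, subset_weights))  # ('econ',)
```

Working in log space avoids underflow when the product runs over many words; the argmax is unchanged because the logarithm is monotonic.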
Preferably, the determining candidate label subsets for the received document to be classified comprises:
obtaining the label set containing all labels;
taking the label set itself as the current label subset, and from the weighting coefficients determined for it, selecting the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new label set; and
combining the labels in the new label set to obtain the candidate label subsets.
Preferably, the determining candidate label subsets for the received document to be classified further comprises:
when all the weighting coefficients determined with the label set as the current label subset are less than the predetermined threshold, selecting a predetermined number of labels with the largest weighting coefficients to form the new label set.
Preferably, the determining the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized comprises:
using the EM (expectation-maximization) algorithm to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized.
Preferably, the calculating the posterior probability of the document with respect to the current label subset from its likelihood comprises:
calculating the prior probability of the current label subset; and
calculating the product of the prior probability of the current label subset and the likelihood of the document with respect to the current label subset as the posterior probability of the document with respect to the current label subset.
Preferably, the calculating the prior probability of the current label subset comprises:
obtaining all training documents;
obtaining the labels involved in all training documents to compose a training label set;
ordering the labels in the training label set;
reordering the multi-label annotations of all training documents so that the order of the labels within each annotation is consistent with the order of the corresponding labels in the training label set;
training a discrete Markov chain on the reordered multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the above order, to the labels in the training label set; and
calculating the prior probability of the current label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
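Under the stated assumptions, the subset-prior computation can be sketched as follows. The chain's probabilities here are made-up numbers, and treating the first label's probability as a start probability is an assumption of the sketch; the method itself only specifies that the prior equals the product of the transition probabilities between the corresponding states of the trained chain.

```python
def subset_prior(subset, order, start, trans):
    """Prior of a label subset: probability of its first label (in the fixed
    label order) times the transition probabilities along the Markov chain."""
    labels = sorted(subset, key=order.index)   # impose the training order
    p = start[labels[0]]
    for prev, cur in zip(labels, labels[1:]):
        p *= trans[(prev, cur)]
    return p

# hypothetical chain over four ordered labels
order = ["economy", "politics", "culture", "sports"]
start = {"economy": 0.4, "politics": 0.3, "culture": 0.2, "sports": 0.1}
trans = {("economy", "politics"): 0.3, ("politics", "culture"): 0.5,
         ("economy", "culture"): 0.2}
print(subset_prior({"economy", "politics"}, order, start, trans))  # 0.4 * 0.3, i.e. about 0.12
```

Reordering every annotation to the fixed label order (the step above) is what makes the chain well defined: each subset then maps to exactly one state path.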
Preferably, the obtaining the likelihood of the current word with respect to each label in the current label subset comprises:
obtaining the training documents whose multi-label annotation is the current label subset;
linearly weighting the likelihoods of the words in the training documents with respect to the labels in the current label subset to obtain the weighted likelihood of each word in the training documents with respect to the current label subset;
training, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, the likelihoods of the words in the training documents with respect to each label in the current label subset; and
obtaining, from the likelihoods of the words in the training documents with respect to the labels in the current label subset, the likelihood of the current word with respect to each label in the current label subset.
Preferably, the training of the likelihoods of the words in the training documents with respect to each label in the current label subset, with the objective of maximizing the product over all words of their weighted likelihoods with respect to the current label subset, comprises:
determining initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset;
determining initial values of the weighting coefficients of the linear weighting applied to those likelihoods; and
training, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, and starting from the initial likelihoods and initial weighting coefficients, the likelihoods of the words in the training documents with respect to each label in the current label subset by means of the EM algorithm.
Preferably, the determining initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset comprises:
obtaining the word distribution model of each label in the current label subset; and
calculating, according to the word distribution model of each label in the current label subset, the likelihoods of the words in the training documents with respect to the labels as the initial values of the corresponding likelihoods.
Preferably, the determining initial values of the weighting coefficients comprises:
setting each initial weighting coefficient to the reciprocal of the number of labels in the current label subset.
To achieve the above object, the present invention also adopts the following technical solution: a multi-label text classification system, comprising:
a candidate label subset determination module, configured to determine candidate label subsets for a received document to be classified;
a word segmentation module, configured to perform word segmentation on the document to be classified to obtain its words;
a current label subset extraction module, configured to extract label subsets one by one from the candidate label subsets as the current label subset;
a current word extraction module, configured to extract words one by one from the obtained words as the current word;
a word likelihood acquisition module, configured to obtain the likelihood of the current word with respect to each label in the current label subset;
a weighted likelihood computation module, configured to linearly weight the likelihoods of the current word with respect to the labels in the current label subset to obtain the weighted likelihood of the current word with respect to the current label subset;
a document likelihood computation module, configured to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized, and to take the maximum product as the likelihood of the document with respect to the current label subset, wherein the weighting coefficients, shared by all words, correspond one-to-one to the labels in the current label subset and sum to 1;
a posterior probability computation module, configured to calculate, from the likelihood of the document with respect to the current label subset, the posterior probability of the document with respect to the current label subset; and
a classification result output module, configured to select, among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document.
Preferably, the candidate label subset determination module comprises:
a label set acquisition unit, configured to obtain the label set containing all labels as the current label subset;
a label selection unit, configured to select, from the weighting coefficients determined with the label set as the current label subset, the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new label set; and
a candidate label subset output unit, configured to combine the labels in the new label set to obtain the candidate label subsets.
Preferably, the candidate label subset output unit is further configured to, when all the weighting coefficients determined with the label set as the current label subset are less than the predetermined threshold, select a predetermined number of labels with the largest weighting coefficients to form the new label set.
Preferably, the document likelihood computation module is further configured to use the EM algorithm to determine the weighting coefficients of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset is maximized.
Preferably, the posterior probability computation module comprises:
a prior probability computation unit, configured to calculate the prior probability of the current label subset; and
a posterior probability computation unit, configured to calculate the product of the prior probability of the current label subset and the likelihood of the document with respect to the current label subset as the posterior probability of the document with respect to the current label subset.
Preferably, the prior probability computation unit comprises:
a training document acquisition subunit, configured to obtain all training documents;
a training label set acquisition subunit, configured to obtain the labels involved in all training documents and compose a training label set;
an ordering subunit, configured to order the labels in the training label set;
a reordering subunit, configured to reorder the multi-label annotations of all training documents so that the order of the labels within each annotation is consistent with the order of the corresponding labels in the training label set;
a Markov chain training subunit, configured to train a discrete Markov chain on the reordered multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the above order, to the labels in the training label set; and
a prior probability computation subunit, configured to calculate the prior probability of the current label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
Preferably, the word likelihood acquisition module comprises:
a training document acquisition subunit, configured to obtain the training documents whose multi-label annotation is the current label subset;
a training weighted likelihood acquisition unit, configured to linearly weight the likelihoods of the words in the training documents with respect to the labels in the current label subset to obtain the weighted likelihoods of the words in the training documents with respect to the current label subset;
a training parameter determination unit, configured to train, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, the likelihoods of the words in the training documents with respect to each label in the current label subset; and
a word likelihood acquisition unit, configured to obtain, from the likelihoods of the words in the training documents with respect to the labels in the current label subset, the likelihood of the current word with respect to each label in the current label subset.
Preferably, the training parameter determination unit comprises:
a likelihood initial value determination subunit, configured to determine initial values of the likelihoods of the words in the training documents with respect to the labels in the current label subset;
a weighting coefficient initial value determination subunit, configured to determine initial values of the weighting coefficients of the linear weighting applied to those likelihoods; and
a training parameter determination subunit, configured to train, with the objective of maximizing the product over all words in the training documents of their weighted likelihoods with respect to the current label subset, and starting from the initial likelihoods and initial weighting coefficients, the likelihoods of the words in the training documents with respect to each label in the current label subset by means of the EM algorithm.
Preferably, the likelihood initial value determination subunit is further configured to:
obtain the word distribution model of each label in the current label subset; and
calculate, according to the word distribution model of each label in the current label subset, the likelihoods of the words in the training documents with respect to the labels as the initial values of the corresponding likelihoods.
Preferably, the weighting coefficient initial value determination subunit is further configured to set each initial weighting coefficient to the reciprocal of the number of labels in the current label subset.
The beneficial effects of the present invention are as follows. The multi-label text classification method and system of the present invention take the correlation among the labels within a label subset into account by linearly weighting the likelihoods of each word in the document to be classified with respect to the labels in the subset, and reasonably optimize the correlation embodied by the weighting coefficients by maximizing the product over all words of their weighted likelihoods. Compared with existing multi-label classification methods and systems, the method and system of the present invention therefore combine high overall performance with a small amount of computation.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the multi-label text classification method of the present invention;
Fig. 2 is a flowchart of an embodiment of calculating the prior probability of the current label subset in the multi-label classification method of the present invention;
Fig. 3 shows the Markov chain determined by the method of Fig. 2;
Fig. 4 shows the Markov state transition probabilities obtained by correcting the Markov chain of Fig. 2;
Fig. 5 is a block diagram of an embodiment of the multi-label classification system of the present invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, with examples shown in the drawings, where throughout the same or similar reference numerals denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present invention, and are not to be construed as limiting it.
As shown in Fig. 1, an embodiment of the multi-label text classification method of the present invention can comprise the following steps:
Step S1: determine candidate label subsets for the received document d_ob to be classified. Here, if the label set contains 20 labels, the most basic way of determining the candidate label subsets is to take every combination of the 20 labels.
Step S2: perform word segmentation on the document d_ob to be classified, obtaining its words.
Step S3: set i = 1.
Step S4: extract label subset CS_i from the candidate label subsets as the current label subset.
Step S5: set j = 1.
Step S6: extract word W_j from the words as the current word.
Step S7: obtain the likelihood P(W_j | C_k) of the current word W_j with respect to label C_k in the current label subset CS_i, where k ranges over the integers from 1 to the total number of labels in CS_i, and the value of P(W_j | C_k) is obtained in advance through the model training process.
Step S8: linearly weight the likelihoods of the current word W_j with respect to the labels in the current label subset CS_i to obtain the weighted likelihood of W_j with respect to CS_i, namely Σ_k P(C_k | d_ob) · P(W_j | C_k), where P(C_k | d_ob) is the weighting coefficient corresponding to label C_k for the document d_ob to be classified. It should be understood that the weighting coefficients, shared by all words, correspond one-to-one to the labels in CS_i — that is, different words use the same weighting coefficient for the likelihood of the same label in CS_i — and that the linear weighting must satisfy the constraint that the weighting coefficients sum to 1.
Step S9: judge whether j equals the total number jmax of words obtained by word segmentation; if so, go to step S11, otherwise go to step S10.
Step S10: set j = j + 1, then return to step S6.
Step S11: determine the weighting coefficients P(C_k | d_ob) of the linear weighting so that the product over all words of their weighted likelihoods with respect to the current label subset CS_i is maximized, and take the maximum product as the likelihood P(d_ob | CS_i) of the document d_ob with respect to CS_i, namely P(d_ob | CS_i) = max over the coefficients P(C_k | d_ob) of Π_j Σ_k P(C_k | d_ob) · P(W_j | C_k).
Step S12: calculate, from the likelihood P(d_ob | CS_i) of the document d_ob with respect to the current label subset CS_i, the posterior probability P(CS_i | d_ob) of the document with respect to CS_i. By Bayes' theorem, P(CS_i | d_ob) = P(d_ob | CS_i) · P(CS_i) / P(d_ob), where P(d_ob) and P(CS_i) are respectively the prior probability of the document to be classified and the prior probability of the current label subset CS_i.
Step S13: judge whether i equals the total number imax of candidate label subsets; if so, go to step S15, otherwise go to step S14.
Step S14: set i = i + 1, then return to step S4.
Step S15: select, among the candidate label subsets, the subset maximizing the posterior probability P(CS_i | d_ob) as the classification result of the document d_ob.
By adopting the above method, the multi-label text classification method of the present invention takes the correlation among the labels within a candidate label subset into account by linearly weighting the likelihoods of each word W_j in the document d_ob with respect to the labels in the subset, and reasonably optimizes the correlation embodied by the weighting coefficients by maximizing the product over all words of the weighted likelihoods, so that the method and system of the present invention achieve higher overall performance than existing multi-label classification methods and systems. Compared with the multi-label classification method that builds a separate probability model for every label subset, computes the posterior probability of the document d_ob with respect to every subset in turn, and finally selects the subset with the maximum posterior probability, the present method also greatly reduces the amount of computation: although that method can obtain the most accurate classification result, the number of label subsets grows exponentially with the number of labels, so building a probability model for every subset is impractical, and that method has not been widely applied.
In order to further reduce the amount of computation of the multi-label classification method of the present invention, determining the candidate label subsets for the document to be classified d_ob received in step S1 can further comprise the following steps:
Step S101: obtain the tag set containing all labels.
Step S102: taking the whole tag set as the current examination label subset (e.g. as CS_imax), determine all the weighting coefficients according to steps S4 to S11, and select from among them the labels whose weighting coefficients are greater than or equal to a predetermined threshold to form a new tag set. The predetermined threshold can be chosen according to the specific multi-label classification task and the required classification accuracy; the smaller the threshold, the higher the classification accuracy.
Step S103: combine the labels of the new tag set to obtain the candidate label subsets; if the new tag set has m labels, 2^m - 1 candidate label subsets are obtained.
If all the weighting coefficients determined with the whole tag set as the current examination subset are below the predetermined threshold, the predetermined number of labels with the largest weighting coefficients can be selected to form the new tag set; for example, if the predetermined number is five, the weighting coefficients are arranged in descending order and the labels corresponding to the first five coefficients form the new tag set.
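Steps S101 through S103, together with the fallback just described, can be sketched as follows. This is an illustrative Python sketch: the `weights` mapping stands for the weighting coefficients obtained by running steps S4 to S11 with the full tag set as the current examination subset, and the function name is hypothetical.

```python
from itertools import combinations

def candidate_subsets(labels, weights, threshold, fallback_n=5):
    """Steps S101-S103: keep the labels whose weighting coefficient
    reaches the threshold (falling back to the fallback_n largest when
    none does), then enumerate all 2^m - 1 non-empty subsets."""
    kept = [l for l in labels if weights[l] >= threshold]
    if not kept:  # all coefficients below the threshold: take the largest ones
        kept = sorted(labels, key=lambda l: weights[l], reverse=True)[:fallback_n]
    subsets = []
    for r in range(1, len(kept) + 1):
        subsets.extend(frozenset(c) for c in combinations(kept, r))
    return subsets
```

Note the trade-off the specification describes: a smaller threshold keeps more labels, so m grows and so does the 2^m - 1 enumeration, in exchange for higher classification accuracy.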
In step S11, the expectation-maximization (EM) algorithm can be used to determine the weighting coefficients P(C_k | d_ob) of the linear weighting, so that the continued product of the weighted likelihoods of all words with respect to the current examination label subset is maximized. The calculation steps of the EM algorithm are as follows:
(1) First traverse all words of the document to be classified d_ob and compute in turn, for each word, the combined conditional probability P(C_k | d_ob, w_j) of label C_k in the current examination label subset CS_i:

P(C_k | d_ob, w_j) = P(C_k | d_ob) · P(w_j | C_k) / Σ_k' P(C_k' | d_ob) · P(w_j | C_k')

wherein C_k' is a label in the current examination label subset CS_i, and k', like k, is an integer ranging from 1 to the total number of labels in CS_i.
(2) Update the weighting coefficients P(C_k | d_ob) according to the following formula:

P(C_k | d_ob) = Σ_j n(d_ob, w_j) · P(C_k | d_ob, w_j) / Σ_j n(d_ob, w_j)

wherein n(d_ob, w_j) represents the number of occurrences of word w_j in the document to be classified d_ob.
(3) Iterate the formulas of (1) and (2) to update the weighting coefficients P(C_k | d_ob) until the iteration stopping condition is met. Before iterating, initial values must be set for the weighting coefficients P(C_k | d_ob); any initial values satisfying the constraint that the weighting coefficients sum to 1 will do, for example equal initial values of 1/|CS_i|. Under normal circumstances, the stopping condition can be that the number of iterations has reached a maximum, or that the growth rate of the likelihood P(d_ob | CS_i) relative to the previous iteration has fallen below a set ratio of, for example, 2% to 5%. According to the EM algorithm, as the iteration proceeds, the likelihood P(d_ob | CS_i) of the document to be classified with respect to the current examination label subset CS_i increases gradually; beyond a certain point its growth slows down until it essentially stops changing, at which stage further iterations serve no practical purpose. A person skilled in the art can therefore preset a suitable maximum number of iterations, or set the stopping condition in terms of the growth rate, according to the specific multi-label classification task.
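The EM iteration (1)-(3) above can be sketched in Python as follows, assuming the per-label likelihoods P(w_j | C_k) are supplied as a dict keyed by (word, label); the function name and the tolerance-based stopping test are illustrative choices, not from the specification.

```python
import math

def em_weights(word_counts, likelihood, labels, max_iter=100, tol=1e-6):
    """EM iteration for the weighting coefficients P(C_k | d_ob).

    word_counts: dict word -> n(d_ob, w_j); likelihood: dict
    (word, label) -> P(w_j | C_k), assumed given by the training stage."""
    K = len(labels)
    w = {c: 1.0 / K for c in labels}           # initial values sum to 1
    total = sum(word_counts.values())
    prev_ll = None
    for _ in range(max_iter):
        new = {c: 0.0 for c in labels}
        ll = 0.0                               # log of P(d_ob | CS_i)
        for word, n in word_counts.items():
            # (1) E-step: P(C_k | d_ob, w_j) proportional to
            #     P(C_k | d_ob) * P(w_j | C_k)
            mix = {c: w[c] * likelihood[(word, c)] for c in labels}
            z = sum(mix.values())
            ll += n * math.log(z)
            for c in labels:
                new[c] += n * mix[c] / z
        # (2) M-step: renormalize by the total word count of d_ob
        w = {c: new[c] / total for c in labels}
        # (3) stop once the likelihood growth falls below the tolerance
        if prev_ll is not None and ll - prev_ll < tol:
            break
        prev_ll = ll
    return w
```

Each M-step keeps the coefficients summing to 1, satisfying the constraint on the linear weighting.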
In step S12, since the prior probability P(d_ob) of the document to be classified is identical for every candidate label subset, computing the posterior probability of d_ob with respect to the current examination label subset CS_i can further comprise:
Step S121: compute the prior probability P(CS_i) of the current examination label subset CS_i.
Step S122: compute the product of the prior probability P(CS_i) and the likelihood P(d_ob | CS_i) of the document to be classified with respect to CS_i, as the posterior probability P(CS_i | d_ob) of the document to be classified with respect to the current examination label subset CS_i.
The multi-label classification method of the present invention further provides a prior-probability construction method based on a Markov chain, which takes full account of the correlation between the labels of the current examination label subset CS_i and thereby improves the accuracy of the posterior-probability computation. That is, in step S121, computing the prior probability P(CS_i) of the current examination label subset CS_i can further comprise the following steps:
Step S1211: obtain all training documents.
Step S1212: obtain all the labels involved in the training documents, forming the training tag set.
Step S1213: sort the labels of the training tag set. The labels can be sorted according to some relation, such as a hierarchical, temporal, or spatial relation; for example, the three labels "team sports", "basketball", and "sports" can be sorted according to their hierarchical relation as "sports", "team sports", "basketball". If the labels of the training tag set bear no obvious relation to one another, an arbitrary order, e.g. a random one, can be chosen; this does not affect the final effect of the method.
Step S1214: adjust the order of the multi-label annotations of all training documents so that the relative order of the labels within each annotation is consistent with the order of the corresponding labels in the training tag set. For example, if the training tag set has five labels sorted by step S1213 as A, B, C, D, E, and a multi-label annotation involves the three labels A, E, C, then that annotation is adjusted to A, C, E.
Step S1215: train a discrete Markov chain on the order-adjusted multi-label annotations of all training documents, so that the states of the Markov chain correspond one-to-one, in the order of step S1213, to the labels of the training tag set; concretely, count for each state of the Markov chain the transitions observed in the training documents. For example, suppose the training tag set has the five labels above, and the training documents comprise 50 documents annotated A, C, E, 100 documents annotated B, C, E, and 200 documents annotated B, D, E; the resulting transition frequency statistics between the states of the Markov chain are as shown in Figure 3.
Step S1216: compute the prior probability of the current examination label subset as the product of the transition probabilities between the corresponding states of the Markov chain. Here, as shown in Figures 3 and 4, the transition probability from one state of the Markov chain to another is the frequency on the corresponding outgoing arc divided by the sum of the frequencies on all outgoing arcs of the former state. For example, the prior probability of the candidate label subset A, C, E is the product of the transition probabilities from the start state to A, from A to C, from C to E, and from E to the end state, which is 1/7; likewise, the prior probability of the candidate label subset A, B, C is 0.
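Steps S1215 and S1216 can be sketched as follows; the START and END sentinel states and the function names are illustrative additions. For the worked example above (50 documents annotated A, C, E; 100 annotated B, C, E; 200 annotated B, D, E), these functions reproduce the stated priors of 1/7 for A, C, E and 0 for A, B, C.

```python
from collections import Counter

START, END = "<start>", "<end>"

def train_chain(ordered_labelings):
    """Step S1215: count transitions between consecutive labels (plus the
    start and end states) over the order-adjusted multi-label annotations."""
    counts = Counter()
    for labels in ordered_labelings:
        path = [START] + list(labels) + [END]
        for a, b in zip(path, path[1:]):
            counts[(a, b)] += 1
    return counts

def chain_prior(counts, subset_in_order):
    """Step S1216: the prior of a candidate subset is the product of the
    transition probabilities along its ordered label sequence, each being
    an arc frequency divided by the total outgoing frequency of its state."""
    path = [START] + list(subset_in_order) + [END]
    prob = 1.0
    for a, b in zip(path, path[1:]):
        out = sum(n for (s, _), n in counts.items() if s == a)
        prob *= counts[(a, b)] / out if out else 0.0
    return prob
```

A label pair never observed in training contributes a zero factor, which is exactly how A, B, C receives prior 0 in the example.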
In step S7, obtaining the likelihood P(W_j | C_k) of the current word W_j with respect to label C_k in the current examination label subset CS_i can further comprise the following steps:
Step S71: obtain the training document d_i whose multi-label annotation is the current examination label subset CS_i.
Step S72: linearly weight the likelihoods of each word of the training document d_i with respect to each label in CS_i, obtaining the weighted likelihood of word W_jx of d_i with respect to CS_i, wherein jx is an integer ranging from 1 to jxmax, the total number of words of d_i. Here, this weighted likelihood is the same concept as the weighted likelihood of the words of the document to be classified d_ob with respect to CS_i.
Step S73: with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document d_i with respect to CS_i, train to obtain the likelihood of each word W_jx of d_i with respect to each label in CS_i.
Step S74: from the likelihoods of the words W_jx of d_i with respect to each label in CS_i, obtain the likelihood of the current word W_j with respect to label C_k in the current examination label subset CS_i.
In step S73, training to obtain the likelihood of each word W_jx of the training document d_i with respect to each label in the current examination label subset CS_i, with the objective of maximizing the continued product of the weighted likelihoods of all words of d_i with respect to CS_i, can further comprise: using the expectation-maximization (EM) algorithm to train these likelihoods under that same objective.
According to the EM algorithm, the training of step S73 can further comprise the following steps:
Step S731: determine the initial values of the likelihoods P(W_jx | C_k) of the words W_jx of the training document d_i with respect to the labels C_k of the current examination label subset CS_i.
Step S732: determine the initial values of the weighting coefficients P(C_k | d_i) with which the likelihoods of the words of d_i with respect to the labels of CS_i are linearly weighted.
Step S733: with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, and starting from the initial likelihoods and weighting coefficients above, train with the EM algorithm to obtain the likelihood of each word W_jx of d_i with respect to each label C_k of CS_i. The concrete EM computation is as follows:
(1) First traverse all words of the training document d_i and compute in turn, for each word, the combined conditional probability of label C_k in the current examination label subset CS_i:

P(C_k | d_i, w_jx) = P(C_k | d_i) · P(w_jx | C_k) / Σ_k' P(C_k' | d_i) · P(w_jx | C_k')

(2) Update the weighting coefficients P(C_k | d_i) according to the following formula:

P(C_k | d_i) = Σ_jx n(d_i, w_jx) · P(C_k | d_i, w_jx) / Σ_jx n(d_i, w_jx)

wherein n(d_i, w_jx) represents the number of occurrences of word w_jx in the training document d_i.
(3) Update the likelihoods P(w_jx | C_k) of the words of d_i with respect to label C_k of CS_i according to the following formula:

P(w_jx | C_k) = n(d_i, w_jx) · P(C_k | d_i, w_jx) / Σ_jx' n(d_i, w_jx') · P(C_k | d_i, w_jx')

wherein W_jx' is a word of the training document d_i, and jx', like jx, is an integer ranging from 1 to jxmax, the total number of words of d_i.
(4) Iterate the formulas of (1), (2) and (3) to obtain the likelihoods P(W_jx | C_k) until the iteration stopping condition is met; for the stopping condition, refer to the explanation given above for the iteration of the weighting coefficients P(C_k | d_ob).
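The full training iteration of steps S731-S733, jointly updating the weighting coefficients via formulas (1) and (2) and the per-label likelihoods via formula (3), can be sketched as follows for a single training document; the initial likelihoods are passed in (cf. step S731), the fixed iteration count stands in for the stopping condition, and all names are illustrative.

```python
def em_train(word_counts, labels, init_likelihood, max_iter=50):
    """Joint EM over P(C_k | d_i) and P(w_jx | C_k) for one training
    document d_i annotated with the current examination label subset.
    word_counts: dict word -> n(d_i, w_jx);
    init_likelihood: dict (word, label) -> initial P(w_jx | C_k)."""
    K = len(labels)
    w = {c: 1.0 / K for c in labels}            # step S732: uniform 1/|CS_i|
    lik = dict(init_likelihood)                 # step S731: model-based init
    total = sum(word_counts.values())
    for _ in range(max_iter):
        resp = {}
        for word, n in word_counts.items():
            # formula (1): combined conditional probability P(C_k | d_i, w_jx)
            mix = {c: w[c] * lik[(word, c)] for c in labels}
            z = sum(mix.values())
            for c in labels:
                resp[(word, c)] = mix[c] / z
        # formula (2): update the weighting coefficients P(C_k | d_i)
        w = {c: sum(n * resp[(word, c)] for word, n in word_counts.items()) / total
             for c in labels}
        # formula (3): update the likelihoods P(w_jx | C_k)
        for c in labels:
            denom = sum(n * resp[(word, c)] for word, n in word_counts.items())
            for word, n in word_counts.items():
                lik[(word, c)] = n * resp[(word, c)] / denom
    return w, lik
```

After each pass the weighting coefficients sum to 1 and each label's likelihoods over the document's words sum to 1, matching the normalizations implied by formulas (2) and (3).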
In step S731, determining the initial values of the likelihoods P(W_jx | C_k) of the words W_jx of the training document d_i with respect to the labels C_k of the current examination label subset CS_i can comprise: obtaining the word distribution model of each label in CS_i; and, according to these word distribution models, computing the likelihood of each word W_jx of d_i with respect to each label C_k of CS_i as the initial value of the corresponding likelihood. Since the word distribution model of a label is independent of the label subset, only one word distribution model per label involved in the candidate label subsets needs to be built; compared with building a separate model for every candidate label subset, this greatly reduces the number of word distribution models required.
In step S732, determining the initial values of the weighting coefficients P(C_k | d_i) with which the likelihoods of the words of the training document d_i with respect to the labels of the current examination label subset CS_i are linearly weighted can comprise: setting each initial weighting coefficient equal to the reciprocal of the number of labels in CS_i, i.e. 1/|CS_i|.
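The two initializations of steps S731 and S732 can be sketched together as follows; `word_model[c][w]`, standing for the word distribution model P(w | C_k) of label c, is a hypothetical structure, as is the function name.

```python
def init_parameters(word_counts, labels, word_model):
    """Steps S731-S732: likelihood initial values come from the per-label
    word distribution models (one model per label, shared across all
    candidate subsets), and the initial weighting coefficients are
    uniform, 1/|CS_i|. word_model[c][w] stands in for P(w | C_k)."""
    lik = {(w, c): word_model[c][w] for w in word_counts for c in labels}
    weights = {c: 1.0 / len(labels) for c in labels}
    return lik, weights
```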
Corresponding to the multi-label text classification method of the present invention, as shown in Figure 5, the multi-label text classification system of the present invention comprises a receiving module A, a candidate-label-subset determination module 1, a word segmentation module 2, a current-examination-label-subset extraction module 3, a current-word extraction module 4, a word likelihood acquisition module 5, a weighted-likelihood computation module 6, a document likelihood computation module 7, a posterior-probability computation module 8, and a classification-result output module 9. The receiving module A receives the document to be classified input by the user. The candidate-label-subset determination module 1 determines the candidate label subsets for the received document. The word segmentation module 2 performs word segmentation on the document to be classified, obtaining its words. The current-examination-label-subset extraction module 3 extracts, in turn, one label subset from the candidate label subsets as the current examination label subset. The current-word extraction module 4 extracts, in turn, one word from the words as the current word. The word likelihood acquisition module 5 obtains the likelihood of the current word with respect to each label in the current examination label subset. The weighted-likelihood computation module 6 linearly weights the likelihoods of the current word with respect to the labels of the current examination label subset, obtaining the weighted likelihood of the current word with respect to that subset. The document likelihood computation module 7 determines the weighting coefficients of the linear weighting so as to maximize the continued product of the weighted likelihoods of all words with respect to the current examination label subset, and takes the maximized continued product as the likelihood of the document to be classified with respect to that subset; the weighting coefficients, which are shared by all words, correspond one-to-one to the labels of the current examination label subset, and their sum equals 1. The posterior-probability computation module 8 computes, from the likelihood of the document to be classified with respect to the current examination label subset, the posterior probability of the document with respect to that subset. The classification-result output module 9 chooses, from among the candidate label subsets, the subset that maximizes the posterior probability as the classification result of the document to be classified.
The candidate-label-subset determination module 1 can further comprise a tag-set acquisition unit, a label selection unit, and a candidate-label-subset output unit. The tag-set acquisition unit obtains the tag set containing all labels as the current examination label subset. The label selection unit selects, from among all the weighting coefficients determined with the whole tag set as the current examination label subset, the labels whose weighting coefficients are greater than or equal to the predetermined threshold, forming a new tag set. The candidate-label-subset output unit combines the labels of the new tag set to obtain the candidate label subsets; when all the weighting coefficients determined with the whole tag set as the current examination label subset are below the predetermined threshold, it instead selects the predetermined number of labels with the largest weighting coefficients to form the new tag set.
The document likelihood computation module 7 can also use the EM algorithm to determine the weighting coefficients of the linear weighting, so that the continued product of the weighted likelihoods of all words with respect to the current examination label subset is maximized.
The posterior-probability computation module 8 can further comprise a prior-probability computation unit and a posterior-probability computation unit. The prior-probability computation unit computes the prior probability of the current examination label subset; the posterior-probability computation unit computes the product of that prior probability and the likelihood of the document to be classified with respect to the current examination label subset, as the posterior probability of the document with respect to that subset.
The prior-probability computation unit can further comprise a training-document acquisition subunit, a training-tag-set acquisition subunit, a sorting subunit, an order-adjustment subunit, a Markov-chain training subunit, and a prior-probability computation subunit. The training-document acquisition subunit obtains all training documents. The training-tag-set acquisition subunit obtains all the labels involved in the training documents, forming the training tag set. The sorting subunit sorts the labels of the training tag set. The order-adjustment subunit adjusts the order of the multi-label annotations of all training documents so that the relative order of the labels within each annotation is consistent with the order of the corresponding labels in the training tag set. The Markov-chain training subunit trains a discrete Markov chain on the order-adjusted multi-label annotations, so that the states of the Markov chain correspond one-to-one, in the stated order, to the labels of the training tag set. The prior-probability computation subunit computes the prior probability of the current examination label subset as the product of the transition probabilities between the corresponding states of the Markov chain.
The word likelihood acquisition module 5 can further comprise a training-document acquisition subunit, a training weighted-likelihood acquisition unit, a training-parameter determination unit, and a word likelihood acquisition unit. The training-document acquisition subunit obtains the training document whose multi-label annotation is the current examination label subset. The training weighted-likelihood acquisition unit linearly weights the likelihoods of the words of the training document with respect to the labels of the current examination label subset, obtaining the weighted likelihoods of the words of the training document with respect to that subset. The training-parameter determination unit trains, with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, to obtain the likelihood of each word of the training document with respect to each label of that subset. The word likelihood acquisition unit obtains, from those likelihoods, the likelihood of the current word with respect to each label in the current examination label subset.
The training-parameter determination unit can also use the EM algorithm to train to obtain the likelihoods of the words of the training document with respect to the labels of the current examination label subset. Accordingly, the training-parameter determination unit can further comprise a likelihood initial-value determination subunit, a weighting-coefficient initial-value determination subunit, and a training-parameter determination subunit. The likelihood initial-value determination subunit determines the initial values of the likelihoods of the words of the training document with respect to the labels of the current examination label subset. The weighting-coefficient initial-value determination subunit determines the initial values of the weighting coefficients with which those likelihoods are linearly weighted. The training-parameter determination subunit trains with the EM algorithm, with the objective of maximizing the continued product of the weighted likelihoods of all words of the training document with respect to the current examination label subset, and starting from the initial likelihoods and weighting coefficients, to obtain the likelihood of each word of the training document with respect to each label of that subset.
The likelihood initial-value determination subunit can also: obtain the word distribution model of each label in the current examination label subset; and, according to those word distribution models, compute the likelihood of each word of the training document with respect to each label of the current examination label subset as the initial value of the corresponding likelihood.
The weighting-coefficient initial-value determination subunit can also set the initial value of each weighting coefficient of the linear weighting equal to the reciprocal of the number of labels in the current examination label subset.
Here, the explanations of the respective parts of the multi-label text classification system of the present invention are consistent with the explanations of the corresponding parts of the multi-label text classification method of the present invention.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The foregoing are merely preferred embodiments of the present invention, and the scope of the invention is not limited to what is shown in the drawings; any change made according to the conception of the present invention, or any equivalent embodiment of equivalent variation, that does not depart from the spirit covered by the specification and drawings shall fall within the protection scope of the present invention.