CN101727463A - Text training method and text classifying method - Google Patents

Text training method and text classifying method

Info

Publication number
CN101727463A
CN101727463A (application CN200810225033A)
Authority
CN
China
Prior art keywords
classification
center vector
sample
vector
text
Prior art date
Legal status
Pending
Application number
CN200810225033A
Other languages
Chinese (zh)
Inventor
谭松波 (Tan Songbo)
许洪波 (Xu Hongbo)
程学旗 (Cheng Xueqi)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN200810225033A
Publication of CN101727463A
Pending legal status: Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text training method comprising the following steps: 1) computing a center vector for each class of a training sample set; 2) classifying the samples in the training sample set according to the class center vectors; and 3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight and push weight. Classifying text according to the center vectors obtained by this training method is both highly accurate and fast.

Description

Text training method and text classification method
Technical field
The present invention relates to the field of pattern recognition, and in particular to a text training method and a text classification method.
Background technology
With the rapid development of Internet technology, the volume of online text is growing exponentially. Faced with this vast sea of text data, people urgently need an efficient tool that can automatically process and organize large-scale text. Automatic text classification assigns large numbers of texts to classes according to their content, helping people process and organize text data effectively. Researching and developing high-performance text classification methods has therefore increasingly become a focus of the information retrieval field. Many classic machine learning methods have been introduced into text classification; commonly used ones include the centroid classification method, Rocchio, k-nearest neighbors (KNN), Winnow, naive Bayes (NB), support vector machines (SVM), and voting methods.
Among these text classification methods, support vector machines and voting methods achieve the highest classification accuracy, but their training and classification times are both very long, making it hard to meet the needs of large-scale text classification. In contrast, the centroid classification method, naive Bayes, Rocchio, and Winnow are typical linear classifiers whose training and classification times are linear in the scale of the problem, so they can meet the speed requirements of large-scale text classification, but their accuracy is often unsatisfactory. In general, existing classification methods still sacrifice one goal for the other and fall short of the requirements of a high-performance text classification method.
Summary of the invention
The technical problem to be solved by the present invention is to achieve a balance (trade-off) between accuracy and speed. To this end, a text training method is provided such that classifying text according to the center vectors obtained by this method is both highly accurate and fast.
According to one aspect of the present invention, a text training method is provided, comprising the following steps:
1) computing a center vector for each class of a training sample set;
2) classifying the samples in the training sample set according to the class center vectors;
3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight (dragweight) and push weight (pushweight).
In the above method, step 3) comprises:
31) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class A to which the incorrectly classified sample belongs according to the formula

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
32) if ||C_A^{S,o+1}||_2 > 1, normalizing said C_{A,l}^{S,o+1} according to the formula

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2

where C^N denotes the normalized C^S.
In the above method, step 3) comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
Wherein, step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
Wherein, the drag weight dragweight is 1.0.
Wherein, the push weight pushweight is 1.0.
In the above method, steps 2) and 3) are performed a maximum number of iteration steps.
Wherein, the maximum number of iteration steps lies in the range [5, 8].
According to another aspect of the present invention, a text classification method is also provided, comprising the following step:
classifying a new text according to the center vectors obtained by any of the above text training methods.
The beneficial effect of the present invention is that performing text classification with the center vectors obtained by the text training method of the present invention not only improves classification accuracy but also keeps the actual training and classification times short.
Description of drawings
Fig. 1 is a flowchart of the text training method according to an embodiment of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the text classification method according to embodiments of the present invention is further described below with reference to the accompanying drawing. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
The present invention seeks to improve classification accuracy by correcting (or, in other words, optimizing) the class center vectors while keeping the time complexity of the centroid classification method unchanged.
The idea behind the centroid classification method is simple: for each class of texts, a center vector representing the class is generated as the arithmetic mean of the class's document vectors; when a new text arrives, its vector is determined, the distance (similarity) between this vector and each class center vector is computed, and the new text is finally assigned to the nearest class. The concrete steps are as follows (a minimal code sketch is given after the steps):
Step 1: compute the center vector of each class of the training sample set;
Step 2: when a new text arrives, segment it into words and represent it as a feature vector;
Step 3: compute the similarity between the new text's feature vector and each class center vector;
Step 4: compare these similarities and assign the new text to the class with the maximum similarity.
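The following is a minimal sketch of Steps 2 to 4 only, assuming numpy feature vectors and cosine similarity as the similarity measure; the function name, class labels, and vectors are illustrative assumptions, not part of the patented method:

```python
import numpy as np

def classify(doc_vec: np.ndarray, centroids: dict[str, np.ndarray]) -> str:
    """Steps 2-4: assign a document vector to the class whose center
    vector has the maximum cosine similarity with it."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(a @ b) / denom
    return max(centroids, key=lambda c: cos(doc_vec, centroids[c]))

# Illustrative 4-feature space with two class center vectors.
centroids = {
    "A": np.array([0.8, 0.1, 0.0, 0.1]),
    "B": np.array([0.1, 0.7, 0.2, 0.0]),
}
print(classify(np.array([0.9, 0.0, 0.1, 0.0]), centroids))  # prints: A
```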
Analyzing the training errors of the centroid classification method shows why a sample t gets misclassified: its similarity to its correct class A is smaller than its similarity to the class B it was wrongly assigned to, that is, Sim(t, C_A) < Sim(t, C_B), where C_A and C_B denote the center vectors of class A and class B respectively, and Sim(t, C_A) and Sim(t, C_B) denote the similarity of sample t to class A and to class B respectively. Under a given feature set, sample t is fixed, so the only factors that can cause such an error are the center vector C_A of class A and the center vector C_B of class B. If suitable corrections are made to these two center vectors so that Sim(t, C_A) > Sim(t, C_B) (i.e., Dist(t, C_A) < Dist(t, C_B), where Dist(t, C_A) and Dist(t, C_B) denote the distances from sample t to the center vector C_A of class A and to the center vector C_B of class B respectively), then sample t will be classified correctly.
For this reason, the present invention proposes a classification method based on a correction strategy. The basic approach of the correction strategy is to use each misclassified sample to correct the center vector of the class it was wrongly assigned to and/or the center vector of the class it belongs to, as computed in step 1 above. Preferably, both center vectors are corrected at the same time, which both guarantees the consistency of the error correction and reaches a high-level performance balance point quickly. For example, if a sample t of class A is wrongly assigned to class B, then the similarity between sample t and the class-A center vector should be increased and the similarity between it and the class-B center vector decreased. Clearly, after this "correction" operation the probability that sample t is classified correctly is greatly increased. The present invention classifies all training samples and performs the "correction" operation on every misclassified sample. Only a small number of repetitions of the correction operation over all training samples are needed to reach a stable performance point.
Based on the above analysis, and as shown in the flowchart of Fig. 1, the concrete flow of one embodiment of the present invention is as follows:
First, the center vector of each class of the training sample set is computed according to formula (1). Those skilled in the art will understand that, besides formula (1), other methods may also be used in this step to compute the class center vectors.

C_i = (1/|C_i|) Σ_{d∈C_i} d   (1)

where d denotes a sample document vector, C_i denotes the center vector of class i, and |C_i| denotes the number of documents in class i.
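Formula (1) might be computed as in the following sketch (numpy, with an illustrative document matrix and label list; the function name and data layout are assumptions):

```python
import numpy as np

def class_centroids(docs: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    """Formula (1): the center vector C_i of class i is the arithmetic
    mean of the document vectors d belonging to that class."""
    centroids = {}
    for c in set(labels):
        members = docs[[j for j, y in enumerate(labels) if y == c]]
        centroids[c] = members.mean(axis=0)  # (1/|C_i|) * sum over d in class i
    return centroids

docs = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]])
print(class_centroids(docs, ["A", "A", "B"]))
# e.g. {'A': array([0.9, 0.1]), 'B': array([0.1, 0.9])}
```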
Next, the necessary parameters are set, for example the maximum number of iteration steps (max-iteration-step), whose preferred range is [5, 8], and the drag weight (dragweight) and push weight (pushweight), whose preferred values are both 1.0.
Each sample text in the training set is then classified according to the class center vectors. If the classification is correct, the next sample text in the training set is classified directly, until all sample texts have been classified. If the classification is wrong, the center-vector correction step described below is performed first, and then the next sample text is classified, until all sample texts have been classified. After the operations of checking classification correctness and correcting the center vectors have been performed over all sample texts max-iteration-step times, the procedure ends.
If a sample d belonging to class A has been wrongly assigned to class B, then, according to a preferred embodiment of the present invention, formulas (2) to (5) are used to correct the center vector of class A and the center vector of class B.
The once-corrected center vector of class A is computed according to formula (2):

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l,   if d_l > 0   (2)

If ||C_A^{S,o+1}||_2 ≤ 1, no further processing is done; otherwise the vector obtained from formula (2) is normalized according to formula (3):

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2   (3)
Similarly, the once-corrected center vector of class B is computed according to formula (4), and if ||C_B^{S,o+1}||_2 > 1 it is normalized according to formula (5):

C_{B,l}^{S,o+1} = [C_{B,l}^{S,o} - pushweight × d_l]_+,   if d_l > 0   (4)

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2   (5)
where d denotes a sample document vector and d_l the weight of feature l in the text vector; C^S denotes a center vector computed according to formula (1) and C_l^S the component of C^S on feature l; C^N denotes the normalized C^S and C_l^N the component of C^N on feature l; o denotes the iteration step; and ||·||_2 denotes the 2-norm.
In formula (4), [z]_+ is defined as [z]_+ = max{z, 0} and serves to remove negative components from the center vector. Using this function is a preferred option for improving the accuracy of the center-vector computation, because experiments show that center vectors containing negative components may reduce accuracy. Those skilled in the art will understand that the basic effect of the present invention can still be achieved without this function in formula (4).
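For illustration, the correction operation of formulas (2) to (5) and the training loop above might be sketched as follows; numpy vectors, cosine similarity, and the reading that a corrected vector is renormalized only when its 2-norm exceeds 1 are assumptions here, as are all names:

```python
import numpy as np

def _cos(a, b):
    n = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if n == 0 else float(a @ b) / n

def correct(c_A, c_B, d, dragweight=1.0, pushweight=1.0):
    """One 'correction' operation, formulas (2)-(5): d is a sample of
    class A that was wrongly assigned to class B; only the features
    with d_l > 0 are updated."""
    pos = d > 0
    c_A, c_B = c_A.copy(), c_B.copy()
    c_A[pos] += dragweight * d[pos]   # (2): drag C_A toward the sample
    c_B[pos] -= pushweight * d[pos]   # (4): push C_B away from the sample
    c_B = np.maximum(c_B, 0.0)        # [z]_+ clips negative components to 0
    for c in (c_A, c_B):              # (3)/(5): renormalize if ||.||_2 > 1
        norm = np.linalg.norm(c)
        if norm > 1.0:
            c /= norm
    return c_A, c_B

def train(docs, labels, centroids, max_iteration_step=5,
          dragweight=1.0, pushweight=1.0):
    """Steps 2)-3) repeated max-iteration-step times: reclassify every
    training sample and correct both center vectors on each mistake."""
    for _ in range(max_iteration_step):
        for d, y in zip(docs, labels):
            pred = max(centroids, key=lambda c: _cos(d, centroids[c]))
            if pred != y:
                centroids[y], centroids[pred] = correct(
                    centroids[y], centroids[pred], d, dragweight, pushweight)
    return centroids
```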
In the above embodiment, the maximum number of iteration steps is used as the stopping condition of the method. Those skilled in the art will understand that other stopping conditions may be used to terminate the method, for example the proportion of correctly classified sample texts among the total number of sample texts.
Let D be the number of training documents, T the number of test documents, W the total number of terms, K the total number of classes, and M the maximum number of iteration steps. The time complexity of building the centroid classifier is then O(DW + KW), where O(DW) is for computing the center vectors and O(KW) for normalizing them; since D > K, the training time complexity of the centroid classification method is about O(DW). The correction stage must classify the D documents and, for each misclassified document, correct the center vectors of 2 classes, so its time complexity is O(D(KW + 2W)); since K ≥ 2, the time needed is about O(DKW). The time complexity of the whole training stage is therefore O(DW + MDKW), which is about O(MDKW), so the training stage of the correction strategy is still linear in the number of documents D. The classifier finally produced by the correction strategy is still K center vectors, so the classification time complexity is the same as that of the centroid classification method, namely O(TKW). A classifier according to the present invention is therefore also a linear classifier.
Experiments were carried out on a Chinese corpus, with accuracy and time as evaluation metrics, and with the centroid classification method and the support vector machine method as baselines.
The training corpus comprises 2898 economics articles and the test corpus comprises 1449 economics articles, divided into 4 classes; the dictionary size is 40748 and the number of training steps is 5.
The inventive method reaches essentially the same accuracy as the support vector machine method, approximately 86%, which is 6 percentage points above the centroid classification method (80%). The classification time of the inventive method is 0.698 seconds, essentially the same as the centroid classification method (0.688 seconds) and far below the support vector machine method (206.021 seconds). The inventive method therefore achieves classification accuracy comparable to the support vector machine method while its actual running time is far lower.
It should be noted and understood that various modifications and improvements may be made to the invention described in detail above without departing from the spirit and scope of the invention as defined by the appended claims. Accordingly, the scope of the claimed technical solutions is not limited by any of the specific exemplary teachings given.

Claims (11)

1. A text training method, comprising the following steps:
1) computing a center vector for each class of a training sample set;
2) classifying the samples in the training sample set according to the class center vectors;
3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight (dragweight) and push weight (pushweight).
2. The method according to claim 1, characterized in that step 3) comprises:
31) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class A to which the incorrectly classified sample belongs according to the formula

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
32) if ||C_A^{S,o+1}||_2 > 1, normalizing said C_{A,l}^{S,o+1} according to the formula

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2

where C^N denotes the normalized C^S.
3. The method according to claim 1, characterized in that step 3) comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
4. The method according to claim 3, characterized in that step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
5. The method according to claim 2, characterized in that step 3) further comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
6. The method according to claim 5, characterized in that step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
7. The method according to claim 1, characterized in that the drag weight dragweight is 1.0.
8. The method according to claim 1, characterized in that the push weight pushweight is 1.0.
9. The method according to claim 1, characterized in that steps 2) and 3) are performed a maximum number of iteration steps.
10. The method according to claim 9, characterized in that the maximum number of iteration steps lies in the range [5, 8].
11. A text classification method, comprising the following step:
classifying a new text according to the center vectors obtained by the text training method of any one of claims 1 to 10.
CN200810225033A 2008-10-24 2008-10-24 Text training method and text classifying method Pending CN101727463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810225033A CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810225033A CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Publications (1)

Publication Number Publication Date
CN101727463A true CN101727463A (en) 2010-06-09

Family

ID=42448363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810225033A Pending CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Country Status (1)

Country Link
CN (1) CN101727463A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN101984435B (en) * 2010-11-17 2012-10-10 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102289514A (en) * 2011-09-07 2011-12-21 中国科学院计算技术研究所 Social label automatic labelling method and social label automatic labeller
CN102289514B (en) * 2011-09-07 2016-03-30 中国科学院计算技术研究所 The method of Social Label automatic marking and Social Label automatic marking device
CN110633604A (en) * 2018-06-25 2019-12-31 富士通株式会社 Information processing method and information processing apparatus
CN110633604B (en) * 2018-06-25 2023-04-25 富士通株式会社 Information processing method and information processing apparatus
CN111259155A (en) * 2020-02-18 2020-06-09 中国地质大学(武汉) Word frequency weighting method and text classification method based on specificity
CN111259155B (en) * 2020-02-18 2023-04-07 中国地质大学(武汉) Word frequency weighting method and text classification method based on specificity

Similar Documents

Publication Publication Date Title
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN107193959A (en) A kind of business entity's sorting technique towards plain text
CN105069141A (en) Construction method and construction system for stock standard news library
CN103020167B (en) A kind of computer Chinese file classification method
CN110795564B (en) Text classification method lacking negative cases
CN106446230A (en) Method for optimizing word classification in machine learning text
CN101021838A (en) Text handling method and system
CN110659367B (en) Text classification number determination method and device and electronic equipment
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN105205124A (en) Semi-supervised text sentiment classification method based on random feature subspace
CN101882136B (en) Method for analyzing emotion tendentiousness of text
Jarvis Data mining with learner corpora
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN114996464B (en) Text grading method and device using ordered information
CN103500216A (en) Method for extracting file information
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Adeleke et al. Automating quranic verses labeling using machine learning approach
CN104182463A (en) Semantic-based text classification method
CN101727463A (en) Text training method and text classifying method
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
CN101470699A (en) Information extraction model training apparatus, information extraction apparatus and information extraction system and method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100609