CN101727463A - Text training method and text classifying method - Google Patents

Text training method and text classifying method

Info

Publication number
CN101727463A
CN101727463A (application CN200810225033A)
Authority
CN
China
Prior art keywords
classification
center vector
sample
vector
text
Prior art date
Legal status
Pending
Application number
CN200810225033A
Other languages
Chinese (zh)
Inventor
谭松波 (Tan Songbo)
许洪波 (Xu Hongbo)
程学旗 (Cheng Xueqi)
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN200810225033A
Publication of CN101727463A
Pending legal status: Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text training method comprising the following steps: 1) computing a center vector for each class of a training sample set; 2) classifying the samples in the training sample set according to the class center vectors; and 3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight and push weight. Classifying text according to the center vectors obtained by this training method is both highly accurate and fast.

Description

Text training method and text classification method
Technical field
The present invention relates to the field of pattern recognition, and in particular to a text training method and a text classification method.
Background technology
With the rapid development of Internet technology, the volume of online text is growing exponentially. Faced with this vast sea of text data, people urgently need an efficient tool that can automatically process and organize large-scale text. Automatic text classification assigns large numbers of texts to classes according to their content, helping people process and organize text data effectively. Researching and developing high-performance text classification methods has therefore increasingly become a focus of the information retrieval field. Many classic machine learning methods have been introduced into text classification; commonly used ones include the centroid classification method, Rocchio, k-nearest neighbors (KNN), Winnow, naive Bayes (NB), support vector machines (SVM), and voting methods.
Among these text classification methods, support vector machines and voting methods achieve the highest classification accuracy, but their training and classification times are both very long, making it hard to meet the needs of large-scale text classification. In contrast, the centroid classification method, naive Bayes, Rocchio, and Winnow are typical linear classifiers whose training and classification times are linear in the scale of the problem, so they can meet the speed requirements of large-scale text classification, but their accuracy is often unsatisfactory. In general, existing classification methods still sacrifice one goal for the other and fall short of the requirements of a high-performance text classification method.
Summary of the invention
The technical problem to be solved by the present invention is to achieve a balance (trade-off) between accuracy and speed. To this end, a text training method is provided such that classifying text according to the center vectors obtained by this method is both highly accurate and fast.
According to one aspect of the present invention, a text training method is provided, comprising the following steps:
1) computing a center vector for each class of a training sample set;
2) classifying the samples in the training sample set according to the class center vectors;
3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight (dragweight) and push weight (pushweight).
In the above method, step 3) comprises:
31) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class A to which the incorrectly classified sample belongs according to the formula

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
32) if ||C_A^{S,o+1}||_2 > 1, normalizing said C_{A,l}^{S,o+1} according to the formula

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2

where C^N denotes the normalized C^S.
In the above method, step 3) comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
Wherein, step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
Wherein, the drag weight dragweight is 1.0.
Wherein, the push weight pushweight is 1.0.
In the above method, steps 2) and 3) are performed a maximum number of iteration steps.
Wherein, the maximum number of iteration steps lies in the range [5, 8].
According to another aspect of the present invention, a text classification method is also provided, comprising the following step:
classifying a new text according to the center vectors obtained by any of the above text training methods.
The beneficial effect of the present invention is that performing text classification with the center vectors obtained by the text training method of the present invention not only improves classification accuracy but also keeps the actual training and classification times short.
Description of drawings
Fig. 1 is a flowchart of the text training method according to an embodiment of the present invention.
Embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the text classification method according to embodiments of the present invention is further described below with reference to the accompanying drawing. It should be understood that the specific embodiments described here are intended only to explain the present invention and not to limit it.
The present invention seeks to improve classification accuracy by correcting (or, in other words, optimizing) the class center vectors while keeping the time complexity of the centroid classification method unchanged.
The idea behind the centroid classification method is simple: for each class of texts, a center vector representing the class is generated as the arithmetic mean of the class's document vectors; when a new text arrives, its vector is determined, the distance (similarity) between this vector and each class center vector is computed, and the new text is finally assigned to the nearest class. The concrete steps are as follows (a minimal code sketch is given after the steps):
Step 1: compute the center vector of each class of the training sample set;
Step 2: when a new text arrives, segment it into words and represent it as a feature vector;
Step 3: compute the similarity between the new text's feature vector and each class center vector;
Step 4: compare these similarities and assign the new text to the class with the maximum similarity.
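The following is a minimal sketch of Steps 2 to 4 only, assuming numpy feature vectors and cosine similarity as the similarity measure; the function name, class labels, and vectors are illustrative assumptions, not part of the patented method:

```python
import numpy as np

def classify(doc_vec: np.ndarray, centroids: dict[str, np.ndarray]) -> str:
    """Steps 2-4: assign a document vector to the class whose center
    vector has the maximum cosine similarity with it."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return 0.0 if denom == 0 else float(a @ b) / denom
    return max(centroids, key=lambda c: cos(doc_vec, centroids[c]))

# Illustrative 4-feature space with two class center vectors.
centroids = {
    "A": np.array([0.8, 0.1, 0.0, 0.1]),
    "B": np.array([0.1, 0.7, 0.2, 0.0]),
}
print(classify(np.array([0.9, 0.0, 0.1, 0.0]), centroids))  # prints: A
```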
Analyzing the training errors of the centroid classification method shows why a sample t gets misclassified: its similarity to its correct class A is smaller than its similarity to the class B it was wrongly assigned to, that is, Sim(t, C_A) < Sim(t, C_B), where C_A and C_B denote the center vectors of class A and class B respectively, and Sim(t, C_A) and Sim(t, C_B) denote the similarity of sample t to class A and to class B respectively. Under a given feature set, sample t is fixed, so the only factors that can cause such an error are the center vector C_A of class A and the center vector C_B of class B. If suitable corrections are made to these two center vectors so that Sim(t, C_A) > Sim(t, C_B) (i.e., Dist(t, C_A) < Dist(t, C_B), where Dist(t, C_A) and Dist(t, C_B) denote the distances from sample t to the center vector C_A of class A and to the center vector C_B of class B respectively), then sample t will be classified correctly.
For this reason, the present invention proposes a classification method based on a correction strategy. The basic approach of the correction strategy is to use each misclassified sample to correct the center vector of the class it was wrongly assigned to and/or the center vector of the class it belongs to, as computed in step 1 above. Preferably, both center vectors are corrected at the same time, which both guarantees the consistency of the error correction and reaches a high-level performance balance point quickly. For example, if a sample t of class A is wrongly assigned to class B, then the similarity between sample t and the class-A center vector should be increased and the similarity between it and the class-B center vector decreased. Clearly, after this "correction" operation the probability that sample t is classified correctly is greatly increased. The present invention classifies all training samples and performs the "correction" operation on every misclassified sample. Only a small number of repetitions of the correction operation over all training samples are needed to reach a stable performance point.
Based on the above analysis, and as shown in the flowchart of Fig. 1, the concrete flow of one embodiment of the present invention is as follows:
First, the center vector of each class of the training sample set is computed according to formula (1). Those skilled in the art will understand that, besides formula (1), other methods may also be used in this step to compute the class center vectors.

C_i = (1/|C_i|) Σ_{d∈C_i} d   (1)

where d denotes a sample document vector, C_i denotes the center vector of class i, and |C_i| denotes the number of documents in class i.
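Formula (1) might be computed as in the following sketch (numpy, with an illustrative document matrix and label list; the function name and data layout are assumptions):

```python
import numpy as np

def class_centroids(docs: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    """Formula (1): the center vector C_i of class i is the arithmetic
    mean of the document vectors d belonging to that class."""
    centroids = {}
    for c in set(labels):
        members = docs[[j for j, y in enumerate(labels) if y == c]]
        centroids[c] = members.mean(axis=0)  # (1/|C_i|) * sum over d in class i
    return centroids

docs = np.array([[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]])
print(class_centroids(docs, ["A", "A", "B"]))
# e.g. {'A': array([0.9, 0.1]), 'B': array([0.1, 0.9])}
```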
Next, the necessary parameters are set, for example the maximum number of iteration steps (max-iteration-step), whose preferred range is [5, 8], and the drag weight (dragweight) and push weight (pushweight), whose preferred values are both 1.0.
Each sample text in the training set is then classified according to the class center vectors. If the classification is correct, the next sample text in the training set is classified directly, until all sample texts have been classified. If the classification is wrong, the center-vector correction step described below is performed first, and then the next sample text is classified, until all sample texts have been classified. After the operations of checking classification correctness and correcting the center vectors have been performed over all sample texts max-iteration-step times, the procedure ends.
If a sample d belonging to class A has been wrongly assigned to class B, then, according to a preferred embodiment of the present invention, formulas (2) to (5) are used to correct the center vector of class A and the center vector of class B.
The once-corrected center vector of class A is computed according to formula (2):

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l,   if d_l > 0   (2)

If ||C_A^{S,o+1}||_2 ≤ 1, no further processing is done; otherwise the vector obtained from formula (2) is normalized according to formula (3):

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2   (3)
Similarly, the once-corrected center vector of class B is computed according to formula (4), and if ||C_B^{S,o+1}||_2 > 1 it is normalized according to formula (5):

C_{B,l}^{S,o+1} = [C_{B,l}^{S,o} - pushweight × d_l]_+,   if d_l > 0   (4)

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2   (5)
where d denotes a sample document vector and d_l the weight of feature l in the text vector; C^S denotes a center vector computed according to formula (1) and C_l^S the component of C^S on feature l; C^N denotes the normalized C^S and C_l^N the component of C^N on feature l; o denotes the iteration step; and ||·||_2 denotes the 2-norm.
In formula (4), [z]_+ is defined as [z]_+ = max{z, 0} and serves to remove negative components from the center vector. Using this function is a preferred option for improving the accuracy of the center-vector computation, because experiments show that center vectors containing negative components may reduce accuracy. Those skilled in the art will understand that the basic effect of the present invention can still be achieved without this function in formula (4).
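For illustration, the correction operation of formulas (2) to (5) and the training loop above might be sketched as follows; numpy vectors, cosine similarity, and the reading that a corrected vector is renormalized only when its 2-norm exceeds 1 are assumptions here, as are all names:

```python
import numpy as np

def _cos(a, b):
    n = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if n == 0 else float(a @ b) / n

def correct(c_A, c_B, d, dragweight=1.0, pushweight=1.0):
    """One 'correction' operation, formulas (2)-(5): d is a sample of
    class A that was wrongly assigned to class B; only the features
    with d_l > 0 are updated."""
    pos = d > 0
    c_A, c_B = c_A.copy(), c_B.copy()
    c_A[pos] += dragweight * d[pos]   # (2): drag C_A toward the sample
    c_B[pos] -= pushweight * d[pos]   # (4): push C_B away from the sample
    c_B = np.maximum(c_B, 0.0)        # [z]_+ clips negative components to 0
    for c in (c_A, c_B):              # (3)/(5): renormalize if ||.||_2 > 1
        norm = np.linalg.norm(c)
        if norm > 1.0:
            c /= norm
    return c_A, c_B

def train(docs, labels, centroids, max_iteration_step=5,
          dragweight=1.0, pushweight=1.0):
    """Steps 2)-3) repeated max-iteration-step times: reclassify every
    training sample and correct both center vectors on each mistake."""
    for _ in range(max_iteration_step):
        for d, y in zip(docs, labels):
            pred = max(centroids, key=lambda c: _cos(d, centroids[c]))
            if pred != y:
                centroids[y], centroids[pred] = correct(
                    centroids[y], centroids[pred], d, dragweight, pushweight)
    return centroids
```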
In the above embodiment, the maximum number of iteration steps is used as the stopping condition of the method. Those skilled in the art will understand that other stopping conditions may be used to terminate the method, for example the proportion of correctly classified sample texts among the total number of sample texts.
Let D be the number of training documents, T the number of test documents, W the total number of terms, K the total number of classes, and M the maximum number of iteration steps. The time complexity of building the centroid classifier is then O(DW + KW), where O(DW) is for computing the center vectors and O(KW) for normalizing them; since D > K, the training time complexity of the centroid classification method is about O(DW). The correction stage must classify the D documents and, for each misclassified document, correct the center vectors of 2 classes, so its time complexity is O(D(KW + 2W)); since K ≥ 2, the time needed is about O(DKW). The time complexity of the whole training stage is therefore O(DW + MDKW), which is about O(MDKW), so the training stage of the correction strategy is still linear in the number of documents D. The classifier finally produced by the correction strategy is still K center vectors, so the classification time complexity is the same as that of the centroid classification method, namely O(TKW). A classifier according to the present invention is therefore also a linear classifier.
Experiments were carried out on a Chinese corpus, with accuracy and time as evaluation metrics, and with the centroid classification method and the support vector machine method as baselines.
The training corpus comprises 2898 economics articles and the test corpus comprises 1449 economics articles, divided into 4 classes; the dictionary size is 40748 and the number of training steps is 5.
The inventive method reaches essentially the same accuracy as the support vector machine method, approximately 86%, which is 6 percentage points above the centroid classification method (80%). The classification time of the inventive method is 0.698 seconds, essentially the same as the centroid classification method (0.688 seconds) and far below the support vector machine method (206.021 seconds). The inventive method therefore achieves classification accuracy comparable to the support vector machine method while its actual running time is far lower.
It should be noted and understood that various modifications and improvements may be made to the invention described in detail above without departing from the spirit and scope of the invention as defined by the appended claims. Accordingly, the scope of the claimed technical solutions is not limited by any of the specific exemplary teachings given.

Claims (11)

1. A text training method, comprising the following steps:
1) computing a center vector for each class of a training sample set;
2) classifying the samples in the training sample set according to the class center vectors;
3) for each incorrectly classified sample, correcting the center vector of the class A to which the sample belongs and/or the center vector of the class B to which it was wrongly assigned, according to a preset drag weight (dragweight) and push weight (pushweight).
2. The method according to claim 1, characterized in that step 3) comprises:
31) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class A to which the incorrectly classified sample belongs according to the formula

C_{A,l}^{S,o+1} = C_{A,l}^{S,o} + dragweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
32) if ||C_A^{S,o+1}||_2 > 1, normalizing said C_{A,l}^{S,o+1} according to the formula

C_{A,l}^{N,o+1} = C_{A,l}^{S,o+1} / ||C_A^{S,o+1}||_2

where C^N denotes the normalized C^S.
3. The method according to claim 1, characterized in that step 3) comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
4. The method according to claim 3, characterized in that step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
5. The method according to claim 2, characterized in that step 3) further comprises:
33) when the weight d_l of feature l in the sample vector d is greater than 0, correcting the center vector of the class B to which the incorrectly classified sample was wrongly assigned according to the formula

C_{B,l}^{S,o+1} = C_{B,l}^{S,o} - pushweight × d_l

where C^S denotes a class center vector and o denotes the iteration step;
34) if ||C_B^{S,o+1}||_2 > 1, normalizing said C_{B,l}^{S,o+1} according to the formula

C_{B,l}^{N,o+1} = C_{B,l}^{S,o+1} / ||C_B^{S,o+1}||_2

where C^N denotes the normalized C^S.
6. The method according to claim 5, characterized in that step 33) further comprises:
331) if said C_{B,l}^{S,o+1} is less than 0, setting C_{B,l}^{S,o+1} to 0.
7. The method according to claim 1, characterized in that the drag weight dragweight is 1.0.
8. The method according to claim 1, characterized in that the push weight pushweight is 1.0.
9. The method according to claim 1, characterized in that steps 2) and 3) are performed a maximum number of iteration steps.
10. The method according to claim 9, characterized in that the maximum number of iteration steps lies in the range [5, 8].
11. A text classification method, comprising the following step:
classifying a new text according to the center vectors obtained by the text training method of any one of claims 1 to 10.
CN200810225033A 2008-10-24 2008-10-24 Text training method and text classifying method Pending CN101727463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810225033A CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810225033A CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Publications (1)

Publication Number Publication Date
CN101727463A true CN101727463A (en) 2010-06-09

Family

ID=42448363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810225033A Pending CN101727463A (en) 2008-10-24 2008-10-24 Text training method and text classifying method

Country Status (1)

Country Link
CN (1) CN101727463A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101984435A (en) * 2010-11-17 2011-03-09 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN101984435B (en) * 2010-11-17 2012-10-10 百度在线网络技术(北京)有限公司 Method and device for distributing texts
CN102289514A (en) * 2011-09-07 2011-12-21 中国科学院计算技术研究所 Social label automatic labelling method and social label automatic labeller
CN102289514B (en) * 2011-09-07 2016-03-30 中国科学院计算技术研究所 The method of Social Label automatic marking and Social Label automatic marking device
CN110633604A (en) * 2018-06-25 2019-12-31 富士通株式会社 Information processing method and information processing apparatus
CN110633604B (en) * 2018-06-25 2023-04-25 富士通株式会社 Information processing method and information processing apparatus
CN111259155A (en) * 2020-02-18 2020-06-09 中国地质大学(武汉) Word frequency weighting method and text classification method based on specificity
CN111259155B (en) * 2020-02-18 2023-04-07 中国地质大学(武汉) Word frequency weighting method and text classification method based on specificity

Similar Documents

Publication Publication Date Title
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN100583101C (en) Text categorization feature selection and weight computation method based on field knowledge
CN102622373B (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN106776538A (en) The information extracting method of enterprise's noncanonical format document
CN107193959A (en) A kind of business entity's sorting technique towards plain text
CN105069141A (en) Construction method and construction system for stock standard news library
CN103020167B (en) A kind of computer Chinese file classification method
CN110795564B (en) Text classification method lacking negative cases
CN106446230A (en) Method for optimizing word classification in machine learning text
CN101021838A (en) Text handling method and system
CN110659367B (en) Text classification number determination method and device and electronic equipment
CN111859983B (en) Natural language labeling method based on artificial intelligence and related equipment
CN109522544A (en) Sentence vector calculation, file classification method and system based on Chi-square Test
CN105205124A (en) Semi-supervised text sentiment classification method based on random feature subspace
CN101882136B (en) Method for analyzing emotion tendentiousness of text
Jarvis Data mining with learner corpora
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN114996464B (en) Text grading method and device using ordered information
CN103500216A (en) Method for extracting file information
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
Adeleke et al. Automating quranic verses labeling using machine learning approach
CN104182463A (en) Semantic-based text classification method
CN101727463A (en) Text training method and text classifying method
Dhar et al. Bengali news headline categorization using optimized machine learning pipeline
CN101470699A (en) Information extraction model training apparatus, information extraction apparatus and information extraction system and method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100609