CN106570076A - Computer text classification system - Google Patents

Computer text classification system

Info

Publication number
CN106570076A
CN106570076A
Authority
CN
China
Prior art keywords
word
text
classification
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610905152.5A
Other languages
Chinese (zh)
Inventor
何正娣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201610905152.5A priority Critical patent/CN106570076A/en
Publication of CN106570076A publication Critical patent/CN106570076A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a computer text classification system. The system comprises a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module. Based on information theory, the system further refines the classification process, so that the function of each module in the classification system is made clear and both classification efficiency and classification processing rate are ensured; the added effect improvement module improves the correctness of classification.

Description

Computer text classification system
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a text classification system.
Background technology
Text classification serves as the technical foundation of fields such as information filtering, information retrieval, text databases, digital libraries, and mail classification, and is therefore widely applied. The development and spread of networks has greatly facilitated our access to information, but the sheer volume of information also makes it difficult to process: required information cannot be obtained quickly, and unwanted information arrives alongside it. Information filtering technology can be used to solve these problems. The essence of information filtering is a classification problem: it can be used both to filter out information a user dislikes and to filter in information the user is interested in. Existing text classification systems process text too inefficiently, classify poorly, and have too high an error rate.
The main object of the present invention is to provide a text classification system with high efficiency, a high classification processing rate, and high accuracy.
The content of the invention
In view of this, the technical problem to be solved by the present invention is to provide a method of selecting text classification features, in order to solve the previously insurmountable problems set forth above.
To achieve the above effect, the technical scheme of the present invention is as follows: a computer text classification system comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module;
The text preprocessing module mainly preprocesses text: it first uses word segmentation software to break the input text apart, removes punctuation marks and spaces, and divides the text into a word set; the word set is then further processed to remove meaningless words, forming a simplified word set;
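The preprocessing step above can be sketched in Python (an illustrative sketch only: the function name, the regular-expression tokenizer, and the stop-word list are hypothetical stand-ins; the patent's system uses dedicated word segmentation software, e.g. for Chinese text):

```python
import re

# Hypothetical stop-word list standing in for the "meaningless words"
# that the module removes; a real system would use a curated list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}

def preprocess(text):
    """Break text into words, drop punctuation and whitespace, remove
    stop words, and return the simplified word set."""
    # Keeping only alphanumeric runs removes punctuation marks and spaces.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}
```

Returning a set rather than a list matches the patent's "word set" terminology: duplicates are collapsed, and only membership matters for the later feature selection stage.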
The text feature extraction module is mainly based on a feature selection method. It first generates feature word subsets from the simplified word set; generating a feature word subset is a continuous search process, and the search uses a branch-and-bound algorithm. Each generated feature word subset is then evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, which is compared with a stopping threshold: if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, producing new feature word subsets through evaluation and filtering. The frequency with which each feature word occurs is calculated using a mutual information method and, combining the occurrence frequencies of the feature words, a mapping table between feature words and frequencies is obtained;
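The mutual-information scoring that this module relies on can be sketched as follows (an illustrative Python sketch computing the mutual information between a word's presence and the class labels; the function name and the presence/absence approximation are assumptions, not the patented algorithm):

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information between the event "word occurs in the document"
    and the document's class label, estimated from counts."""
    n = len(docs)
    joint = Counter()                      # (word_present, label) counts
    for doc, label in zip(docs, labels):
        joint[(word in doc, label)] += 1
    p_word = Counter(word in doc for doc in docs)
    p_label = Counter(labels)
    mi = 0.0
    for (present, label), c in joint.items():
        p_xy = c / n
        p_x = p_word[present] / n
        p_y = p_label[label] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi
```

A word that appears only in one class scores high (it carries much category information), while a word distributed independently of the labels scores zero, which matches the discrimination-degree discussion in embodiment one.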
The text training module processes the mapping table between feature words and frequencies: it randomly selects other texts, calculates the inverse document frequency, and uses the inverse document frequency as input to compute the weight value of each feature word through a trained classifier, thereby obtaining a term weight matrix;
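A minimal sketch of the inverse-document-frequency and term-weight-matrix computation (TF-IDF weighting is assumed here as the concrete scheme; the patent names only inverse document frequency and a trained classifier, so the function names and the smoothing constant are illustrative):

```python
import math

def idf(term, corpus):
    """Inverse document frequency of a term over a reference corpus of
    word sets; the +1 in the denominator avoids division by zero."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def weight_matrix(features, docs, corpus):
    """Term weight matrix: one row per document (a token list), one
    column per feature word, entries are term frequency times IDF."""
    return [[doc.count(t) * idf(t, corpus) for t in features]
            for doc in docs]
```

The "randomly selected other texts" of the patent play the role of `corpus` here: they supply the document frequencies against which each feature word's rarity is measured.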
The classification module uses the term weight matrix and an SVM classification algorithm to set classification intervals and classify the words, obtaining a word category vector set in which the words of one category belong to the same vector;
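The SVM classification step can be illustrated with a toy linear classifier trained on the hinge loss by sub-gradient descent (a sketch only; the function and its parameters are hypothetical, and a real system would use a dedicated SVM implementation):

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Toy linear SVM: sub-gradient descent on the L2-regularized hinge
    loss. X is a list of feature vectors, y a list of +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: step toward the sample, with decay.
                w = [wj + lr * (yi * xj - lam * wj)
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correctly classified with margin: regularization only.
                w = [wj - lr * lam * wj for wj in w]

    def predict(x):
        return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
    return predict
```

The decision boundary plays the role of the "classification interval" of the module: items on one side of the learned hyperplane are assigned to one category vector, items on the other side to the other.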
The text category marking module is used to mark the word category vector set: it builds a category mark table for the words, uses special symbol values in the table as the mark values of the word categories, and adds the mark values to the word category vector set to obtain a marked word category vector set;
The effect improvement module performs error statistics on the marked word category vector set. The statistics process is a random sampling process: the vectors are sorted according to the distribution law of the words, with the top 30% of the ranking treated as the emphasis region for sampling. The classification effect of the extracted samples is tested and adjusted using a redundancy parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and control returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
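The error-statistics and adjustment loop of the effect improvement module can be sketched as follows (illustrative Python; the sample structure, the classify and adjust_threshold callbacks, and the accuracy floor are all assumed interfaces, not specified by the patent):

```python
def improvement_loop(samples, classify, adjust_threshold,
                     max_adjustments=5, accuracy_floor=0.9):
    """Sketch of the effect improvement module: sort samples by word
    frequency, test classification accuracy on the top-30% emphasis
    region, and keep adjusting until the adjustment count stays in a
    'safe' range.  Returns the number of adjustments performed."""
    adjustments = 0
    while adjustments < max_adjustments:
        ranked = sorted(samples, key=lambda s: s["frequency"],
                        reverse=True)
        focus = ranked[: max(1, int(len(ranked) * 0.3))]  # emphasis region
        correct = sum(classify(s["vector"]) == s["label"] for s in focus)
        if correct / len(focus) >= accuracy_floor:
            return adjustments          # effect acceptable, stop here
        adjust_threshold()              # tweak the extraction threshold
        adjustments += 1
    return adjustments                  # too many adjustments: redo features
```

When the returned count hits `max_adjustments`, the caller would fall back to the text feature extraction module with a modified threshold, as the patent describes.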
Description of the drawings
Fig. 1 is a structural diagram of the text classification system.
Specific embodiment
In order to make the technical problems to be solved, the technical schemes, and the beneficial effects of the present invention clearer, the present invention is described in detail below with reference to the drawings and embodiments. It should be noted that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it; any product that can realize the described functions is an equivalent improvement and falls within the protection scope of the present invention. The specific methods are as follows:
Embodiment one: the conventional approach to text classification is feature selection. Useful features have strong class discrimination ability and occur concentrated in one class or a few classes. Therefore, when certain words appear in a text, it can be determined with great confidence which class or classes the text belongs to. Different features express the category of a text with different strength, and feature selection is precisely the task of selecting the features with high discriminative power for the text's category. This ability is called the text category discrimination degree, that is, a feature's ability to distinguish categories.
The text category discrimination degree of a feature is the amount of category information the feature contains: the more text category information it carries, the greater its discrimination degree, and conversely the smaller. Intuitively, it is the ability to judge a document's category by whether the feature occurs in the document. Specifically, if whether a feature appears in a document has no effect at all on judging the document's category, then the feature is meaningless for text classification, its category discrimination ability is very small, and feature selection should assign it a very small evaluation value.
The greater a feature's text category discrimination degree, the more category information it carries; whether a feature has a large class discrimination degree therefore determines how useful it is for classification. For example, a feature word that appears only in the medical category has maximal category discrimination. However, if such a feature is a rare word that seldom occurs in text, selecting it as a feature means it never appears in other texts, i.e. the feature vectors of all other texts are zero in that dimension. This leads to the sparse-matrix problem frequently encountered in text classification and hence to overfitting; the phenomenon arises from over-dependence on the training set. Therefore, when selecting text features, high-frequency words should be chosen as far as possible, i.e. the generality of the feature should be considered; this is called the generalization ability of the feature. Clearly, features with large category discrimination and strong generalization ability are the most useful for classification, while features with small discrimination and weak generalization ability are the least useful. Features with large class discrimination but weak generalization ability are more useful in high dimensions, because having enough features avoids the data sparsity problem and guarantees recall, while more discriminative features improve classification accuracy. Features with small class discrimination but strong generalization ability are more useful in low dimensions, because they still provide coverage under data sparsity even though their ability to distinguish categories is not strong.
Embodiment two: finding the right redundancy parameter value can optimize the text classification effect. For the whole text classification system, the algorithm considers different redundancy parameter values and estimates the classification effect index of each value on the training set. Using cross-validation, estimates of the mean and variance of the evaluation index are obtained, together with the statistical significance between two systems. The optimal redundancy parameter value is the one that achieves the highest statistical significance relative to A = 0 (i.e. information gain feature selection). For each redundancy parameter, cross-validation is carried out on the text classification system using the training set: the training set is randomly divided into several parts, one part is selected as the validation test set, and the rest serve as the validation training set. The system composed of feature selection and the classification algorithm is trained on the validation training set, the text classification effect is then measured on the validation test set, and the evaluation index is computed. The parts are cycled proportionally through the roles of training set and test set to repeat the experiment. For example, in four-fold cross-validation the training set is divided equally into four parts: the first time, subsets 1, 2, and 3 form the validation training set and subset 4 is the validation test set; the next time, subsets 1, 2, and 4 form the training set and subset 3 is the test set.
For different redundancy parameter values, the mean and variance of the experimental effect index are examined. From the experiments in this work, four-fold cross-validation is sufficient: using more folds does not noticeably strengthen the effect, and although more folds give better estimates of the mean and variance, they also increase training-time consumption. Considering two redundancy parameters, cross-validation is carried out on each, yielding one group of effect-index data apiece, and a significance test is defined to decide whether one redundancy parameter gives a better effect than the other. Starting from an initial value, the algorithm tests different redundancy parameters until an optimum is found: when a redundancy parameter value has the highest statistical significance, it is taken as the optimum.
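The four-fold cross-validation procedure described above can be sketched as follows (illustrative Python; the fold construction and the train_and_score callback are assumptions about interfaces the patent leaves unspecified):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split n sample indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, k, train_and_score):
    """Each fold serves once as the validation test set while the
    remaining folds train; returns the mean and variance of the
    per-fold effect index."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [samples[j]
                 for f, fold in enumerate(folds) if f != i
                 for j in fold]
        test = [samples[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / k
    return mean, var
```

Comparing two redundancy parameters then amounts to running `cross_validate` once per parameter and applying a significance test to the two (mean, variance) pairs.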
From the angle of information theory, the feature selection process is exactly the process of selecting, from the candidate feature set, the feature subset with the maximal mutual information with the class labels. On this basis, approximate algorithms for four mutual-information feature selection methods are compared. Because all text classification algorithms except information gain require a redundancy parameter, a wrapper-style algorithm is needed to search for the best redundancy parameter value. The comparison shows that, relative to using the fixed value 0.4 for the redundancy parameter, there is improvement to varying degrees in all cases, and in some cases the improvement is obvious. This further demonstrates that the redundancy parameter and the number of selected features are related to the classification algorithm used.
Embodiment three: according to evaluation experiments on existing text classification algorithms, Rocchio, KNN, and SVM stand out. KNN and SVM achieve higher classification accuracy, while the Rocchio method has the lowest time and space complexity; this work therefore mainly analyzes and compares these three methods.
The Rocchio classifier is a method based on the vector space model and minimum distance. Its greatest feature is a good feedback function: the classification vector space can be corrected according to its formula. The method was first proposed by Hull in 1994 and has been widely applied since. The Rocchio formula is:
where W′_jc is the weight of the center vector of class C, β is the number of positive examples in the training sample, and γ is the number of counter-examples in the training sample.
The vector distance measurement formula is:
The Rocchio method's principle is simple and its computation rapid. The calculation procedure is: represent each text as a high-dimensional vector in the vector space; give positive weights to the vectors of positive examples in the training set and negative weights to the vectors of counter-examples; and add and average them to compute the center of each category. For a text belonging to the test set, its similarity to each class center is calculated and the text is assigned to the category with the greatest similarity. From this calculation process, the Rocchio classifier can reach good classification precision for category distributions whose between-class distances are large and within-class distances small; for category distributions that do not achieve such a "good distribution", the Rocchio classifier performs worse. But because its computation is simple and rapid, the method is commonly used in applications with high demands on classification time, and it has become the standard against which other classification techniques are compared. The whole implementation and evaluation procedure of this classifier can be expressed as follows:
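The patent's Rocchio formulas were rendered as images and are not reproduced in this text, so the sketch below illustrates only the principle described: class centers computed by averaging class vectors and nearest-centroid assignment by cosine similarity (illustrative Python; all names are hypothetical):

```python
import math

def centroid(vectors):
    """Class center vector: component-wise mean of the class's texts."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n
            for d in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity, a common choice of vector distance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(doc, class_vectors):
    """Assign the document to the class with the most similar center."""
    centroids = {c: centroid(vs) for c, vs in class_vectors.items()}
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```

This sketch omits the negative-example term of the full Rocchio feedback formula; it captures only the simple, fast center-and-compare behavior that makes the method attractive when classification time matters.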
The class-center calculation formula is:
where n_i is the number of texts in class L_i, and D_ij is the j-th text vector of category L_i. After the category of a text has been determined, the system finds related texts within the limits of the text library and recommends them to the user. The system represents texts in the vector space using the features remaining after dimensionality reduction and weights the feature items with TF-IDF. In order to reduce the amount of indexing and matching computation, the 20 feature dimensions with the highest TF-IDF values are used to represent document D:
D = [z_1, TFIDF(z_1), z_2, TFIDF(z_2), ..., z_20, TFIDF(z_20)]
A text classification task can be regarded as filling in a {0, 1} table whose abscissa is the set of categories and whose ordinate is the set of documents: each document and each category correspond to a number, with 0 meaning the document does not belong to the category and 1 meaning that it does. In order to reduce the uncertainty in the experiments, this classification process needs to be stated more precisely.
Automatic text classification has two typical testing methods: the train-and-test method and k-fold cross-evaluation. The train-and-test method is the classical evaluation method: it divides the original training collection T into a training set and a test set, carries out feature selection and classifier training with the training set, and tests the classifier with the test set. The k-fold cross-evaluation method divides the original training collection into k parts {T_1, T_2, ..., T_k}, performs k tests, and finally takes their average as the final result.
T_train = T − T_i, T_test = T_i, i = 1, 2, ..., k
k-fold cross-evaluation is usually used when the original training collection is very small; the aim is to make full use of the initial samples for training. The strictest and most accurate cross-evaluation method is the leave-one-out (LOO) method: suppose there are m samples; each time, one sample serves as the test sample and the remaining samples serve as training samples, and the average of the m tests is taken as the final result.
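The leave-one-out evaluation can be sketched as follows (illustrative Python; train_and_test is an assumed callback that trains on the given samples and returns a per-run effect index for the single held-out sample):

```python
def leave_one_out(samples, train_and_test):
    """Leave-one-out evaluation: each of the m samples is the test
    sample exactly once; the mean of the m runs is the final result."""
    m = len(samples)
    total = 0.0
    for i in range(m):
        train = samples[:i] + samples[i + 1:]
        total += train_and_test(train, samples[i])
    return total / m
```

With m samples this trains m classifiers, which is why LOO is reserved for very small training collections despite being the strictest evaluation.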
The above embodiments describe the present invention in detail. It should be noted that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it; any product that can realize the described functions is an equivalent improvement and is included within the protection scope of the present invention.
The beneficial effects of the invention are: based on information theory, the present invention further refines the classification process, clarifies the function of each module in the computer text classification system, and guarantees the efficiency and processing rate of classification; an effect improvement module is added, improving the correctness of classification.

Claims (1)

1. A computer text classification system, characterized by comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module;
the text preprocessing module mainly preprocesses input text in the computing section of the computer: it first breaks the input text apart using word segmentation software, removes punctuation marks and spaces, and divides the text into a word set; the word set is then further processed to remove meaningless words, forming a simplified word set;
the text feature extraction module is mainly based on a feature selection method: it first generates feature word subsets from the simplified word set, the generation of a feature word subset being a continuous search process that uses a branch-and-bound search algorithm; each generated feature word subset is evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, and the evaluation value is compared with a stopping threshold; if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, producing new feature word subsets through evaluation and filtering, each new feature word subset being composed of feature words; the frequency with which each feature word occurs is calculated using a mutual information method and, combining the occurrence frequencies of the feature words, a mapping table between the feature words and their occurrence frequencies is obtained;
the text training module processes the mapping table between the feature words and their occurrence frequencies: it randomly selects other texts, calculates the inverse document frequency, and uses the calculated inverse document frequency as input to compute the weight value of each feature word through a trained classifier, thereby obtaining a term weight matrix;
the classification module uses the term weight matrix and an SVM classification algorithm to set classification intervals and classify the words in the simplified word set, obtaining a word category vector set in which the words of one category belong to the same vector in the word category vector set;
the text category marking module is used to mark the word category vector set: it builds a category mark table for the words, uses special symbol values in the category mark table as the mark values of the word categories, and adds the mark values of the word categories to the word category vector set to obtain a marked word category vector set;
the effect improvement module performs error statistics on the marked word category vector set, the statistics process being a random sampling process: it first extracts marked word category vectors from the marked word category vector set, sorting them according to the distribution law of the words, with the top 30% of the ranking as the emphasis region for extraction; the classification effect of the extracted samples is tested and adjusted using a redundancy parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and control returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
CN201610905152.5A 2016-10-11 2016-10-11 Computer text classification system Pending CN106570076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610905152.5A CN106570076A (en) 2016-10-11 2016-10-11 Computer text classification system


Publications (1)

Publication Number Publication Date
CN106570076A true CN106570076A (en) 2017-04-19

Family

ID=60414153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610905152.5A Pending CN106570076A (en) 2016-10-11 2016-10-11 Computer text classification system

Country Status (1)

Country Link
CN (1) CN106570076A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081667A (en) * 2011-01-23 2011-06-01 浙江大学 Chinese text classification method based on Base64 coding
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503153A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text classification system and text classification method thereof
CN106503153B (en) * 2016-10-21 2019-05-10 江苏理工学院 Computer text classification system
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 App software engineer soft-skill classification system and method
CN107579816A (en) * 2017-09-06 2018-01-12 中国科学院半导体研究所 Password dictionary generation method based on recurrent neural network
CN107579816B (en) * 2017-09-06 2020-05-19 中国科学院半导体研究所 Method for generating password dictionary based on recurrent neural network
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 Classifier construction method based on semantic computation, and classifier
CN108388914B (en) * 2018-02-26 2022-04-01 中译语通科技股份有限公司 Classifier construction method based on semantic calculation and classifier
CN110969006A (en) * 2019-12-02 2020-04-07 支付宝(杭州)信息技术有限公司 Training method and system of text sequencing model
CN110969006B (en) * 2019-12-02 2023-03-21 支付宝(杭州)信息技术有限公司 Training method and system of text sequencing model
CN112364629A (en) * 2020-11-27 2021-02-12 苏州大学 Text classification system and method based on redundancy-removing mutual information feature selection

Similar Documents

Publication Publication Date Title
CN106570076A (en) Computer text classification system
CN107291723A (en) Method and apparatus for web page text classification, and method and apparatus for web page text recognition
CN103632168B (en) Classifier integration method for machine learning
CN107798033B (en) Case text classification method in public security field
CN105138653B (en) Topic recommendation method based on typicality and difficulty, and recommendation apparatus thereof
CN105975518B (en) Expected cross-entropy feature selection text classification system and method based on information entropy
CN104391835A (en) Method and device for selecting feature words in texts
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN110222744A (en) Naive Bayes classification model improvement method based on attribute weighting
CN106156372B (en) A kind of classification method and device of internet site
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN109933670A (en) Text classification method calculating semantic distance based on a combination matrix
CN102622373A (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104239485A (en) Statistical machine learning-based internet hidden link detection method
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN105260437A (en) Text classification feature selection method and application thereof to biomedical text classification
CN107273505A (en) Supervised cross-modal hashing retrieval method based on a nonparametric Bayesian model
CN111680225B (en) WeChat financial message analysis method and system based on machine learning
CN103617435A (en) Image sorting method and system for active learning
CN114707571B (en) Credit data anomaly detection method based on enhanced isolation forest
CN110516074A (en) Website theme classification method and device based on deep learning
CN102129568A (en) Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier
CN106815605B (en) Data classification method and equipment based on machine learning
CN103268346B (en) Semisupervised classification method and system
CN109783633A (en) Data analysis service procedural model recommended method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170419)