CN107273500A - Text classifier generation method, text classification method, apparatus and computer device - Google Patents

Text classifier generation method, text classification method, apparatus and computer device

Info

Publication number
CN107273500A
CN107273500A (application CN201710457280.2A)
Authority
CN
China
Prior art keywords
classification
sample
classifier
training
new sample class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710457280.2A
Other languages
Chinese (zh)
Inventor
谢瑜
张昊
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co Ltd
Shanghai Zhizhen Intelligent Network Technology Co Ltd
China Electronics Standardization Institute
Original Assignee
BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co Ltd
Shanghai Zhizhen Intelligent Network Technology Co Ltd
China Electronics Standardization Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING SAIXI TECHNOLOGY DEVELOPMENT Co Ltd, Shanghai Zhizhen Intelligent Network Technology Co Ltd and China Electronics Standardization Institute
Priority claimed from CN201710457280.2A
Publication of CN107273500A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a text classifier generation method, a text classification method, an apparatus, and a computer device, which address the poor classification performance caused by sample overlap between classes in a training sample set. The text classifier generation method includes: merging at least two original sample classes in the training sample set that exhibit sample overlap to obtain a new sample class, where the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes; training a first classifier on the training sample set after the merge operation; and training a second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong. The second classifier re-classifies the texts whose class in the first classifier's results is the new sample class, dividing them into the corresponding original sample classes.

Description

Text classifier generation method, text classification method, apparatus and computer device
Technical field
The present invention relates to the field of communication technologies, and in particular to a text classifier generation method, a text classification method, an apparatus, and a computer device.
Background art
In text classification, the quality of the training samples largely determines the performance of the classifier.
For example, when sample overlap occurs between classes in the training samples, the two or more overlapping classes inevitably degrade the performance of the overall classifier: the classification accuracy of those classes is relatively low, and the overall classification effect is poor.
Summary of the invention
The technical problem to be solved by the present invention is to provide a text classifier generation method, a text classification method, an apparatus, and a computer device, so as to solve the prior-art problem of poor classification performance caused by sample overlap in a training sample set.
In one aspect, the present invention provides a text classifier generation method, including: merging at least two original sample classes in a training sample set that exhibit sample overlap to obtain a new sample class, where the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes; training a first classifier on the training sample set after the merge operation; and training a second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong, where the second classifier re-classifies the texts whose class in the first classifier's results is the new sample class, dividing them into the corresponding original sample classes.
Optionally, training the first classifier on the training sample set after the merge operation includes: training on the merged training sample set with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, and the random forest classification algorithm.
Optionally, training the second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong includes: training on the training samples belonging to each new sample class and the original sample classes to which those training samples belong, to obtain a second classifier for each new sample class respectively.
Optionally, the training samples belonging to the new sample class and the original sample classes to which those training samples belong are trained with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, and the random forest classification algorithm; the classes output by the second classifier correspond to the original sample classes of the training samples in the new sample class.
Optionally, before merging the at least two original sample classes in the training sample set that exhibit sample overlap to obtain the new sample class, the method further includes: preprocessing a training corpus to filter the training corpus and/or unify its format; and performing word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Optionally, the training corpus includes sentences and/or text fragments.
Optionally, the method further includes: performing a new word discovery operation on the training corpus and adding the discovered new words to the word segmentation dictionary.
Optionally, the new word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
Optionally, the method further includes: testing the classification accuracy of each class of the first classifier; and testing the classification accuracy of each class of the second classifier. The classification accuracies of the first classifier are P1j, where j is an integer with 1 ≤ j ≤ m and m is the number of sample classes in the training sample set after the merge operation. The classification accuracies of the second classifier are P1h × P2k, where k is an integer with 1 ≤ k ≤ n and n is the number of original sample classes of the training samples in the new sample class; P1h is the classification accuracy of the new sample class in the first classifier, h being an integer with 1 ≤ h ≤ g and g being the number of new sample classes. The method detects whether the classification accuracy of every class of the first classifier exceeds a first probability threshold and whether the classification accuracy of every class of the second classifier exceeds a second probability threshold; if so, the first classifier and the second classifier are determined to have been trained successfully.
Optionally, after detecting whether the classification accuracy of every class of the first classifier exceeds the first probability threshold and whether the classification accuracy of every class of the second classifier exceeds the second probability threshold, the method further includes: if not, re-merging at least two original sample classes that exhibit sample overlap in the training sample set and re-training the classifiers, until the classification accuracy of every class of the first classifier exceeds the first probability threshold and the classification accuracy of every class of the second classifier exceeds the second probability threshold.
Optionally, the method further includes: adjusting the first probability threshold and the second probability threshold to filter out first and second classifiers of different classification accuracies.
Optionally, the classification accuracy of the first classifier is tested by cross validation over the sample population, and the classification accuracy of the second classifier is tested by cross validation over the sample population.
Optionally, the cross validation over the sample population includes: using 60% to 90% of the samples in the sample population as the training sample set and the remaining samples as texts to be classified for testing, where the sample population includes the multiple original sample classes and each sample belongs to one of the multiple original sample classes.
Optionally, merging the at least two original sample classes that exhibit sample overlap into a new sample class includes: merging all original sample classes that exhibit sample overlap into a single new sample class.
Optionally, before merging the at least two original sample classes in the training sample set that exhibit sample overlap to obtain the new sample class, the method further includes: training on the training sample set a third classifier whose classes are the multiple original sample classes, testing the classification accuracy of each class of the third classifier, and preliminarily screening out the original sample classes whose classification accuracy is below a third threshold; and identifying, among the preliminarily screened original sample classes, the original sample classes that exhibit sample overlap.
In another aspect, the present invention also provides a text classification method that classifies with the classifiers generated by the text classifier generation method provided by the present invention. The classification method includes: inputting a set of texts to be classified into the first classifier to obtain a first classification result; and inputting the texts whose class in the first classification result is the new sample class into the second classifier corresponding to that new sample class, to obtain a second classification result.
Optionally, the method further includes: taking the results in the first classification result whose classes are unmerged original sample classes, together with the second classification result, as the final classification result of the texts to be classified.
In another aspect, the present invention also provides a text classifier generation apparatus, including: a merging unit, configured to merge at least two original sample classes in a training sample set that exhibit sample overlap to obtain a new sample class, where the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes; a first training unit, configured to train a first classifier on the training sample set after the merge operation; and a second training unit, configured to train a second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong, where the second classifier is configured to re-classify the texts whose class in the first classifier's results is the new sample class, dividing them into the corresponding original sample classes.
Optionally, the first training unit is specifically configured to: train on the merged training sample set with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB), support vector machine (SVM), K-nearest-neighbor (KNN), and random forest classification algorithms.
Optionally, the second training unit is specifically configured to: train on the training samples belonging to each new sample class and the original sample classes to which those training samples belong, to obtain a second classifier for each new sample class respectively.
Optionally, the second training unit is specifically configured to train with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB), support vector machine (SVM), K-nearest-neighbor (KNN), and random forest classification algorithms, where the classes output by the second classifier correspond to the original sample classes of the training samples in the new sample class.
Optionally, the apparatus further includes: a preprocessing unit, configured to preprocess a training corpus to filter the training corpus and/or unify its format before the at least two original sample classes in the training sample set that exhibit sample overlap are merged to obtain the new sample class; and a word segmentation unit, configured to perform word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Optionally, the training corpus includes sentences or text fragments.
Optionally, the apparatus further includes a new word discovery unit, configured to perform a new word discovery operation on the training corpus and add the discovered new words to the word segmentation dictionary.
Optionally, the new word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
Optionally, the apparatus further includes: a first test unit, configured to test the classification accuracy of each class of the first classifier; a second test unit, configured to test the classification accuracy of each class of the second classifier, where the classification accuracies of the first classifier are P1j, j being an integer with 1 ≤ j ≤ m and m being the number of sample classes in the training sample set after the merge operation, and the classification accuracies of the second classifier are P1h × P2k, k being an integer with 1 ≤ k ≤ n and n being the number of original sample classes of the training samples in the new sample class, P1h being the classification accuracy of the new sample class in the first classifier, h being an integer with 1 ≤ h ≤ g, and g being the number of new sample classes; a detection unit, configured to detect whether the classification accuracy of every class of the first classifier exceeds a first probability threshold and whether the classification accuracy of every class of the second classifier exceeds a second probability threshold; and a determining unit, configured to determine, if the detection result of the detection unit is yes, that the first classifier and the second classifier have been trained successfully.
Optionally, the apparatus further includes: a return unit, configured to, if the detection result of the detection unit is no, re-merge at least two original sample classes that exhibit sample overlap in the training sample set and re-train the classifiers, until the classification accuracy of every class of the first classifier exceeds the first probability threshold and the classification accuracy of every class of the second classifier exceeds the second probability threshold.
Optionally, the apparatus further includes: an adjustment unit, configured to adjust the first probability threshold and the second probability threshold to filter out first and second classifiers of different classification accuracies.
Optionally, the first test unit is specifically configured to test the classification accuracy of the first classifier by cross validation over the sample population, and the second test unit is specifically configured to test the classification accuracy of the second classifier by cross validation over the sample population.
Optionally, the cross validation over the sample population includes: using 60% to 90% of the samples in the sample population as the training sample set and the remaining samples as texts to be classified for testing, where the sample population includes the multiple original sample classes and each sample belongs to one of the multiple original sample classes.
Optionally, the merging unit is specifically configured to merge all original sample classes that exhibit sample overlap into a single new sample class.
Optionally, the apparatus further includes a screening unit, configured to: before the at least two original sample classes in the training sample set that exhibit sample overlap are merged to obtain the new sample class, train on the training sample set a third classifier whose classes are the multiple original sample classes, test the classification accuracy of each class of the third classifier, and preliminarily screen out the original sample classes whose classification accuracy is below a third threshold; and identify, among the preliminarily screened original sample classes, the original sample classes that exhibit sample overlap.
In another aspect, the present invention also provides a text classification apparatus that classifies with the classifiers generated by any text classifier generation apparatus provided by the present invention. The classification apparatus includes: a first input unit, configured to input a set of texts to be classified into the first classifier to obtain a first classification result; and a second input unit, configured to input the texts whose class in the first classification result is the new sample class into the second classifier corresponding to that new sample class, to obtain a second classification result.
Optionally, the apparatus further includes: a result output unit, configured to take the results in the first classification result whose classes are unmerged original sample classes, together with the second classification result, as the final classification result of the texts to be classified.
In another aspect, the present invention also provides a computer device, including a processor and a memory, where the memory is configured to store computer instructions and the processor is configured to run the computer instructions stored in the memory to perform any text classifier generation method provided by the present invention.
In another aspect, the present invention also provides a computer device, including a processor and a memory, where the memory is configured to store computer instructions and the processor is configured to run the computer instructions stored in the memory to perform any text classification method provided by the present invention.
In another aspect, the present invention also provides a computer-readable storage medium storing instructions which, when run, perform any text classifier generation method provided by the present invention.
In another aspect, the present invention also provides a computer-readable storage medium storing instructions which, when run, perform any text classification method provided by the present invention.
With the text classifier generation method, text classification method, apparatus, and computer device provided by the embodiments of the present invention, the training of the first classifier accurately separates the original sample classes that exhibit sample overlap from those that do not, and the training of the second classifier separates the overlapping original sample classes from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the classification accuracy of the text classifier.
Brief description of the drawings
Fig. 1 is a flowchart of a text classifier generation method provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of a text classifier generation method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a text classification method provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text classifier generation apparatus provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a text classification apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein merely explain the present invention and do not limit it.
As shown in Fig. 1, an embodiment of the present invention provides a text classifier generation method, including:
S11, merging at least two original sample classes in a training sample set that exhibit sample overlap to obtain a new sample class, where the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes;
S12, training a first classifier on the training sample set after the merge operation;
S13, training a second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong, where the second classifier re-classifies the texts whose class in the first classifier's results is the new sample class, dividing them into the corresponding original sample classes.
The text classifier generation method provided by the embodiments of the present invention adopts a hierarchical classifier generation scheme when training a text classifier from training samples: first, two or more original sample classes in the training sample set that exhibit sample overlap are merged into a new sample class and a first classifier is trained on the merged training sample set; then, finer-grained classification training is performed on the new sample class to obtain a second classifier. In this way, the training of the first classifier accurately separates the original sample classes that exhibit sample overlap from those that do not, and the training of the second classifier separates the overlapping original sample classes from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the classification accuracy of the text classifier.
Specifically, sample overlap in the embodiments of the present invention means that, in the provided training sample set, the classes of some sample data are not clear-cut; for example, if sample data that should belong to class A appear in class B, classes A and B are considered to exhibit sample overlap. Sample overlap is also known as class overlap or data set overlap. Since a text classifier is trained from these training sample sets, such sample overlap inevitably affects the classification accuracy of the resulting text classifier. The text classifier generation method provided by the embodiments of the present invention effectively mitigates such overlap between sample classes, as detailed below.
In step S11, the training sample set includes multiple original sample classes, which correspond to the target classes desired by the user, and each training sample belongs to one of the multiple original sample classes. In one embodiment of the present invention, the training sample set includes four original sample classes A, B, C, and D, where sample overlap exists between original sample class A and original sample class C; A and C can then be merged to generate a new sample class G, and the merged training sample set includes original sample classes B and D and the new sample class G.
Accordingly, in step S12, training the first classifier on the merged training sample set means dividing all elements of the training sample set among the sample classes B, D, and G. Such classification training yields the first classifier.
Of course, there may be more than two original sample classes that exhibit sample overlap, and more than one new sample class may result from the merge; as long as the merge helps correct the sample overlap, classification accuracy can be improved, and the embodiments of the present invention impose no limitation on this.
For example, if sample overlap exists among all four original sample classes A, B, C, and D in the above embodiment, or if sample overlap exists between A and B and, at the same time, between C and D, the merge operation may either combine the four original sample classes A, B, C, and D into one new sample class G1 (i.e., merge all original sample classes that exhibit sample overlap into a single new sample class), or merge A and B into a new sample class G2 and C and D into a new sample class G3 (i.e., combine the overlapping original sample classes into multiple new sample classes).
Optionally, the first classifier can be obtained by training the merged training sample set with one or more of the following classification algorithms: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, the random forest classification algorithm, and so on.
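As a concrete illustration of steps S11 and S12, the sketch below merges two assumed overlapping classes and trains a first classifier. It is a minimal sketch using scikit-learn; the toy texts, the class names A to D, the TF-IDF features, and the linear SVM are all illustrative assumptions, not anything prescribed by the method:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    # Toy training set: four original classes A, B, C, D, where A and C
    # are assumed to exhibit sample overlap (both concern order refunds).
    texts = [
        "refund my order", "order was refunded",         # class A
        "reset my password", "login keeps failing",      # class B
        "refund for a cancelled order", "order refund",  # class C (overlaps A)
        "update billing address", "change my address",   # class D
    ]
    labels = ["A", "A", "B", "B", "C", "C", "D", "D"]

    # S11: merge the overlapping original classes A and C into new class G.
    merge_map = {"A": "G", "C": "G"}
    merged_labels = [merge_map.get(y, y) for y in labels]

    # S12: train the first classifier over B, D and the merged class G.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(texts)
    first_clf = LinearSVC().fit(X, merged_labels)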
It should be noted that the original sample classes that exhibit sample overlap may already be known, or may need to be identified. When the training sample set contains a very large number of samples and original sample classes, the original sample classes that exhibit sample overlap can be identified as follows:
before merging the at least two original sample classes in the training sample set that exhibit sample overlap to obtain the new sample class, train on the training sample set a third classifier whose classes are the multiple original sample classes, test the classification accuracy of each class of the third classifier, and preliminarily screen out the original sample classes whose classification accuracy is below a third threshold;
then identify, among the preliminarily screened original sample classes, the original sample classes that exhibit sample overlap.
The principle of this identification method is that sample overlap between classes in the training sample set inevitably lowers the classification accuracy of the trained classifier, so classification accuracy can be used to preliminarily screen the training sample set for overlap, after which the original sample classes that exhibit sample overlap are identified among the screened classes.
Here, the original sample classes that exhibit sample overlap can be identified by manually verifying, or machine-matching the data of, the original sample classes below the third threshold.
For example, the higher the third threshold is set, the more sensitive the detection of sample overlap. If, among the original sample classes of the training sample set, sample data that should belong to class A appear in class B while class A contains no samples that belong to other classes, the classification accuracy of class B in the third classifier falls below the third threshold while that of class A is unaffected; manual verification or machine data matching then shows that classes A and B of the training sample set exhibit sample overlap. If, however, sample data that should belong to class A appear in class B and sample data that belong to class B also appear in class A, the classification accuracies of both class A and class B in the third classifier fall below the third threshold, and manual verification or machine data matching can determine whether classes A and B also overlap with other classes.
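A rough sketch of this preliminary screening, reusing the toy data above: a baseline "third classifier" is evaluated per class, and classes whose accuracy falls below the third threshold become overlap suspects. The cross-validated per-class recall used here and the threshold value 0.8 are illustrative assumptions:

    import numpy as np
    from sklearn.metrics import recall_score
    from sklearn.model_selection import cross_val_predict
    from sklearn.svm import LinearSVC

    def screen_low_accuracy_classes(X, labels, third_threshold=0.8):
        """Evaluate a baseline classifier over the original classes and
        return the classes whose accuracy falls below the third threshold."""
        y = np.asarray(labels)
        preds = cross_val_predict(LinearSVC(), X, y, cv=2)
        classes = np.unique(y)
        per_class_acc = recall_score(y, preds, labels=classes, average=None)
        return [c for c, acc in zip(classes, per_class_acc)
                if acc < third_threshold]

    # suspects = screen_low_accuracy_classes(X, labels)
    # Manual verification or data matching would then confirm which of
    # the suspect classes actually overlap with one another.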
The classification accuracy of each class of the third classifier can specifically be tested by cross validation over the sample population. After the first classifier has been trained, the second classifier can be trained in step S13. Specifically, training can be performed on the training samples belonging to the new sample class and the original sample classes to which those training samples belong, obtaining a second classifier for each new sample class respectively.
Taking the above embodiment as an example again, if sample overlap exists between original sample class A and original sample class C, A and C are merged to generate the new sample class G, classification training is performed on the merged training sample set, each element of the training sample set is divided into sample class B, D, or G, and the first classifier is obtained. After the first classifier is obtained, classification training is performed within the new sample class G, subdividing the elements in G into class A and class C, where A and C are the original sample classes of the training samples belonging to the new sample class G.
In this way, the training of the second classifier separates the overlapping original sample classes from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the classification accuracy of the text classifier.
Optionally, the classification algorithm used to obtain the second classifier may include one or more of: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm, and the random forest classification algorithm, where the classes output by the second classifier correspond to the original sample classes of the training samples in the new sample class.
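Continuing the same sketch for step S13, the second classifier is trained only on the samples that were merged into G, with their original labels A and C restored; the variable names carry over from the earlier snippets and remain illustrative:

    from sklearn.svm import LinearSVC

    # S13: within the merged class G, train the second classifier on the
    # original classes (A or C) of the samples that were merged into G.
    g_indices = [i for i, y in enumerate(labels)
                 if merge_map.get(y, y) == "G"]
    X_g = X[g_indices]
    original_labels_g = [labels[i] for i in g_indices]
    second_clf = LinearSVC().fit(X_g, original_labels_g)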
Further, the training sample set in the above embodiment can be obtained from a training corpus through certain processing. To obtain the above training sample set, in one embodiment of the present invention, before the at least two original sample classes in the training sample set that exhibit sample overlap are merged to obtain the new sample class, the text classifier generation method provided by the embodiment of the present invention may further include:
preprocessing the training corpus to filter the training corpus and/or unify its format;
performing word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments, and its concrete form may be speech, text, images, and so on. The corpus is first preprocessed to unify its format into text and to filter out invalid formats, and the result is saved for later use. Then word segmentation is performed on the preprocessed training corpus according to the word segmentation dictionary to obtain the training sample set.
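A minimal sketch of this segmentation step, using jieba as one example of a dictionary-driven Chinese word segmenter; the dictionary path and the sample sentences are placeholder assumptions:

    import jieba

    # A custom segmentation dictionary can be loaded first
    # (the path is a placeholder):
    # jieba.load_userdict("user_dict.txt")
    corpus = ["今天天气很好", "请帮我查询订单状态"]
    training_samples = [" ".join(jieba.lcut(line)) for line in corpus]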
Further, the word segmentation dictionary can be expanded: for example, a new word discovery operation can be performed on the training corpus and the discovered new words added to the word segmentation dictionary. New words obtained in this way update the word segmentation dictionary, so that subsequent word segmentation uses the updated dictionary; the dictionary is thus continuously improved and the accuracy of word segmentation effectively increased.
Optionally, the new word discovery operation can be implemented by one or more of the following: mutual information, co-occurrence probability, and information entropy.
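As one possible reading of the mutual-information criterion, the sketch below scores adjacent character pairs by pointwise mutual information and keeps frequent, strongly associated pairs as new-word candidates; the character-bigram granularity, the thresholds, and the optional jieba dictionary update are illustrative assumptions, not the patent's prescribed implementation:

    import math
    from collections import Counter

    def pmi_new_word_candidates(sentences, min_pmi=3.0, min_count=2):
        """Score adjacent character bigrams by pointwise mutual information
        and return frequent, strongly associated pairs as candidates."""
        unigrams, bigrams, total = Counter(), Counter(), 0
        for s in sentences:
            chars = list(s)
            unigrams.update(chars)
            bigrams.update(zip(chars, chars[1:]))
            total += len(chars)
        candidates = []
        for (a, b), count in bigrams.items():
            if count < min_count:
                continue
            p_ab = count / total
            pmi = math.log(p_ab / ((unigrams[a] / total) * (unigrams[b] / total)))
            if pmi >= min_pmi:
                candidates.append(a + b)
        return candidates

    # new_words = pmi_new_word_candidates(raw_corpus_sentences)
    # for word in new_words:
    #     jieba.add_word(word)  # one way to extend a segmentation dictionary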
To verify the classification performance of the generated first classifier and second classifier, the text classifier generation method provided by the embodiments of the present invention may further include:
testing the classification accuracy of each class of the first classifier;
testing the classification accuracy of each class of the second classifier;
where the classification accuracies of the first classifier are P1j, j being an integer with 1 ≤ j ≤ m and m being the number of sample classes in the training sample set after the merge operation;
the classification accuracies of the second classifier are P1h × P2k, k being an integer with 1 ≤ k ≤ n and n being the number of original sample classes of the training samples in the new sample class, P1h being the classification accuracy of the new sample class in the first classifier, h being an integer with 1 ≤ h ≤ g, and g being the number of new sample classes;
detecting whether the classification accuracy of every class of the first classifier exceeds a first probability threshold, and whether the classification accuracy of every class of the second classifier exceeds a second probability threshold;
and, if so, determining that the first classifier and the second classifier have been trained successfully.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. During testing, if the classification accuracy of every class in the first classifier's results exceeds 0.98 and the classification accuracy of every class in the second classifier's results exceeds 0.95, the text classifier generated by the text classifier generation method provided by the embodiment of the present invention has reached the user's requirements.
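A compact sketch of this acceptance check, under the assumption that the per-class accuracies have already been measured; note that the end-to-end accuracy of an original class k routed through a merged class is the product P1h × P2k:

    def classifiers_accepted(p1, p2_given_g, p1_g, t1=0.98, t2=0.95):
        """p1: per-class accuracy of the first classifier (e.g. B, D, G);
        p2_given_g: per-class accuracy of the second classifier within G;
        p1_g: accuracy of the merged class G in the first classifier."""
        first_ok = all(acc > t1 for acc in p1.values())
        # Combined accuracy for an original class k inside G is P1h * P2k.
        second_ok = all(p1_g * acc > t2 for acc in p2_given_g.values())
        return first_ok and second_ok

    # classifiers_accepted({"B": 0.99, "D": 0.99, "G": 0.99},
    #                      {"A": 0.97, "C": 0.98}, p1_g=0.99)  # -> True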
Optionally, when testing classification accuracy, data whose class attributes are the same as or similar to those of the training sample set, labeled with the relevant classes, can be used; such data can be constructed by an algorithm or obtained by cross validation.
Specifically, the classification accuracy of the first classifier can be tested by cross validation over the sample population, and the classification accuracy of the second classifier can likewise be tested by cross validation over the sample population.
Here, the sample population refers to all sample data related to the classification task. Cross validation over the sample population may use one part of the population as the training sample set and another part as the test sample set; for example, 60% to 90% (e.g., 80%) of the samples in the population are used as the training sample set and the remaining samples as texts to be classified for testing, where the population includes the multiple original sample classes and each sample belongs to one of the multiple original sample classes.
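A sketch of this population-level holdout with the 80% split from the example above, reusing the toy texts and labels from the earlier snippets; scikit-learn's train_test_split stands in for whatever splitting procedure an implementation actually uses:

    from sklearn.model_selection import train_test_split

    # Hold out part of the sample population: 80% for training, the rest
    # as texts to be classified (the 60%-90% range allows other splits).
    # Adding stratify=labels keeps class balance when every class has
    # enough samples.
    train_texts, test_texts, train_labels, test_labels = train_test_split(
        texts, labels, train_size=0.8, random_state=0)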
Optionally, after detecting whether the classification accuracy of every class of the first classifier exceeds the first probability threshold and whether the classification accuracy of every class of the second classifier exceeds the second probability threshold, the text classifier generation method provided by the embodiments of the present invention may further include: if not, re-merging at least two original sample classes that exhibit sample overlap in the training sample set and re-training the classifiers, until the classification accuracy of every class of the first classifier exceeds the first probability threshold and the classification accuracy of every class of the second classifier exceeds the second probability threshold.
That is, if after evaluating the first classifier and the second classifier the classification accuracy of any class is found to be below the corresponding probability threshold, the first classifier and the second classifier have not yet reached the user's requirements; the original sample classes that exhibit sample overlap must then be merged again and the classifiers re-trained until the accuracies reach the above thresholds.
Optionally, to meet different requirements on classifier accuracy, in one embodiment of the present invention the first probability threshold and the second probability threshold can be adjusted to filter out first and second classifiers of different classification accuracies.
The text classifier generation method provided by the embodiments of the present invention is described in detail below through a specific embodiment.
As shown in Fig. 2, the text classifier generation method provided by this embodiment may specifically include the following steps:
S201, preprocessing: unify the format of the obtained training corpus into text, filter out invalid formats, and save the result for later use;
S202, new word discovery: find new word candidates in the training corpus with an existing new word discovery tool, and add them to the word segmentation dictionary after manual filtering;
S203, perform word segmentation on the preprocessed training corpus according to the word segmentation dictionary;
S204, sample screening: build a classifier over the original sample classes and test its classification accuracy P0 by cross validation over the sample population (the per-class accuracies being P01, P02, …, P0i, …). Based on the classification results (classes that exhibit sample overlap all have relatively low classification accuracy), select, manually or by simple matching, the class sets that exhibit sample overlap (mutually overlapping classes form one class set, so there may be one or more class sets);
S205, sample regrouping: merge the two or more classes that exhibit sample overlap, keeping the other classes unchanged;
S206, train and generate the first classifier: perform classification training on the merged training sample set, and likewise test the classification accuracy P1 by cross validation over the sample population (the per-class accuracies being P11, P12, …, P1j, …);
S207, train and generate the second classifier: build one or more classifiers over the new sample classes generated by the merge, and test their classification accuracy P2 by cross validation over the sample population (the per-class accuracies being P21, P22, …, P2k, …);
S208, accuracy test: detect whether the classification accuracy of every class of the first classifier exceeds the first probability threshold, and whether the classification accuracy of every class of the second classifier exceeds the second probability threshold;
if so, determine that the first classifier and the second classifier have been trained successfully;
if not, re-merge at least two original sample classes that exhibit sample overlap in the training sample set and re-train the classifiers, until the classification accuracy of every class of the first classifier exceeds the first probability threshold and the classification accuracy of every class of the second classifier exceeds the second probability threshold.
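Read as pseudocode, S204 to S208 form a screen/merge/train/accept loop. The sketch below strings together the helpers defined in the earlier snippets, all of them illustrative; evaluate_accuracies is a hypothetical stand-in, passed in by the caller, for the cross-validated accuracy tests of S206/S207, and the single merged class "G" mirrors the simple case above:

    from sklearn.svm import LinearSVC

    def build_classifiers(X, labels, evaluate_accuracies,
                          t1=0.98, t2=0.95, max_rounds=3):
        """S204-S208 as a loop: screen overlapping classes, merge them,
        train both classifiers, accept once all accuracies clear the
        thresholds."""
        for _ in range(max_rounds):
            suspects = screen_low_accuracy_classes(X, labels)      # S204
            if len(suspects) < 2:
                break                                 # nothing to merge
            merge_map = {c: "G" for c in suspects}                 # S205
            merged = [merge_map.get(y, y) for y in labels]
            first_clf = LinearSVC().fit(X, merged)                 # S206
            g_idx = [i for i, y in enumerate(labels) if y in merge_map]
            second_clf = LinearSVC().fit(
                X[g_idx], [labels[i] for i in g_idx])              # S207
            p1, p2, p1_g = evaluate_accuracies(first_clf, second_clf)
            if classifiers_accepted(p1, p2, p1_g, t1, t2):         # S208
                return first_clf, second_clf, merge_map
        return None  # thresholds not met within max_rounds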
Accordingly, as shown in Fig. 3, an embodiment of the present invention also provides a text classification method, which classifies with the classifiers generated by any text classifier generation method provided by the preceding embodiments. The classification method includes:
S31, inputting a set of texts to be classified into the first classifier to obtain a first classification result;
S32, inputting the texts whose class in the first classification result is the new sample class into the second classifier corresponding to that new sample class, to obtain a second classification result.
The text classification method provided by the embodiments of the present invention applies the text classifier generated by any text classifier generation method provided by the preceding embodiments. In this way, the training of the first classifier accurately separates the original sample classes that exhibit sample overlap from those that do not, and the training of the second classifier separates the overlapping original sample classes from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the accuracy of text classification.
Optionally, the text classification method provided by the embodiment of the present invention may further include: taking the results in the first classification result whose classes are unmerged original sample classes, together with the second classification result, as the final classification result of the texts to be classified.
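A sketch of this two-stage prediction under the same illustrative assumptions as the training snippets: first-stage predictions that land in the merged class G are routed through the second classifier, and predictions of unmerged classes pass straight into the final result:

    def classify(texts_to_classify, vectorizer, first_clf, second_clf):
        """S31/S32: route merged-class predictions to the second stage."""
        X_new = vectorizer.transform(texts_to_classify)
        first_pred = first_clf.predict(X_new)         # B, D, or merged G
        final = []
        for i, cls in enumerate(first_pred):
            if cls == "G":                            # merged class: refine
                final.append(second_clf.predict(X_new[i])[0])  # A or C
            else:                                     # unmerged class: keep
                final.append(cls)
        return final

    # classify(["refund this order please"], vectorizer, first_clf, second_clf)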
The text classification method provided by the embodiments of the present invention applies the text classifier generated by any text classifier generation method provided by the preceding embodiments; the specific classification process and principle have been described in detail above and are not repeated here.
Accordingly, as shown in Fig. 4, an embodiment of the present invention also provides a text classifier generation apparatus, including:
a merging unit 41, configured to merge at least two original sample classes in a training sample set that exhibit sample overlap to obtain a new sample class, where the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes;
a first training unit 42, configured to train a first classifier on the training sample set after the merge operation;
a second training unit 43, configured to train a second classifier on the training samples belonging to the new sample class and the original sample classes to which those training samples belong, where the second classifier is configured to re-classify the texts whose class in the first classifier's results is the new sample class, dividing them into the corresponding original sample classes.
The text classifier generation apparatus provided by the embodiments of the present invention adopts a hierarchical classifier generation scheme when training a text classifier from training samples: first, two or more original sample classes in the training sample set that exhibit sample overlap are merged into a new sample class and a first classifier is trained on the merged training sample set; then, finer-grained classification training is performed on the new sample class to obtain a second classifier. In this way, the training of the first classifier accurately separates the original sample classes that exhibit sample overlap from those that do not, and the training of the second classifier separates the overlapping original sample classes from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the classification accuracy of the text classifier.
Specifically, sample overlap in the embodiments of the present invention means that, in the provided training sample set, the classes of some sample data are not clear-cut; for example, if sample data that should belong to class A appear in class B, classes A and B are considered to exhibit sample overlap. Since a text classifier is trained from these training sample sets, such sample overlap inevitably affects the classification accuracy of the resulting text classifier. The text classifier generation apparatus provided by the embodiments of the present invention effectively mitigates such overlap between sample classes, as detailed below.
Optionally, the merging unit may be specifically configured to merge all original sample classes that exhibit sample overlap into a single new sample class.
Optionally, when the merging unit 41 performs the merge operation, the training sample set includes multiple original sample classes, which correspond to the target classes desired by the user, and each training sample belongs to one of the multiple original sample classes. In one embodiment of the present invention, the training sample set includes four original sample classes A, B, C, and D, where sample overlap exists between original sample class A and original sample class C; A and C can then be merged to generate a new sample class G, and the merged training sample set includes original sample classes B and D and the new sample class G.
Accordingly, the first training unit 42 trains the first classifier on the merged training sample set, that is, divides all elements of the training sample set among the sample classes B, D, and G. Such classification training yields the first classifier.
Of course, there may be more than two original sample classes that exhibit sample overlap, and more than one new sample class may result from the merge; as long as the merge helps correct the sample overlap, classification accuracy can be improved, and the embodiments of the present invention impose no limitation on this.
For example, if sample overlap exists among all four original sample classes A, B, C, and D in the above embodiment, or if sample overlap exists between A and B and, at the same time, between C and D, the merge operation may either combine the four original sample classes A, B, C, and D into one new sample class G1 (i.e., merge all original sample classes that exhibit sample overlap into a single new sample class), or merge A and B into a new sample class G2 and C and D into a new sample class G3 (i.e., combine the overlapping original sample classes into multiple new sample classes).
Optionally, the first training unit 42 is specifically configured to:
train on the merged training sample set with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB), support vector machine (SVM), K-nearest-neighbor (KNN), and random forest classification algorithms.
It should be noted that the original sample classes that exhibit sample overlap may already be known, or may need to be identified. When the training sample set contains a very large number of samples and original sample classes, the text classifier generation apparatus may further include a screening unit, configured to:
before the at least two original sample classes in the training sample set that exhibit sample overlap are merged to obtain the new sample class, train on the training sample set a third classifier whose classes are the multiple original sample classes, test the classification accuracy of each class of the third classifier, and preliminarily screen out the original sample classes whose classification accuracy is below a third threshold;
and identify, among the preliminarily screened original sample classes, the original sample classes that exhibit sample overlap.
The principle of this identification method is that sample overlap between classes in the training sample set inevitably lowers the classification accuracy of the trained classifier, so classification accuracy can be used to preliminarily screen the training sample set for overlap, after which the original sample classes that exhibit sample overlap are identified among the screened classes.
Here, the original sample classes that exhibit sample overlap can be identified by manually verifying, or machine-matching the data of, the original sample classes below the third threshold.
For example, the higher the third threshold is set, the more sensitive the detection of sample overlap. If, among the original sample classes of the training sample set, sample data that should belong to class A appear in class B while class A contains no samples that belong to other classes, the classification accuracy of class B in the third classifier falls below the third threshold while that of class A is unaffected; manual verification or machine data matching then shows that classes A and B of the training sample set exhibit sample overlap. If, however, sample data that should belong to class A appear in class B and sample data that belong to class B also appear in class A, the classification accuracies of both class A and class B in the third classifier fall below the third threshold, and manual verification or machine data matching can determine whether classes A and B also overlap with other classes.
The classification accuracy of each class of the third classifier can specifically be tested by cross validation over the sample population.
Train after the first grader, the second training unit 43 can train to form the second grader.Specifically, can be with Original according to belonging to the training sample and the training sample for belonging to the new samples classification that belong to the new samples classification Beginning sample class is trained, and respectively obtains the second grader of each new samples classification.
Continuing the earlier embodiment, if sample cross exists between original sample class A and original sample class C, a new sample class G is generated after A and C are merged. Classification training is performed with the merged training sample set, each element of the training sample set is assigned to sample class B, D or G, and the first classifier is obtained. After the first classifier is obtained, classification training is then performed within new sample class G, subdividing the elements in G into classes A and C, where A and C are the original sample classes to which the training samples belonging to new sample class G belong.
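As a minimal sketch of the two-stage training in this example (scikit-learn assumed; the toy documents, the labels and the choice of LinearSVC are illustrative, not prescribed by the disclosure):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline

    docs = ["pay my bill", "bill is wrong", "refund my bill",
            "ship my parcel", "parcel is late", "track my parcel",
            "open account", "close account", "reset account"]
    orig = ["A", "C", "A", "B", "B", "B", "D", "D", "D"]

    MERGED = {"A": "G", "C": "G"}  # A and C have sample cross, merged into G
    merged = [MERGED.get(y, y) for y in orig]

    # First classifier: assigns each training sample to B, D or G.
    first_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    first_clf.fit(docs, merged)

    # Second classifier: trained only within new sample class G,
    # subdividing its samples back into original classes A and C.
    g_docs = [d for d, y in zip(docs, merged) if y == "G"]
    g_orig = [y for y in orig if y in MERGED]
    second_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    second_clf.fit(g_docs, g_orig)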
In this way, through the training of the second classifier, the original sample classes with sample cross can be separated from each other, and finer classification training is performed within a more specific scope, which substantially improves the classification accuracy of the text classifier.
Optionally, the second training unit 43 is specifically configured to obtain the second classifier by training with at least one of the following classification algorithms: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm. The classification categories included in the second classifier correspond to the original sample classes to which the training samples in the new sample class belong.
Further, the training sample set in the above embodiments may be obtained from a training corpus through certain processing. To obtain the training sample set, the text classifier generating apparatus provided in an embodiment of the present invention may further include:
a preprocessing unit, configured to preprocess the training corpus to filter the training corpus and/or unify its format before the at least two original sample classes with sample cross in the training sample set are merged to obtain the new sample class;
a word segmentation unit, configured to perform word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments, whose concrete form may be any of speech, text, images and so on. The training corpus is first preprocessed to unify its format into text format and to filter out invalid formats, and the result is saved for later use. Then, word segmentation is performed on the preprocessed training corpus according to the word segmentation dictionary to obtain the training sample set.
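For illustration, assuming the open-source jieba segmenter (not named in the disclosure) and a whitespace-based filtering rule invented for the example, the preprocessing and segmentation steps might be sketched as:

    import jieba

    raw_corpus = ["  今天天气怎么样？", "", "帮我查一下订单状态  "]

    # Preprocessing: unify the format (strip whitespace here) and filter
    # out invalid entries (empty strings here); converting speech or
    # images to text would happen before this step.
    clean = [s.strip() for s in raw_corpus if s.strip()]

    # Word segmentation according to the segmentation dictionary; jieba
    # uses its built-in dictionary, extendable via jieba.load_userdict.
    training_samples = [jieba.lcut(s) for s in clean]
    print(training_samples)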
Optionally, the apparatus may further include a new word discovery unit, configured to perform a new word discovery operation on the training corpus and add the discovered new words to the word segmentation dictionary. In this way, new words obtained by the new word discovery method can be used to update the word segmentation dictionary, and subsequent word segmentation is performed according to the updated dictionary. The dictionary is thus continuously improved, which effectively raises the accuracy of word segmentation.
Optionally, the new word discovery operation may be realized in one or more of the following manners: mutual information, co-occurrence probability and information entropy.
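As one hedged illustration of the mutual-information route (the corpus and threshold are invented for the example; a practical system would combine this with co-occurrence probability and information entropy as stated above):

    import math
    from collections import Counter

    corpus = "机器学习是人工智能的分支机器学习依赖数据人工智能改变世界"

    chars = Counter(corpus)
    pairs = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_c, n_p = sum(chars.values()), sum(pairs.values())

    PMI_THRESHOLD = 2.0  # illustrative
    candidates = []
    for pair, c in pairs.items():
        a, b = pair[0], pair[1]
        # Pointwise mutual information of the adjacent character pair.
        pmi = math.log((c / n_p) / ((chars[a] / n_c) * (chars[b] / n_c)))
        if c > 1 and pmi > PMI_THRESHOLD:
            candidates.append(pair)
    print("new word candidates for the dictionary:", candidates)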
Further, in order to determine the classification effect of the generated first classifier and second classifier, the apparatus may further include:
a first test unit, configured to test the classification accuracy of each classification category in the first classifier;
a second test unit, configured to test the classification accuracy of each classification category in the second classifier;
wherein the classification accuracies of the first classifier are respectively P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample classes in the training sample set after the merge operation;
the classification accuracies of the second classifier are respectively P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample classes to which the training samples in the new sample class belong; P1h is the classification accuracy of the new sample class in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample classes;
a detection unit, configured to detect whether the classification accuracy of each classification category in the first classifier is greater than a first probability threshold, and whether the classification accuracy of each classification category in the second classifier is greater than a second probability threshold;
a determining unit, configured to determine that the first classifier and the second classifier are trained successfully if the detection result of the detection unit is yes.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. During testing, if the classification accuracy of each classification category in the classification results of the first classifier is greater than 0.98, and the classification accuracy of each classification category in the classification results of the second classifier is greater than 0.95, then the classification accuracy of the text classifier generated by the text classifier generation method provided by the embodiments of the present invention has met the user's requirement.
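Sketched with the same illustrative thresholds (all accuracy figures below are invented for the example, and the merged-class names follow the A/C/G embodiment above):

    # First-classifier accuracies P1j per merged category (B, D, G).
    P1 = {"B": 0.99, "D": 0.985, "G": 0.99}
    # Second-stage accuracies P2k within new sample class G (classes A, C).
    P2 = {"A": 0.97, "C": 0.96}

    FIRST_THRESHOLD, SECOND_THRESHOLD = 0.98, 0.95

    first_ok = all(p > FIRST_THRESHOLD for p in P1.values())
    # The overall second-classifier accuracy for class k is P1h * P2k,
    # with h indexing the new sample class (here h corresponds to G).
    second_ok = all(P1["G"] * p > SECOND_THRESHOLD for p in P2.values())
    print("training successful:", first_ok and second_ok)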
Optionally, when detecting classification accuracy, data whose classification category attributes are the same as or similar to those of the training sample set may be used for testing, these data being labeled with the relevant classification categories. Such data may be constructed by an algorithm, or may be obtained in a cross validation manner.
Optionally, the first test unit is specifically configured to test the classification accuracy of the first classifier in a cross validation manner based on the sample population; the second test unit is specifically configured to test the classification accuracy of the second classifier in a cross validation manner based on the sample population.
Here, the sample population refers to all the sample data related to the classification task. Cross validation based on the sample population may take one part of the sample population as the training sample set and another part as the test sample set, for example taking 60% to 90% (such as 80%) of the samples in the sample population as the training sample set and using the remaining samples as the text to be classified for testing. The sample population includes the multiple original sample classes, and each sample belongs to one of them.
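A minimal sketch of such a population-based split, assuming the scikit-learn library (the 80/20 ratio is the example figure given above; the placeholder documents and labels are invented):

    from sklearn.model_selection import train_test_split

    population_docs = ["doc %d" % i for i in range(100)]  # placeholder texts
    population_labels = ["A", "B", "C", "D"] * 25         # placeholder classes

    train_docs, test_docs, train_y, test_y = train_test_split(
        population_docs, population_labels,
        train_size=0.8, stratify=population_labels, random_state=0)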
Optionally, the apparatus may further include a returning unit, configured to, if the detection result of the detection unit is no, re-perform the merge operation on the at least two original sample classes with sample cross in the training sample set and re-perform classifier training, until the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and the classification accuracy of each classification category in the second classifier is greater than the second probability threshold.
That is, after checking the effect of the first classifier and the second classifier, if the classification accuracy of any classification category is found to be below its corresponding probability threshold, the first classifier and the second classifier have not yet met the user's requirement. The merge operation on the original sample classes with sample cross and the classifier training therefore need to be performed again, until the accuracies reach the above threshold requirements.
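This loop might be sketched as follows, where merge_crossed_classes, train_classifiers and per_class_accuracies are hypothetical helpers standing in for the operations already described, not functions defined by the disclosure:

    def train_until_thresholds(samples, t1=0.98, t2=0.95, max_rounds=10):
        """Repeat merging and training until both thresholds are met."""
        for _ in range(max_rounds):
            merged = merge_crossed_classes(samples)       # hypothetical helper
            first, second = train_classifiers(merged)     # hypothetical helper
            p1, p2 = per_class_accuracies(first, second)  # hypothetical helper
            if all(v > t1 for v in p1.values()) and \
               all(v > t2 for v in p2.values()):
                return first, second
        raise RuntimeError("accuracy thresholds not reached")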
In one embodiment of the present invention, in order to meet different requirements on classifier accuracy, the apparatus may optionally further include an adjustment unit, configured to adjust the first probability threshold and the second probability threshold so as to screen out first classifiers and second classifiers of different classification accuracies.
Accordingly, as shown in Fig. 5, an embodiment of the present invention further provides a text classification apparatus, which classifies with the classifiers generated by any text classifier generating apparatus provided in the foregoing embodiments. The classification apparatus includes:
a first input unit 51, configured to input a text set to be classified into the first classifier to obtain first classification results;
a second input unit 52, configured to input the text to be classified whose classification category in the first classification results is the new sample class into the second classifier corresponding to that new sample class, to obtain second classification results.
The text classification apparatus provided by the embodiments of the present invention applies the text classifier generated by any text classifier generating apparatus of the foregoing embodiments. Through the training of the first classifier, the original sample classes with sample cross can be accurately distinguished from the original sample classes without sample cross; through the training of the second classifier, the original sample classes with sample cross can be separated from each other, and finer classification training is performed within a more specific scope, thereby substantially improving the accuracy of text classification.
Further, the text classification apparatus also includes:
a result output unit, configured to take the classification results whose classification categories in the first classification results are original sample classes that were not merged, together with the second classification results, as the final classification results of the text to be classified.
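Reusing the first_clf and second_clf names from the training sketch above (illustrative only; second_clfs maps each new sample class to its second classifier), the two-stage classification and result assembly might look like:

    def classify(texts, first_clf, second_clfs):
        """Classify each text in two stages and assemble the final results."""
        final = []
        for t in texts:
            c = first_clf.predict([t])[0]           # first classification result
            if c in second_clfs:                    # c is a merged new sample class
                c = second_clfs[c].predict([t])[0]  # refine to an original class
            final.append(c)                         # unmerged classes pass through
        return final

    # e.g. classify(["refund my bill"], first_clf, {"G": second_clf})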
Accordingly, an embodiment of the present invention further provides a computer device, including a processor and a memory. The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform any text classifier generation method provided by the foregoing embodiments. The corresponding technical effects can likewise be achieved; they have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer device, including a processor and a memory. The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform any text classification method provided by the foregoing embodiments. The corresponding technical effects can likewise be achieved; they have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium in which instructions are stored; when the instructions run, any text classifier generation method provided by the foregoing embodiments is performed. The corresponding technical effects can likewise be achieved; they have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium in which instructions are stored; when the instructions run, any text classification method provided by the foregoing embodiments is performed. The corresponding technical effects can likewise be achieved; they have been described in detail above and are not repeated here.
It should be noted that, herein, the terms "comprise", "include" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or apparatus including a series of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or apparatus. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or apparatus that includes that element.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the protection scope of the present invention.

Claims (38)

1. A text classifier generation method, characterized by comprising:
merging at least two original sample classes that have sample cross in a training sample set to obtain a new sample class, wherein the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes;
training a first classifier according to the training sample set after the merge operation;
training a second classifier according to the training samples belonging to the new sample class and the original sample classes to which the training samples belonging to the new sample class belong, wherein the second classifier is used to re-classify the text to be classified whose classification category in the classification results of the first classifier is the new sample class, so as to assign it to the corresponding original sample class.
2. The method according to claim 1, characterized in that training the first classifier according to the training sample set after the merge operation comprises:
training with at least one of the following classification algorithms on the training sample set after the merge operation to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm.
3. The method according to claim 1, characterized in that training the second classifier according to the training samples belonging to the new sample class and the original sample classes to which the training samples belonging to the new sample class belong comprises:
training according to the training samples belonging to each new sample class and the original sample classes to which those training samples belong, to obtain a second classifier for each new sample class respectively.
4. The method according to claim 1, characterized in that
the training samples belonging to the new sample class and the original sample classes to which the training samples belonging to the new sample class belong are trained with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm; the classification categories included in the second classifier correspond to the original sample classes to which the training samples in the new sample class belong.
5. The method according to claim 1, characterized in that, before the at least two original sample classes with sample cross in the training sample set are merged to obtain the new sample class, the method further comprises:
preprocessing a training corpus to filter the training corpus and/or unify its format;
performing word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
6. The method according to claim 5, characterized in that the training corpus includes sentences and/or text fragments.
7. The method according to claim 5, characterized by further comprising: performing a new word discovery operation on the training corpus and adding the discovered new words to the word segmentation dictionary.
8. The method according to claim 7, characterized in that the new word discovery operation is realized in at least one of the following manners: mutual information, co-occurrence probability and information entropy.
9. The method according to claim 1, characterized by further comprising:
testing the classification accuracy of each classification category in the first classifier;
testing the classification accuracy of each classification category in the second classifier;
wherein the classification accuracies of the first classifier are respectively P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample classes in the training sample set after the merge operation;
the classification accuracies of the second classifier are respectively P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample classes to which the training samples in the new sample class belong; P1h is the classification accuracy of the new sample class in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample classes;
detecting whether the classification accuracy of each classification category in the first classifier is greater than a first probability threshold, and whether the classification accuracy of each classification category in the second classifier is greater than a second probability threshold;
if yes, determining that the first classifier and the second classifier are trained successfully.
10. The method according to claim 9, characterized in that, after detecting whether the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and whether the classification accuracy of each classification category in the second classifier is greater than the second probability threshold, the method further comprises:
if not, re-performing the merge operation on the at least two original sample classes with sample cross in the training sample set and re-performing classifier training, until the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and the classification accuracy of each classification category in the second classifier is greater than the second probability threshold.
11. The method according to claim 10, characterized by further comprising:
adjusting the first probability threshold and the second probability threshold to screen out first classifiers and second classifiers of different classification accuracies.
12. The method according to claim 9, characterized in that
the classification accuracy of the first classifier is tested in a cross validation manner based on the sample population;
the classification accuracy of the second classifier is tested in a cross validation manner based on the sample population.
13. The method according to claim 12, characterized in that the cross validation manner based on the sample population includes: taking 60% to 90% of the samples in the sample population as the training sample set and using the remaining samples as the text to be classified for testing, wherein the sample population includes the multiple original sample classes and each sample belongs to one of the multiple original sample classes.
14. The method according to claim 1, characterized in that merging the at least two original sample classes with sample cross into the new sample class includes:
merging all original sample classes with sample cross into one new sample class.
15. The method according to any one of claims 1 to 14, characterized in that, before the at least two original sample classes with sample cross in the training sample set are merged to obtain the new sample class, the method further comprises:
training, according to the training sample set, a third classifier whose classification categories are the multiple original sample classes, testing the classification accuracy of each classification category of the third classifier, and preliminarily screening out the original sample classes whose classification accuracy is below a third threshold;
identifying, among the preliminarily screened original sample classes, the original sample classes that have sample cross.
16. A text classification method, characterized by classifying with the classifiers generated by the text classifier generation method according to any one of claims 1 to 15, the classification method comprising:
inputting a text set to be classified into the first classifier to obtain first classification results;
inputting the text to be classified whose classification category in the first classification results is the new sample class into the second classifier corresponding to that new sample class, to obtain second classification results.
17. The method according to claim 16, characterized by further comprising:
taking the classification results whose classification categories in the first classification results are original sample classes that were not merged, together with the second classification results, as the final classification results of the text to be classified.
18. A text classifier generating apparatus, characterized by comprising:
a merging unit, configured to merge at least two original sample classes that have sample cross in a training sample set to obtain a new sample class, wherein the training sample set includes multiple original sample classes and each training sample belongs to one of the multiple original sample classes;
a first training unit, configured to train a first classifier according to the training sample set after the merge operation;
a second training unit, configured to train a second classifier according to the training samples belonging to the new sample class and the original sample classes to which the training samples belonging to the new sample class belong, wherein the second classifier is used to re-classify the text to be classified whose classification category in the classification results of the first classifier is the new sample class, so as to assign it to the corresponding original sample class.
19. The apparatus according to claim 18, characterized in that the first training unit is specifically configured to:
train with at least one of the following classification algorithms on the training sample set after the merge operation to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm.
20. The apparatus according to claim 18, characterized in that the second training unit is specifically configured to:
train according to the training samples belonging to each new sample class and the original sample classes to which those training samples belong, to obtain a second classifier for each new sample class respectively.
21. The apparatus according to claim 18, characterized in that
the second training unit is specifically configured to obtain the second classifier by training with at least one of the following classification algorithms: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbor (KNN) classification algorithm and the random forest classification algorithm; the classification categories included in the second classifier correspond to the original sample classes to which the training samples in the new sample class belong.
22. The apparatus according to claim 18, characterized by further comprising:
a preprocessing unit, configured to preprocess a training corpus to filter the training corpus and/or unify its format before the at least two original sample classes with sample cross in the training sample set are merged to obtain the new sample class;
a word segmentation unit, configured to perform word segmentation on the preprocessed training corpus according to a word segmentation dictionary to obtain the training sample set.
23. The apparatus according to claim 22, characterized in that the training corpus includes sentences or text fragments.
24. The apparatus according to claim 22, characterized by further comprising a new word discovery unit, configured to perform a new word discovery operation on the training corpus and add the discovered new words to the word segmentation dictionary.
25. The apparatus according to claim 24, characterized in that the new word discovery operation is realized in at least one of the following manners: mutual information, co-occurrence probability and information entropy.
26. The apparatus according to claim 18, characterized by further comprising:
a first test unit, configured to test the classification accuracy of each classification category in the first classifier;
a second test unit, configured to test the classification accuracy of each classification category in the second classifier;
wherein the classification accuracies of the first classifier are respectively P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample classes in the training sample set after the merge operation;
the classification accuracies of the second classifier are respectively P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample classes to which the training samples in the new sample class belong; P1h is the classification accuracy of the new sample class in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample classes;
a detection unit, configured to detect whether the classification accuracy of each classification category in the first classifier is greater than a first probability threshold, and whether the classification accuracy of each classification category in the second classifier is greater than a second probability threshold;
a determining unit, configured to determine that the first classifier and the second classifier are trained successfully if the detection result of the detection unit is yes.
27. The apparatus according to claim 26, characterized by further comprising:
a returning unit, configured to, if the detection result of the detection unit is no, re-perform the merge operation on the at least two original sample classes with sample cross in the training sample set and re-perform classifier training, until the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and the classification accuracy of each classification category in the second classifier is greater than the second probability threshold.
28. The apparatus according to claim 27, characterized by further comprising:
an adjustment unit, configured to adjust the first probability threshold and the second probability threshold to screen out first classifiers and second classifiers of different classification accuracies.
29. The apparatus according to claim 27, characterized in that
the first test unit is specifically configured to test the classification accuracy of the first classifier in a cross validation manner based on the sample population;
the second test unit is specifically configured to test the classification accuracy of the second classifier in a cross validation manner based on the sample population.
30. The apparatus according to claim 29, characterized in that the cross validation manner based on the sample population includes: taking 60% to 90% of the samples in the sample population as the training sample set and using the remaining samples as the text to be classified for testing, wherein the sample population includes the multiple original sample classes and each sample belongs to one of the multiple original sample classes.
31. The apparatus according to claim 18, characterized in that the merging unit is specifically configured to merge all original sample classes with sample cross into one new sample class.
32. The apparatus according to any one of claims 18 to 31, characterized by further comprising a screening unit, configured to, before the at least two original sample classes with sample cross in the training sample set are merged to obtain the new sample class, train, according to the training sample set, a third classifier whose classification categories are the multiple original sample classes, test the classification accuracy of each classification category of the third classifier, and preliminarily screen out the original sample classes whose classification accuracy is below a third threshold;
and identify, among the preliminarily screened original sample classes, the original sample classes that have sample cross.
33. A text classification apparatus, characterized by classifying with the classifiers generated by the text classifier generating apparatus according to any one of claims 18 to 32, the classification apparatus comprising:
a first input unit, configured to input a text set to be classified into the first classifier to obtain first classification results;
a second input unit, configured to input the text to be classified whose classification category in the first classification results is the new sample class into the second classifier corresponding to that new sample class, to obtain second classification results.
34. The apparatus according to claim 33, characterized by further comprising:
a result output unit, configured to take the classification results whose classification categories in the first classification results are original sample classes that were not merged, together with the second classification results, as the final classification results of the text to be classified.
35. A computer device, characterized by including a processor and a memory, wherein the memory is used to store computer instructions and the processor is used to run the computer instructions stored in the memory, so as to perform the text classifier generation method according to any one of claims 1 to 15.
36. A computer device, characterized by including a processor and a memory, wherein the memory is used to store computer instructions and the processor is used to run the computer instructions stored in the memory, so as to perform the text classification method according to claim 16 or 17.
37. A computer-readable storage medium, characterized in that instructions are stored in the storage medium, and when the instructions run, the text classifier generation method according to any one of claims 1 to 15 is performed.
38. A computer-readable storage medium, characterized in that instructions are stored in the storage medium, and when the instructions run, the text classification method according to claim 16 or 17 is performed.