CN107273500A - Text classifier generation method, text classification method, device and computer equipment - Google Patents
Text classifier generation method, text classification method, device and computer equipment
- Publication number
- CN107273500A CN107273500A CN201710457280.2A CN201710457280A CN107273500A CN 107273500 A CN107273500 A CN 107273500A CN 201710457280 A CN201710457280 A CN 201710457280A CN 107273500 A CN107273500 A CN 107273500A
- Authority
- CN
- China
- Prior art keywords
- classification
- sample
- classifier
- training
- new samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention discloses a text classifier generation method, a text classification method, a device, and computer equipment, to solve the poor classification performance caused by sample overlap between categories in a training sample set. The text classifier generation method includes: merging at least two original sample categories in the training sample set that exhibit sample overlap to obtain a new sample category, wherein the training sample set includes multiple original sample categories and each training sample belongs to one of them; training a first classifier on the training sample set after the merge operation; and training a second classifier on the training samples belonging to the new sample category and the original sample categories to which those samples belong. The second classifier reclassifies the texts whose category in the first classifier's results is the new sample category, so as to divide them into the corresponding original sample categories.
Description
Technical field
The present invention relates to the field of communication technology, and in particular to a text classifier generation method, a text classification method, a device, and computer equipment.
Background technology
In text classification, the quality of the training samples largely determines the performance of the classifier. For example, when sample overlap occurs between classes in the training samples, the two or more overlapping classes inevitably degrade the overall classifier: their classification accuracy is low, and the overall classification effect is poor.
The content of the invention
The technical problem to be solved by the present invention is to provide a text classifier generation method, a text classification method, a device, and computer equipment, so as to solve the poor classification performance caused by sample overlap in the training sample set in the prior art.
In one aspect, the present invention provides a text classifier generation method, including: merging at least two original sample categories in a training sample set that exhibit sample overlap to obtain a new sample category, wherein the training sample set includes multiple original sample categories and each training sample belongs to one of them; training a first classifier on the training sample set after the merge operation; and training a second classifier on the training samples belonging to the new sample category and the original sample categories to which those samples belong. The second classifier reclassifies the texts whose category in the first classifier's results is the new sample category, so as to divide them into the corresponding original sample categories.
Optionally, training the first classifier on the training sample set after the merge operation includes training with at least one of the following classification algorithms: naive Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest.
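As a minimal sketch (not the patent's own implementation), the first classifier could be trained on the merged labels with one of the algorithms listed above, here scikit-learn's naive Bayes; the bag-of-words count vectors and category names are illustrative assumptions:

```python
from sklearn.naive_bayes import MultinomialNB

# Illustrative bag-of-words count vectors; overlapping categories A and C
# have already been merged into the new sample category "G".
X_train = [[5, 0], [4, 1], [0, 5], [1, 4]]
y_merged = ["G", "G", "B", "B"]

first_classifier = MultinomialNB().fit(X_train, y_merged)
print(first_classifier.predict([[6, 0]])[0])  # G
```

Swapping in `sklearn.svm.SVC`, `KNeighborsClassifier`, or `RandomForestClassifier` would cover the other algorithms named in this paragraph.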
Optionally, training the second classifier includes: training, for each new sample category, on the training samples belonging to that new sample category and the original sample categories to which they belong, so that one second classifier is obtained for each new sample category.
Optionally, the second classifier is trained on the training samples belonging to the new sample category, labeled with their original sample categories, using at least one of the following classification algorithms: naive Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest. The categories output by the second classifier correspond to the original sample categories of the training samples within the new sample category.
Optionally, before merging the at least two overlapping original sample categories to obtain the new sample category, the method further includes: preprocessing the training corpus to filter it and/or unify its format; and performing word segmentation on the preprocessed corpus according to a segmentation dictionary to obtain the training sample set.
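The patent does not fix a segmentation algorithm; as one hedged sketch, a dictionary-based forward maximum-matching segmenter (a common choice for this task, shown here with toy English tokens rather than real corpus data) could turn a preprocessed corpus string into training tokens:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, preferring the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words

seg_dict = {"text", "classifier", "training", "sample"}  # toy dictionary
print(forward_max_match("textsample", seg_dict, max_len=10))  # ['text', 'sample']
```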
Optionally, the training corpus includes sentences and/or text fragments.
Optionally, the method further includes: performing new-word discovery on the training corpus and adding the discovered words to the segmentation dictionary.
Optionally, the new-word discovery is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
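As an illustration of the mutual-information criterion (one of the three modes listed above), the pointwise mutual information of an adjacent character pair can be computed from corpus counts; the counts below are assumptions for illustration, not data from the patent:

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of an adjacent pair:
    log P(xy) / (P(x) * P(y)). A high PMI suggests the pair forms a word."""
    p_xy = count_xy / total
    return math.log(p_xy / ((count_x / total) * (count_y / total)))

# Toy counts from a corpus of 1000 character positions.
strong = pmi(count_xy=50, count_x=60, count_y=55, total=1000)
weak = pmi(count_xy=2, count_x=400, count_y=300, total=1000)
print(strong > weak)  # True: the strongly co-occurring pair scores higher
```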
Optionally, the method further includes: testing the classification accuracy of each category of the first classifier, and testing the classification accuracy of each category of the second classifier. The classification accuracies of the first classifier are denoted P1j, where j is an integer with 1 ≤ j ≤ m and m is the number of sample categories in the training sample set after the merge operation. The classification accuracies of the second classifier are P1h·P2k, where k is an integer with 1 ≤ k ≤ n and n is the number of original sample categories covered by the new sample category; P1h is the first classifier's accuracy for that new sample category, h being an integer with 1 ≤ h ≤ g, where g is the number of new sample categories. The method then detects whether the accuracy of every category of the first classifier exceeds a first probability threshold and the accuracy of every category of the second classifier exceeds a second probability threshold; if so, the first classifier and the second classifier are determined to be trained successfully.
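A quick arithmetic check of the cascaded accuracy P1h·P2k, with illustrative values (not values from the patent):

```python
p1_h = 0.90  # assumed first-classifier accuracy for a new sample category G
p2_k = 0.95  # assumed second-classifier accuracy for original category A within G

# A text is labeled A correctly only if both stages are correct,
# so the compound accuracy is the product P1h * P2k.
cascaded = p1_h * p2_k
print(round(cascaded, 3))  # 0.855
```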
Optionally, after detecting whether the accuracies of the first classifier all exceed the first probability threshold and those of the second classifier all exceed the second probability threshold, the method further includes: if not, re-merging at least two overlapping original sample categories in the training sample set and retraining the classifiers, until every category accuracy of the first classifier exceeds the first probability threshold and every category accuracy of the second classifier exceeds the second probability threshold.
Optionally, the method further includes: adjusting the first probability threshold and the second probability threshold so as to screen out first and second classifiers of different classification accuracies.
Optionally, the classification accuracy of the first classifier is tested by cross validation over the sample population, and the classification accuracy of the second classifier is likewise tested by cross validation over the sample population.
Optionally, the cross validation over the sample population includes: taking 60% to 90% of the sample population as the training sample set and testing with the remaining samples as the texts to be classified; the sample population includes the multiple original sample categories, each sample belonging to one of them.
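The 60%–90% hold-out split described above can be sketched in a few lines; the 80% fraction and fixed seed are arbitrary choices for illustration:

```python
import random

def split_population(samples, train_fraction=0.8, seed=0):
    """Hold-out split in the 60%-90% range described in the text."""
    assert 0.6 <= train_fraction <= 0.9
    pool = samples[:]
    random.Random(seed).shuffle(pool)  # fixed seed for reproducibility
    cut = int(len(pool) * train_fraction)
    return pool[:cut], pool[cut:]

train_set, test_set = split_population(list(range(100)), train_fraction=0.8)
print(len(train_set), len(test_set))  # 80 20
```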
Optionally, merging the at least two overlapping original sample categories into a new sample category includes: merging all original sample categories that exhibit sample overlap into a single new sample category.
Optionally, before merging the at least two overlapping original sample categories in the training sample set to obtain the new sample category, the method further includes: training, on the training sample set, a third classifier whose categories are the multiple original sample categories; testing the classification accuracy of each category of the third classifier; preliminarily screening out the original sample categories whose classification accuracy is below a third threshold; and identifying, among the preliminarily screened-out original sample categories, those with sample overlap.
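A hedged sketch of this preliminary screening step: compute the per-category accuracy of the third classifier from held-out predictions and flag the categories below the third threshold. The labels and the 0.8 threshold are illustrative assumptions:

```python
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Fraction of each true category's samples that were predicted correctly."""
    hit, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += (t == p)
    return {c: hit[c] / total[c] for c in total}

def screen_suspects(y_true, y_pred, third_threshold=0.8):
    """Return categories whose accuracy falls below the third threshold --
    candidates for sample-overlap inspection."""
    acc = per_class_accuracy(y_true, y_pred)
    return sorted(c for c, a in acc.items() if a < third_threshold)

y_true = ["A", "A", "B", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "A", "A", "C", "C"]  # A and B confuse each other
print(screen_suspects(y_true, y_pred))  # ['A', 'B']
```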
In another aspect, the present invention also provides a text classification method that classifies with the classifiers generated by the text classifier generation method provided by the present invention. The classification method includes: inputting the text set to be classified into the first classifier to obtain first classification results; and inputting the texts whose category in the first classification results is a new sample category into the second classifier corresponding to that new sample category, obtaining second classification results.
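The two-stage routing of this classification method can be sketched as follows; the stub lambdas and category names are placeholders standing in for the trained first and second classifiers:

```python
def classify_two_stage(texts, first_clf, second_clfs, new_categories):
    """Route texts through the cascade: texts the first classifier puts in
    a new (merged) category are reclassified by that category's second
    classifier; all other texts keep the first classifier's label."""
    results = []
    for text in texts:
        label = first_clf(text)
        if label in new_categories:
            label = second_clfs[label](text)
        results.append(label)
    return results

# Stub classifiers standing in for the trained models.
first = lambda t: "G" if "g" in t else "B"
seconds = {"G": lambda t: "A" if "a" in t else "C"}
print(classify_two_stage(["ga", "g", "b"], first, seconds, {"G"}))  # ['A', 'C', 'B']
```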
Optionally, the method further includes: taking the results in the first classification results whose categories are unmerged original sample categories, together with the second classification results, as the final classification results of the texts to be classified.
In another aspect, the present invention also provides a text classifier generating device, including: a merging unit, configured to merge at least two original sample categories in a training sample set that exhibit sample overlap to obtain a new sample category, the training sample set including multiple original sample categories, each training sample belonging to one of them; a first training unit, configured to train a first classifier on the training sample set after the merge operation; and a second training unit, configured to train a second classifier on the training samples belonging to the new sample category and the original sample categories to which they belong, the second classifier being used to reclassify the texts whose category in the first classifier's results is the new sample category, so as to divide them into the corresponding original sample categories.
Optionally, the first training unit is specifically configured to train the first classifier on the merged training sample set with at least one of the following classification algorithms: naive Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest.
Optionally, the second training unit is specifically configured to train, for each new sample category, on the training samples belonging to that category and their original sample categories, obtaining one second classifier for each new sample category.
Optionally, the second training unit is specifically configured to obtain the second classifier with at least one of the following classification algorithms: naive Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), and random forest; the categories of the second classifier correspond to the original sample categories of the training samples within the new sample category.
Optionally, the device further includes: a preprocessing unit, configured to preprocess the training corpus, before the at least two overlapping original sample categories in the training sample set are merged to obtain the new sample category, so as to filter the corpus and/or unify its format; and a segmentation unit, configured to perform word segmentation on the preprocessed corpus according to the segmentation dictionary to obtain the training sample set.
Optionally, the training corpus includes sentences or text fragments.
Optionally, the device further includes a new-word discovery unit, configured to perform new-word discovery on the training corpus and add the discovered words to the segmentation dictionary.
Optionally, the new-word discovery is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
Optionally, the device further includes: a first test unit, configured to test the classification accuracy of each category of the first classifier; a second test unit, configured to test the classification accuracy of each category of the second classifier, wherein the classification accuracies of the first classifier are P1j, j being an integer with 1 ≤ j ≤ m and m the number of sample categories in the training sample set after the merge operation, and the classification accuracies of the second classifier are P1h·P2k, k being an integer with 1 ≤ k ≤ n and n the number of original sample categories covered by the new sample category, P1h being the first classifier's accuracy for the new sample category, h an integer with 1 ≤ h ≤ g and g the number of new sample categories; a detection unit, configured to detect whether every category accuracy of the first classifier exceeds the first probability threshold and every category accuracy of the second classifier exceeds the second probability threshold; and a determining unit, configured to determine, if the detection unit's result is yes, that the first classifier and the second classifier are trained successfully.
Optionally, the device further includes: a return unit, configured to, if the detection unit's result is no, re-merge at least two overlapping original sample categories in the training sample set and retrain the classifiers, until every category accuracy of the first classifier exceeds the first probability threshold and every category accuracy of the second classifier exceeds the second probability threshold.
Optionally, the device further includes: an adjustment unit, configured to adjust the first probability threshold and the second probability threshold so as to screen out first and second classifiers of different classification accuracies.
Optionally, the first test unit is specifically configured to test the classification accuracy of the first classifier by cross validation over the sample population, and the second test unit is specifically configured to test the classification accuracy of the second classifier by cross validation over the sample population.
Optionally, the cross validation over the sample population includes: taking 60% to 90% of the sample population as the training sample set and testing with the remaining samples as the texts to be classified; the sample population includes the multiple original sample categories, each sample belonging to one of them.
Optionally, the merging unit is specifically configured to merge all original sample categories that exhibit sample overlap into a single new sample category.
Optionally, the device further includes a screening unit, configured to, before the at least two overlapping original sample categories in the training sample set are merged to obtain the new sample category, train on the training sample set a third classifier whose categories are the multiple original sample categories, test the classification accuracy of each category of the third classifier, preliminarily screen out the original sample categories whose classification accuracy is below a third threshold, and identify, among the preliminarily screened-out original sample categories, those with sample overlap.
In another aspect, the present invention also provides a text classification device that classifies with the classifiers generated by any text classifier generating device provided by the present invention. The classification device includes: a first input unit, configured to input the text set to be classified into the first classifier to obtain first classification results; and a second input unit, configured to input the texts whose category in the first classification results is a new sample category into the second classifier corresponding to that new sample category, obtaining second classification results.
Optionally, the device further includes: a result output unit, configured to take the results in the first classification results whose categories are unmerged original sample categories, together with the second classification results, as the final classification results of the texts to be classified.
In another aspect, the present invention also provides a computer device, including a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory to execute any text classifier generation method provided by the present invention.

In another aspect, the present invention also provides a computer device, including a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to run the computer instructions stored in the memory to execute any text classification method provided by the present invention.
In another aspect, the present invention also provides a computer-readable storage medium storing instructions which, when run, execute any text classifier generation method provided by the present invention.

In another aspect, the present invention also provides a computer-readable storage medium storing instructions which, when run, execute any text classification method provided by the present invention.
With the text classifier generation method, text classification method, device, and computer equipment provided by the embodiments of the present invention, the training of the first classifier accurately separates the original sample categories with sample overlap from those without, and the training of the second classifier isolates the overlapping original sample categories and performs finer classification training within a more specific scope, thereby substantially improving the classification accuracy of the text classifier.
Brief description of the drawings
Fig. 1 is a flowchart of a text classifier generation method provided by an embodiment of the present invention;
Fig. 2 is a detailed flowchart of a text classifier generation method provided by an embodiment of the present invention;
Fig. 3 is a flowchart of a text classification method provided by an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of a text classifier generating device provided by an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of a text classification device provided by an embodiment of the present invention.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein only explain the present invention and do not limit it.
As shown in Fig. 1, an embodiment of the present invention provides a text classifier generation method, including:

S11: merging at least two original sample categories in a training sample set that exhibit sample overlap to obtain a new sample category, the training sample set including multiple original sample categories, each training sample belonging to one of them;

S12: training a first classifier on the training sample set after the merge operation;

S13: training a second classifier on the training samples belonging to the new sample category and the original sample categories to which those samples belong, the second classifier being used to reclassify the texts whose category in the first classifier's results is the new sample category, so as to divide them into the corresponding original sample categories.
The text classifier generation method provided by the embodiments of the present invention adopts a hierarchical generation scheme when training a text classifier from training samples: first, two or more original sample categories in the training sample set that exhibit sample overlap are merged into a new sample category and a first classifier is trained on the merged training sample set; then finer classification training is carried out on the new sample category to obtain a second classifier. In this way, the training of the first classifier accurately separates the original sample categories with sample overlap from those without, and the training of the second classifier isolates the overlapping original sample categories and performs finer classification training within a more specific scope, thereby substantially improving the classification accuracy of the text classifier.
Specifically, sample overlap (also called class overlap or dataset overlap) in the embodiments of the present invention means that in the provided training sample set, the category membership of some sample data is not clear-cut: for example, when sample data that should belong to class A appears in class B, classes A and B are considered to have sample overlap. Since the text classifier is trained with these training sample sets, such overlap inevitably lowers the classification accuracy of the trained classifier. The text classifier generation method provided by the embodiments of the present invention effectively improves on this situation of inter-class sample overlap, as described in detail below.
In step S11, the training sample set includes multiple original sample categories, which correspond to the target categories desired by the user, and each training sample belongs to one of them. In one embodiment of the present invention, the training sample set includes four original sample categories A, B, C, and D; sample overlap exists between original sample categories A and C, so A and C are merged to generate a new sample category G, and the merged training sample set includes the original sample categories B and D and the new sample category G.
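The merge in this embodiment amounts to relabeling; a minimal sketch, assuming string labels for the four categories:

```python
# Categories A and C overlap, so both are relabeled as the new
# category G; B and D are left unchanged.
merge_map = {"A": "G", "C": "G"}

labels = ["A", "B", "C", "D", "A", "C"]
merged = [merge_map.get(label, label) for label in labels]
print(merged)  # ['G', 'B', 'G', 'D', 'G', 'G']
```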
Accordingly, in step S12, training the first classifier on the merged training sample set means dividing every element of the training sample set among the sample categories B, D, and G; such classification training yields the first classifier.
Of course, there may be more than two original sample categories with sample overlap, and the merging may produce more than one new sample category; as long as the merging helps correct the sample overlap, the classification accuracy can be improved, and the embodiments of the present invention place no limit on this.
For example, if in the above embodiment sample overlap exists among all four original sample categories A, B, C, and D, or sample overlap exists between A and B and, at the same time, between C and D, the merge operation may either combine the four original sample categories A, B, C, and D into one new sample category G1 (i.e., merge all overlapping original sample categories into a single new sample category), or merge A and B into a new sample category G2 and C and D into a new sample category G3 (i.e., combine the overlapping original sample categories into multiple new sample categories).
Optionally, the first classifier can be obtained by training the merged training sample set with one or more of the following classification algorithms: naive Bayes (NB), support vector machine (SVM), K-nearest neighbors (KNN), random forest, and the like.
It should be noted that the original sample categories with sample overlap may be known in advance or may need to be identified. When the training sample set is large and the original sample categories contain huge amounts of data, the overlapping original sample categories can be identified as follows: before merging the at least two overlapping original sample categories to obtain the new sample category, a third classifier whose categories are the multiple original sample categories is trained on the training sample set, the classification accuracy of each of its categories is tested, and the original sample categories whose classification accuracy is below a third threshold are preliminarily screened out; the original sample categories with sample overlap are then identified among those screened out.
The principle of this identification method is that sample overlap between classes of the training sample set will certainly affect the classification accuracy of the trained classifier; therefore the overlap situation of the training sample set can be preliminarily screened by classification accuracy, and the overlapping original sample categories then identified among the screened-out ones. The identification can be done by manually checking the original sample categories below the third threshold or by machine data matching.
For example, the higher the third threshold is set, the more sensitive the detection of sample crossing. Suppose that among the original sample categories of the training sample set, some sample data that should belong to class A is instead placed in class B, while class A contains no samples belonging to other classes. In that case the classification accuracy of class B under the third classifier will fall below the third threshold while the accuracy of class A is unaffected; manual verification or machine data matching will then show that sample crossing exists between class A and class B in the training sample set. If, instead, some sample data that should belong to class A is placed in class B and some sample data that should belong to class B is placed in class A, the classification accuracies of both class A and class B under the third classifier will fall below the third threshold, and manual verification or machine data matching can determine whether class A and class B also cross with other classes.
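The preliminary screening step above amounts to a simple filter over per-category accuracies. The following is a minimal sketch; the category names, accuracy values, and threshold are hypothetical examples, not values prescribed by the embodiment:

```python
def screen_suspect_categories(per_class_accuracy, third_threshold):
    """Preliminary screening: keep the original sample categories whose
    classification accuracy under the third classifier falls below the
    third threshold; these are the candidates for sample crossing."""
    return [cat for cat, acc in per_class_accuracy.items() if acc < third_threshold]
```

As in the example above, crossing from class A into class B lowers only B's accuracy, so only B survives the screen; manual verification or machine data matching then traces the crossing back to A.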
Testing the classification accuracy of each category of the third classifier may specifically be performed by cross-validation over the sample population.
After the first classifier is trained, the second classifier can be trained in step S13. Specifically, the second classifier of each new sample category can be obtained by training on the training samples belonging to that new sample category, using as targets the original sample categories to which those training samples belong.
Continuing the above embodiment: if sample crossing exists between original sample category A and original sample category C, categories A and C are merged to generate new sample category G. Classification training is then performed on the merged training sample set so that each element of the training sample set is assigned to sample category B, D, or G, yielding the first classifier. After the first classifier is obtained, classification training is performed within new sample category G to subdivide the elements of G into class A and class C, where A and C are the original sample categories to which the training samples belonging to new sample category G belong.
In this way, through the training of the second classifier, the original sample categories with sample crossing are separated from one another, and finer-grained classification training is performed within a narrower scope, substantially improving the classification accuracy of the text classifier.
Optionally, the classification algorithm used to obtain the second classifier may include one or more of: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbour (KNN) classification algorithm, and the random forest classification algorithm. The categories output by the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
Further, the training sample set in the above embodiments can be obtained by processing a training corpus. To obtain the training sample set, in one embodiment of the present invention, before the at least two original sample categories with sample crossing in the training sample set are merged to obtain the new sample category, the text classifier generation method provided by the embodiment of the present invention may further include:
preprocessing the training corpus to filter it and/or unify its format;
performing word segmentation on the preprocessed training corpus according to a segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments, and its concrete form may be any of speech, text, images, and so on. The corpus is first preprocessed to unify its format into text, invalid formats are filtered out, and the result is saved for later use. Word segmentation is then performed on the preprocessed corpus according to the segmentation dictionary to obtain the training sample set.
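Dictionary-based word segmentation can be sketched with forward maximum matching, which takes the longest dictionary word at each position. The dictionary and strings below are hypothetical, and the embodiment does not prescribe a particular matching algorithm:

```python
def segment(text, dictionary, max_word_len=4):
    """Forward maximum matching: at each position, take the longest
    substring found in the segmentation dictionary; fall back to a
    single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        # try the longest candidate first, shrinking down to one character
        for j in range(min(max_word_len, len(text) - i), 0, -1):
            if text[i:i + j] in dictionary or j == 1:
                words.append(text[i:i + j])
                i += j
                break
    return words
```

Because the dictionary drives the match, adding newly discovered words to it (as described next) directly improves segmentation.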
Further, the segmentation dictionary can be expanded. For example, a new-word discovery operation can be performed on the training corpus and the discovered new words added to the segmentation dictionary. In this way new words obtained by the new-word discovery method update the dictionary, and subsequent word segmentation is performed with the updated dictionary, so that the dictionary is continuously improved and the accuracy of word segmentation is effectively raised.
Optionally, the new-word discovery operation can be implemented by one or more of the following: mutual information, co-occurrence probability, and information entropy.
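Of these, mutual information can be sketched as follows: adjacent token pairs that co-occur far more often than their individual frequencies predict are new-word candidates. The toy corpus is hypothetical, and a real implementation would combine this with the co-occurrence probability and information entropy signals the text mentions:

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information of adjacent token pairs:
    log( p(x,y) / (p(x) * p(y)) ). High-PMI pairs bind tightly and
    are candidates for addition to the segmentation dictionary."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    return {
        (x, y): math.log((c / n_bi) / ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```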
To determine the classification performance of the generated first classifier and second classifier, the text classifier generation method provided by the embodiment of the present invention may further include:
testing the classification accuracy of each category of the first classifier;
testing the classification accuracy of each category of the second classifier;
where the classification accuracies of the first classifier are P1j, j being an integer with 1 ≤ j ≤ m and m being the number of sample categories in the training sample set after the merging operation;
the classification accuracies of the second classifier are P1h*P2k, k being an integer with 1 ≤ k ≤ n and n being the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category under the first classifier, h being an integer with 1 ≤ h ≤ g and g being the number of new sample categories;
detecting whether the classification accuracy of every category of the first classifier exceeds a first probability threshold and whether the classification accuracy of every category of the second classifier exceeds a second probability threshold;
and if so, determining that the first classifier and the second classifier have been trained successfully.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. If, during testing, the classification accuracy of every category in the results of the first classifier exceeds 0.98 and the classification accuracy of every category in the results of the second classifier exceeds 0.95, the classification accuracy of the text classifier generated by the text classifier generation method provided by the embodiment of the present invention has met the user's requirements.
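The success criterion reduces to a pair of threshold checks. In this minimal sketch, the accuracy lists are hypothetical, with the second classifier's entries understood to already be the products P1h*P2k defined above:

```python
def training_succeeded(first_accuracies, second_accuracies,
                       first_threshold=0.98, second_threshold=0.95):
    """Training succeeds only if every per-category accuracy of the first
    classifier exceeds the first probability threshold and every combined
    accuracy of the second classifier exceeds the second threshold."""
    return (all(p > first_threshold for p in first_accuracies)
            and all(p > second_threshold for p in second_accuracies))
```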
Optionally, the classification accuracy may be tested with data whose attributes are the same as or similar to those of each category of the training sample set, the data being labelled with the relevant categories. Such data can be constructed by an algorithm or obtained by cross-validation.
Specifically, the classification accuracy of the first classifier may be tested by cross-validation over the sample population, and likewise for the second classifier.
Here the sample population refers to all sample data related to the classification task. Cross-validation over the sample population takes one part of the population as the training sample set and another part as the test sample set; for example, 60% to 90% (e.g. 80%) of the sample population may be used as the training sample set and the remaining samples used as text to be classified for testing. The sample population includes the multiple original sample categories, and each sample belongs to one of them.
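The population split described above can be sketched as follows. The slice is deterministic purely for illustration; a real evaluation would shuffle and stratify, and the 80% default is just the example fraction from the text:

```python
def split_sample_population(samples, train_fraction=0.8):
    """Hold out part of the sample population: the first train_fraction
    becomes the training sample set, the remainder the test set."""
    if not 0.6 <= train_fraction <= 0.9:
        raise ValueError("the text suggests a 60%-90% training fraction")
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]
```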
Optionally, after detecting whether the classification accuracy of every category of the first classifier exceeds the first probability threshold and whether the classification accuracy of every category of the second classifier exceeds the second probability threshold, the text classifier generation method provided by embodiments of the present invention may further include: if not, re-performing the merging operation and classifier training on the at least two original sample categories with sample crossing in the training sample set, until the classification accuracy of every category of the first classifier exceeds the first probability threshold and the classification accuracy of every category of the second classifier exceeds the second probability threshold.
That is, if, after evaluating the first classifier and the second classifier, the classification accuracy of any category is found to be below its corresponding probability threshold, the first classifier and the second classifier have not yet met the user's requirements; the merging operation and classifier training must therefore be repeated on the original sample categories with sample crossing until the accuracies reach the above thresholds.
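The retrain-until-thresholds loop can be sketched with hypothetical callables for training, evaluation, and re-merging. All names are illustrative, and max_rounds guards against non-convergence, which the text does not address:

```python
def train_until_thresholds(train_fn, evaluate_fn, remerge_fn,
                           first_threshold, second_threshold, max_rounds=10):
    """Repeat merging and classifier training until every per-category
    accuracy of both classifiers clears its probability threshold."""
    for _ in range(max_rounds):
        first_clf, second_clf = train_fn()
        p1, p2 = evaluate_fn(first_clf, second_clf)
        if (all(a > first_threshold for a in p1)
                and all(a > second_threshold for a in p2)):
            return first_clf, second_clf
        remerge_fn()  # redo the merging operation before retraining
    raise RuntimeError("accuracy thresholds not reached")
```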
Optionally, to accommodate different requirements on classifier accuracy, in one embodiment of the present invention the first probability threshold and the second probability threshold can be adjusted to select first and second classifiers of different classification accuracies.
The text classifier generation method provided by the embodiment of the present invention is described in detail below through a specific embodiment. As shown in Fig. 2, the text classifier generation method provided by this embodiment may specifically include the following steps:
S201, preprocessing: unify the format of the obtained training corpus into text, filter out invalid formats, and save the result for later use;
S202, new-word discovery: find new-word candidates in the training corpus using an existing new-word discovery tool and, after manual filtering, add them to the segmentation dictionary;
S203, perform word segmentation on the preprocessed training corpus according to the segmentation dictionary;
S204, sample screening: construct a classifier over the original sample categories and test its accuracy P0 by cross-validation over the sample population (the accuracy of each class being P01, P02, ..., P0i, ...). Based on the classification results (the accuracies of classes with sample crossing are all comparatively low), select (manually or by simple matching) the category sets with sample crossing (each group of mutually crossing categories forms one category set, so there may be one or more category sets);
S205, sample recombination: merge the two or more classes with sample crossing, leaving the other classes unchanged;
S206, train and generate the first classifier: perform classification training on the training sample set after the merging operation, and likewise test the classification accuracy P1 by cross-validation over the sample population (the accuracy of each class being P11, P12, ..., P1j, ...);
S207, train and generate the second classifier: construct a classifier (one or more) over the new sample categories generated by the merging, and test its classification accuracy P2 by cross-validation over the sample population (the accuracy of each class being P21, P22, ..., P2k, ...);
S208, accuracy test: detect whether the classification accuracy of every category of the first classifier exceeds the first probability threshold and whether the classification accuracy of every category of the second classifier exceeds the second probability threshold;
if so, determine that the first classifier and the second classifier have been trained successfully;
if not, re-perform the merging operation and classifier training on the at least two original sample categories with sample crossing in the training sample set, until the classification accuracy of every category of the first classifier exceeds the first probability threshold and the classification accuracy of every category of the second classifier exceeds the second probability threshold.
Accordingly, as shown in Fig. 3, embodiments of the present invention also provide a text classification method, which classifies using a classifier generated by any of the text classifier generation methods provided by the foregoing embodiments. The classification method includes:
S31, inputting a set of texts to be classified into the first classifier to obtain a first classification result;
S32, inputting the texts to be classified whose category in the first classification result is a new sample category into the second classifier corresponding to that new sample category, to obtain a second classification result.
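Steps S31 and S32 amount to routing each text through the first classifier and, when it lands in a merged category, through that category's second classifier. The stub classifiers below are hypothetical placeholders for trained models:

```python
def two_stage_classify(text, first_classifier, second_classifiers):
    """S31: the first classifier assigns a category. S32: if that category
    is a new (merged) sample category, the matching second classifier
    refines it to an original sample category; otherwise the first
    classifier's category is already final."""
    label = first_classifier(text)
    if label in second_classifiers:
        return second_classifiers[label](text)
    return label
```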
The text classification method provided by embodiments of the present invention applies a text classifier generated by any of the text classifier generation methods provided by the foregoing embodiments. In this way, through the training of the first classifier, the original sample categories with sample crossing are accurately distinguished from those without; through the training of the second classifier, the original sample categories with sample crossing are separated from one another, and finer-grained classification training is performed within a narrower scope, substantially improving the accuracy of text classification.
Optionally, the text classification method provided by the embodiment of the present invention may further include: taking the classification results in the first classification result whose categories are unmerged original sample categories, together with the second classification result, as the final classification result of the texts to be classified.
The text classification method provided by embodiments of the present invention applies a text classifier generated by any of the foregoing text classifier generation methods; the specific classification process and principle have been described in detail above and are not repeated here.
Accordingly, as shown in Fig. 4, embodiments of the present invention also provide a text classifier generation apparatus, including:
a merging unit 41, configured to merge at least two original sample categories with sample crossing in the training sample set to obtain a new sample category, the training sample set including multiple original sample categories and each training sample belonging to one of them;
a first training unit 42, configured to train on the training sample set after the merging operation to obtain the first classifier;
a second training unit 43, configured to train on the training samples belonging to the new sample category and the original sample categories to which those training samples belong, to obtain the second classifier, which is used to reclassify the texts to be classified whose category in the classification result of the first classifier is the new sample category, assigning them to the corresponding original sample categories.
When training a text classifier from training samples, the text classifier generation apparatus provided by embodiments of the present invention adopts a hierarchical classifier generation scheme: the two or more original sample categories with sample crossing in the training sample set are first merged into a new sample category, the first classifier is trained on the merged training sample set, and finer-grained classification training is then performed on the new sample category to obtain the second classifier. In this way, through the training of the first classifier, the original sample categories with sample crossing are accurately distinguished from those without; through the training of the second classifier, the original sample categories with sample crossing are separated from one another, and finer-grained classification training is performed within a narrower scope, substantially improving the classification accuracy of the text classifier.
Specifically, the sample crossing described in embodiments of the present invention means that in the provided training sample set the categories to which sample data belong are not clearly delimited; for example, if sample data that should belong to class A is instead placed in class B, sample crossing is considered to exist between class A and class B. Because the text classifier is trained from these training sample sets, such sample crossing will inevitably affect the classification accuracy of the resulting text classifier. The text classifier generation apparatus provided by the embodiment of the present invention can effectively correct such crossing between sample classes, as described specifically below.
Optionally, the merging unit may specifically be configured to merge all original sample categories with sample crossing into one new sample category.
Optionally, when the merging unit 41 performs the merging operation, the training sample set includes multiple original sample categories corresponding to the target categories desired by the user, and each training sample belongs to one of them. In one embodiment of the present invention, the training sample set includes four original sample categories A, B, C, and D, with sample crossing between original sample category A and original sample category C; A and C can then be merged to generate a new sample category G, so that the merged training sample set includes original sample categories B and D and new sample category G.
Accordingly, the first training unit 42 trains on the training sample set after the merging operation to obtain the first classifier, i.e., every element of the training sample set is assigned to sample category B, D, or G. The first classifier is trained through such classification.
Of course, there may be more than two original sample categories with sample crossing, and more than one new sample category may result from the merging; embodiments of the present invention place no limit on this, as long as the merging helps correct sample crossing and improve classification accuracy.
For example, sample crossing may exist among all four original sample categories A, B, C, and D of the above embodiment, or sample crossing may exist between A and B and, separately, between C and D. Accordingly, the merging operation may merge the four original sample categories A, B, C, and D into a single new sample category G1, i.e., merge all original sample categories with sample crossing into one new sample category; or it may merge A and B into a new sample category G2 and C and D into a new sample category G3, i.e., merge the original sample categories with sample crossing into multiple new sample categories.
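Both merging strategies reduce to a label remapping over the training set. The group names G1, G2, and G3 follow the example above:

```python
def merge_categories(labels, merge_groups):
    """Remap original sample categories into new sample categories.
    merge_groups maps each new category name to the set of original
    categories it absorbs; unlisted categories are left unchanged."""
    remap = {orig: new for new, group in merge_groups.items() for orig in group}
    return [remap.get(label, label) for label in labels]
```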
Optionally, the first training unit 42 is specifically configured to train the training sample set after the merging operation with at least one of the following classification algorithms to obtain the first classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbour (KNN) classification algorithm, and the random forest classification algorithm.
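As one concrete instance of the listed algorithms, a K-nearest-neighbour classifier over token-overlap similarity can be sketched as follows. The training data is a toy example; a practical first classifier would use proper feature vectors such as TF-IDF:

```python
from collections import Counter

def knn_classify(train, query_tokens, k=3):
    """KNN text classification sketch: rank training samples by the number
    of tokens shared with the query and take a majority vote among the
    top k neighbours."""
    def similarity(tokens):
        return len(set(tokens) & set(query_tokens))
    ranked = sorted(train, key=lambda item: similarity(item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```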
It should be noted that the original sample categories with sample crossing may be known in advance or may need to be identified. When the sample size of the training sample set and the data volume of the original sample categories are very large, the text classifier generation apparatus may further include a screening unit configured to: before the at least two original sample categories with sample crossing in the training sample set are merged to obtain the new sample category, train a third classifier whose categories are the multiple original sample categories according to the training sample set, test the classification accuracy of each category of the third classifier, and preliminarily screen out the original sample categories whose classification accuracy is below a third threshold;
and identify the original sample categories with sample crossing among the original sample categories selected by the preliminary screening.
The principle of this identification method is that if sample crossing exists between classes in the training sample set, it will inevitably degrade the classification accuracy of the trained classifier. The sample crossing situation of the training sample set can therefore be preliminarily screened by classification accuracy, after which the original sample categories with sample crossing are identified from among the screened categories.
Specifically, the original sample categories whose accuracy is below the third threshold may be subjected to manual verification or machine data matching to identify those with sample crossing.
For example, the higher the third threshold is set, the more sensitive the detection of sample crossing. Suppose that among the original sample categories of the training sample set, some sample data that should belong to class A is instead placed in class B, while class A contains no samples belonging to other classes. In that case the classification accuracy of class B under the third classifier will fall below the third threshold while the accuracy of class A is unaffected; manual verification or machine data matching will then show that sample crossing exists between class A and class B in the training sample set. If, instead, some sample data that should belong to class A is placed in class B and some sample data that should belong to class B is placed in class A, the classification accuracies of both class A and class B under the third classifier will fall below the third threshold, and manual verification or machine data matching can determine whether class A and class B also cross with other classes.
Testing the classification accuracy of each category of the third classifier may specifically be performed by cross-validation over the sample population.
After the first classifier is trained, the second training unit 43 can train the second classifier. Specifically, the second classifier of each new sample category can be obtained by training on the training samples belonging to that new sample category, using as targets the original sample categories to which those training samples belong.
Continuing the above embodiment: if sample crossing exists between original sample category A and original sample category C, categories A and C are merged to generate new sample category G. Classification training is then performed on the merged training sample set so that each element of the training sample set is assigned to sample category B, D, or G, yielding the first classifier. After the first classifier is obtained, classification training is performed within new sample category G to subdivide the elements of G into class A and class C, where A and C are the original sample categories to which the training samples belonging to new sample category G belong.
In this way, through the training of the second classifier, the original sample categories with sample crossing are separated from one another, and finer-grained classification training is performed within a narrower scope, substantially improving the classification accuracy of the text classifier.
Optionally, the second training unit 43 is specifically configured to train with at least one of the following classification algorithms to obtain the second classifier: the naive Bayes (NB) classification algorithm, the support vector machine (SVM) classification algorithm, the K-nearest-neighbour (KNN) classification algorithm, and the random forest classification algorithm. The categories output by the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
Further, the training sample set in the above embodiments can be obtained by processing a training corpus. To obtain the training sample set, the text classifier generation apparatus provided by the embodiment of the present invention may further include:
a preprocessing unit, configured to preprocess the training corpus to filter it and/or unify its format before the at least two original sample categories with sample crossing in the training sample set are merged to obtain the new sample category;
a word segmentation unit, configured to perform word segmentation on the preprocessed training corpus according to a segmentation dictionary to obtain the training sample set.
Specifically, the collected training corpus includes sentences and/or text fragments, and its concrete form may be any of speech, text, images, and so on. The corpus is first preprocessed to unify its format into text, invalid formats are filtered out, and the result is saved for later use. Word segmentation is then performed on the preprocessed corpus according to the segmentation dictionary to obtain the training sample set.
Optionally, the apparatus may further include a new-word discovery unit, configured to perform a new-word discovery operation on the training corpus and add the discovered new words to the segmentation dictionary. In this way new words obtained by the new-word discovery method update the dictionary, and subsequent word segmentation is performed with the updated dictionary, so that the dictionary is continuously improved and the accuracy of word segmentation is effectively raised.
Optionally, the new-word discovery operation can be implemented by one or more of the following: mutual information, co-occurrence probability, and information entropy.
To determine the classification performance of the generated first classifier and second classifier, the apparatus may further include:
a first test unit, configured to test the classification accuracy of each category of the first classifier;
a second test unit, configured to test the classification accuracy of each category of the second classifier;
where the classification accuracies of the first classifier are P1j, j being an integer with 1 ≤ j ≤ m and m being the number of sample categories in the training sample set after the merging operation;
the classification accuracies of the second classifier are P1h*P2k, k being an integer with 1 ≤ k ≤ n and n being the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category under the first classifier, h being an integer with 1 ≤ h ≤ g and g being the number of new sample categories;
a detection unit, configured to detect whether the classification accuracy of every category of the first classifier exceeds the first probability threshold and whether the classification accuracy of every category of the second classifier exceeds the second probability threshold;
a determination unit, configured to determine, if the detection result of the detection unit is yes, that the first classifier and the second classifier have been trained successfully.
For example, suppose the first probability threshold is 0.98 and the second probability threshold is 0.95. If, during testing, the classification accuracy of every category in the results of the first classifier exceeds 0.98 and the classification accuracy of every category in the results of the second classifier exceeds 0.95, the classification accuracy of the text classifier generated by the text classifier generation method provided by the embodiment of the present invention has met the user's requirements.
Optionally, the classification accuracy may be tested with data whose attributes are the same as or similar to those of each category of the training sample set, the data being labelled with the relevant categories. Such data can be constructed by an algorithm or obtained by cross-validation.
Optionally, the first test unit is specifically configured to test the classification accuracy of the first classifier by cross-validation over the sample population, and the second test unit is specifically configured to test the classification accuracy of the second classifier by cross-validation over the sample population.
Here the sample population refers to all sample data related to the classification task. Cross-validation over the sample population takes one part of the population as the training sample set and another part as the test sample set; for example, 60% to 90% (e.g. 80%) of the sample population may be used as the training sample set and the remaining samples used as text to be classified for testing. The sample population includes the multiple original sample categories, and each sample belongs to one of them.
Optionally, the apparatus may further include a return unit, configured to, if the detection result of the detection unit is no, re-perform the merging operation and classifier training on the at least two original sample categories with sample crossing in the training sample set, until the classification accuracy of every category of the first classifier exceeds the first probability threshold and the classification accuracy of every category of the second classifier exceeds the second probability threshold.
That is, if, after evaluating the first classifier and the second classifier, the classification accuracy of any category is found to be below its corresponding probability threshold, the first classifier and the second classifier have not yet met the user's requirements; the merging operation and classifier training must therefore be repeated on the original sample categories with sample crossing until the accuracies reach the above thresholds.
Optionally, to accommodate different requirements on classifier accuracy, in one embodiment of the present invention the apparatus may further include an adjustment unit, configured to adjust the first probability threshold and the second probability threshold to select first and second classifiers of different classification accuracies.
Accordingly, as shown in FIG. 5, an embodiment of the present invention further provides a text classification apparatus, which classifies texts using the classifiers generated by any text classifier generating apparatus of the foregoing embodiments. The classification apparatus includes:
a first input unit 51, configured to input a text set to be classified into the first classifier to obtain a first classification result;
a second input unit 52, configured to input the texts to be classified whose classification category in the first classification result is a new sample category into the second classifier corresponding to that new sample category, to obtain a second classification result.
The text classification apparatus provided by the embodiments of the present invention applies the text classifiers generated by any text classifier generating apparatus of the foregoing embodiments. In this way, the training of the first classifier accurately separates the original sample categories that have sample overlap from those that do not, while the training of the second classifier separates the overlapping original sample categories from one another, performing finer-grained classification training within a narrower scope and thereby substantially improving the accuracy of text classification.
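The two-stage scheme this paragraph describes can be sketched as follows. The toy keyword classifiers and the category names ("weather", "finance", "sports") are invented stand-ins for trained first and second classifiers:

```python
def classify_texts(texts, first_clf, second_clfs):
    """Two-stage classification.

    `first_clf` maps a text to either an unmerged original category or a
    merged new-sample category; `second_clfs` maps each new-sample
    category to the second classifier that separates it back into its
    original categories. Texts routed to an unmerged category pass
    through unchanged, and both stages' results form the final output.
    """
    final = {}
    for text in texts:
        category = first_clf(text)
        if category in second_clfs:          # merged category: re-classify
            category = second_clfs[category](text)
        final[text] = category               # unmerged categories pass through
    return final

# Toy first classifier: "finance" and "sports" overlap and were merged
# into the new category "finance+sports"; "weather" was left unmerged.
def first_clf(text):
    if "rain" in text:
        return "weather"
    return "finance+sports"

# Toy second classifier for the merged category only.
second_clfs = {
    "finance+sports": lambda t: "sports" if "match" in t else "finance",
}

result = classify_texts(
    ["rain expected today", "stock prices fell", "the match ended 2-1"],
    first_clf, second_clfs)
```

Only texts that land in a merged category pay the cost of the second stage; everything else is decided by the first classifier alone.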
Further, the text classification apparatus may also include:
a result output unit, configured to take the classification results whose classification categories in the first classification result are unmerged original sample categories, together with the second classification result, as the final classification result of the text to be classified.
Accordingly, an embodiment of the present invention further provides a computer device including a processor and a memory. The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform any text classifier generation method provided by the foregoing embodiments; the corresponding technical effects can likewise be achieved. These have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer device including a processor and a memory. The memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform any text classification method provided by the foregoing embodiments; the corresponding technical effects can likewise be achieved. These have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium in which instructions are stored. When run, the instructions perform any text classifier generation method provided by the foregoing embodiments; the corresponding technical effects can likewise be achieved. These have been described in detail above and are not repeated here.
Accordingly, an embodiment of the present invention further provides a computer-readable storage medium in which instructions are stored. When run, the instructions perform any text classification method provided by the foregoing embodiments; the corresponding technical effects can likewise be achieved. These have been described in detail above and are not repeated here.
It should be noted that, as used herein, the terms "comprising", "including", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments may be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing over the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention and are not intended to limit its scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of protection of the present invention.
Claims (38)
1. A text classifier generation method, characterized by comprising:
merging at least two original sample categories having sample overlap in a training sample set to obtain a new sample category, wherein the training sample set comprises multiple original sample categories, and each training sample belongs to one of the multiple original sample categories;
training a first classifier according to the training sample set after the merge operation;
training a second classifier according to the training samples belonging to the new sample category and the original sample categories to which the training samples belonging to the new sample category belong, wherein the second classifier is used to re-classify the texts to be classified whose classification category in the classification result of the first classifier is the new sample category, so as to classify them into the corresponding original sample categories.
2. The method according to claim 1, characterized in that training the first classifier according to the training sample set after the merge operation comprises:
training with at least one of the following classification algorithms on the training sample set after the merge operation to obtain the first classifier: the Naive Bayes (NB) classification algorithm, the Support Vector Machine (SVM) classification algorithm, the K-Nearest Neighbor (KNN) classification algorithm, and the Random Forest classification algorithm.
3. The method according to claim 1, characterized in that training the second classifier according to the training samples belonging to the new sample category and the original sample categories to which those training samples belong comprises:
training separately, for each new sample category, according to the training samples belonging to that new sample category and the original sample categories to which they belong, to obtain a second classifier for each new sample category.
4. The method according to claim 1, characterized in that:
the second classifier is obtained by training with at least one of the following classification algorithms on the training samples belonging to the new sample category and the original sample categories to which they belong: the Naive Bayes (NB) classification algorithm, the Support Vector Machine (SVM) classification algorithm, the K-Nearest Neighbor (KNN) classification algorithm, and the Random Forest classification algorithm; the classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
5. The method according to claim 1, characterized in that, before merging the at least two original sample categories having sample overlap in the training sample set to obtain the new sample category, the method further comprises:
preprocessing a training corpus to filter the training corpus and/or unify its format;
performing word segmentation on the preprocessed training corpus according to a segmentation dictionary to obtain the training sample set.
6. The method according to claim 5, characterized in that the training corpus comprises sentences and/or text fragments.
7. The method according to claim 5, characterized by further comprising: performing a new-word discovery operation on the training corpus and adding the discovered new words to the segmentation dictionary.
8. The method according to claim 7, characterized in that the new-word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
9. The method according to claim 1, characterized by further comprising:
testing the classification accuracy of each classification category in the first classifier;
testing the classification accuracy of each classification category in the second classifier;
wherein the classification accuracies of the first classifier are respectively P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample categories in the training sample set after the merge operation;
the classification accuracies of the second classifier are respectively P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample categories;
detecting whether the classification accuracy of each classification category in the first classifier is greater than a first probability threshold, and whether the classification accuracy of each classification category in the second classifier is greater than a second probability threshold;
if so, determining that the first classifier and the second classifier are trained successfully.
10. The method according to claim 9, characterized in that, after detecting whether the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and whether the classification accuracy of each classification category in the second classifier is greater than the second probability threshold, the method further comprises:
if not, re-performing the merge operation on at least two original sample categories having sample overlap in the training sample set and retraining the classifiers, until the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and the classification accuracy of each classification category in the second classifier is greater than the second probability threshold.
11. The method according to claim 10, characterized by further comprising:
adjusting the first probability threshold and the second probability threshold to screen out first classifiers and second classifiers with different classification accuracies.
12. The method according to claim 9, characterized in that:
the classification accuracy of the first classifier is tested using cross-validation based on the sample population;
the classification accuracy of the second classifier is tested using cross-validation based on the sample population.
13. The method according to claim 12, characterized in that the cross-validation based on the sample population comprises: taking 60% to 90% of the samples in the sample population as the training sample set, and using the remaining samples as text to be classified for testing; the sample population comprises the multiple original sample categories, and each sample belongs to one of the multiple original sample categories.
14. The method according to claim 1, characterized in that merging the at least two original sample categories having sample overlap into a new sample category comprises:
merging all original sample categories having sample overlap into one new sample category.
15. The method according to any one of claims 1 to 14, characterized in that, before merging the at least two original sample categories having sample overlap in the training sample set to obtain the new sample category, the method further comprises:
training a third classifier whose classification categories are the multiple original sample categories according to the training sample set, testing the classification accuracy of each classification category of the third classifier, and preliminarily screening out the original sample categories whose classification accuracy is lower than a third threshold;
identifying the original sample categories having sample overlap among the preliminarily screened original sample categories.
16. A text classification method, characterized in that classification is performed using the classifiers generated by the text classifier generation method according to any one of claims 1 to 15, the classification method comprising:
inputting a text set to be classified into the first classifier to obtain a first classification result;
inputting the texts to be classified whose classification category in the first classification result is the new sample category into the second classifier corresponding to the new sample category, to obtain a second classification result.
17. The method according to claim 16, characterized by further comprising:
taking the classification results whose classification categories in the first classification result are unmerged original sample categories, together with the second classification result, as the final classification result of the text to be classified.
18. A text classifier generating apparatus, characterized by comprising:
a merging unit, configured to merge at least two original sample categories having sample overlap in a training sample set to obtain a new sample category, wherein the training sample set comprises multiple original sample categories, and each training sample belongs to one of the multiple original sample categories;
a first training unit, configured to train a first classifier according to the training sample set after the merge operation;
a second training unit, configured to train a second classifier according to the training samples belonging to the new sample category and the original sample categories to which those training samples belong, wherein the second classifier is used to re-classify the texts to be classified whose classification category in the classification result of the first classifier is the new sample category, so as to classify them into the corresponding original sample categories.
19. The apparatus according to claim 18, characterized in that the first training unit is specifically configured to:
train with at least one of the following classification algorithms on the training sample set after the merge operation to obtain the first classifier: the Naive Bayes (NB) classification algorithm, the Support Vector Machine (SVM) classification algorithm, the K-Nearest Neighbor (KNN) classification algorithm, and the Random Forest classification algorithm.
20. The apparatus according to claim 18, characterized in that the second training unit is specifically configured to:
train separately, for each new sample category, according to the training samples belonging to that new sample category and the original sample categories to which they belong, to obtain a second classifier for each new sample category.
21. The apparatus according to claim 18, characterized in that:
the second training unit is specifically configured to train with at least one of the following classification algorithms to obtain the second classifier: the Naive Bayes (NB) classification algorithm, the Support Vector Machine (SVM) classification algorithm, the K-Nearest Neighbor (KNN) classification algorithm, and the Random Forest classification algorithm; the classification categories of the second classifier correspond to the original sample categories to which the training samples in the new sample category belong.
22. The apparatus according to claim 18, characterized by further comprising:
a preprocessing unit, configured to, before the at least two original sample categories having sample overlap in the training sample set are merged to obtain the new sample category, preprocess a training corpus to filter the training corpus and/or unify its format;
a segmentation unit, configured to perform word segmentation on the preprocessed training corpus according to a segmentation dictionary to obtain the training sample set.
23. The apparatus according to claim 22, characterized in that the training corpus comprises sentences or text fragments.
24. The apparatus according to claim 22, characterized by further comprising a new-word discovery unit, configured to perform a new-word discovery operation on the training corpus and add the discovered new words to the segmentation dictionary.
25. The apparatus according to claim 24, characterized in that the new-word discovery operation is implemented by at least one of the following: mutual information, co-occurrence probability, and information entropy.
26. The apparatus according to claim 18, characterized by further comprising:
a first test unit, configured to test the classification accuracy of each classification category in the first classifier;
a second test unit, configured to test the classification accuracy of each classification category in the second classifier;
wherein the classification accuracies of the first classifier are respectively P1j, where j is an integer greater than or equal to 1 and less than or equal to m, and m is the number of sample categories in the training sample set after the merge operation;
the classification accuracies of the second classifier are respectively P1h*P2k, where k is an integer greater than or equal to 1 and less than or equal to n, and n is the number of original sample categories to which the training samples in the new sample category belong; P1h is the classification accuracy of the new sample category in the first classifier, h is an integer greater than or equal to 1 and less than or equal to g, and g is the number of new sample categories;
a detection unit, configured to detect whether the classification accuracy of each classification category in the first classifier is greater than a first probability threshold, and whether the classification accuracy of each classification category in the second classifier is greater than a second probability threshold;
a determining unit, configured to, when the detection result of the detection unit is yes, determine that the first classifier and the second classifier are trained successfully.
27. The apparatus according to claim 26, characterized by further comprising:
a returning unit, configured to, when the detection result of the detection unit is no, re-perform the merge operation on at least two original sample categories having sample overlap in the training sample set and retrain the classifiers, until the classification accuracy of each classification category in the first classifier is greater than the first probability threshold and the classification accuracy of each classification category in the second classifier is greater than the second probability threshold.
28. The apparatus according to claim 27, characterized by further comprising:
an adjustment unit, configured to adjust the first probability threshold and the second probability threshold to screen out first classifiers and second classifiers with different classification accuracies.
29. The apparatus according to claim 27, characterized in that:
the first test unit is specifically configured to test the classification accuracy of the first classifier using cross-validation based on the sample population;
the second test unit is specifically configured to test the classification accuracy of the second classifier using cross-validation based on the sample population.
30. The apparatus according to claim 29, characterized in that the cross-validation based on the sample population comprises: taking 60% to 90% of the samples in the sample population as the training sample set, and using the remaining samples as text to be classified for testing; the sample population comprises the multiple original sample categories, and each sample belongs to one of the multiple original sample categories.
31. The apparatus according to claim 18, characterized in that the merging unit is specifically configured to merge all original sample categories having sample overlap into one new sample category.
32. The apparatus according to any one of claims 18 to 31, characterized by further comprising a screening unit, configured to, before the at least two original sample categories having sample overlap in the training sample set are merged to obtain the new sample category, train a third classifier whose classification categories are the multiple original sample categories according to the training sample set, test the classification accuracy of each classification category of the third classifier, and preliminarily screen out the original sample categories whose classification accuracy is lower than a third threshold; and identify the original sample categories having sample overlap among the preliminarily screened original sample categories.
33. A text classification apparatus, characterized in that classification is performed using the classifiers generated by the text classifier generating apparatus according to any one of claims 18 to 32, the classification apparatus comprising:
a first input unit, configured to input a text set to be classified into the first classifier to obtain a first classification result;
a second input unit, configured to input the texts to be classified whose classification category in the first classification result is the new sample category into the second classifier corresponding to the new sample category, to obtain a second classification result.
34. The apparatus according to claim 33, characterized by further comprising:
a result output unit, configured to take the classification results whose classification categories in the first classification result are unmerged original sample categories, together with the second classification result, as the final classification result of the text to be classified.
35. A computer device, characterized by comprising a processor and a memory; the memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform the text classifier generation method according to any one of claims 1 to 15.
36. A computer device, characterized by comprising a processor and a memory; the memory is used to store computer instructions, and the processor is used to run the computer instructions stored in the memory, so as to perform the text classification method according to claim 16 or 17.
37. A computer-readable storage medium, characterized in that instructions are stored in the storage medium, and when run, the instructions perform the text classifier generation method according to any one of claims 1 to 15.
38. A computer-readable storage medium, characterized in that instructions are stored in the storage medium, and when run, the instructions perform the text classification method according to claim 16 or 17.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710457280.2A CN107273500A (en) | 2017-06-16 | 2017-06-16 | Text classifier generation method, file classification method, device and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107273500A true CN107273500A (en) | 2017-10-20 |
Family
ID=60066353
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101876987A (en) * | 2009-12-04 | 2010-11-03 | 中国人民解放军信息工程大学 | Overlapped-between-clusters-oriented method for classifying two types of texts |
US20130103695A1 (en) * | 2011-10-21 | 2013-04-25 | Microsoft Corporation | Machine translation detection in web-scraped parallel corpora |
CN106503254A (en) * | 2016-11-11 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | Language material sorting technique, device and terminal |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844553A (en) * | 2017-10-31 | 2018-03-27 | 山东浪潮通软信息科技有限公司 | A kind of file classification method and device |
CN108038208A (en) * | 2017-12-18 | 2018-05-15 | 深圳前海微众银行股份有限公司 | Training method, device and the storage medium of contextual information identification model |
CN108038208B (en) * | 2017-12-18 | 2022-01-11 | 深圳前海微众银行股份有限公司 | Training method and device of context information recognition model and storage medium |
CN109961063A (en) * | 2017-12-26 | 2019-07-02 | 杭州海康机器人技术有限公司 | Method for text detection and device, computer equipment and storage medium |
CN108229564A (en) * | 2018-01-05 | 2018-06-29 | 阿里巴巴集团控股有限公司 | A kind of processing method of data, device and equipment |
CN108710651A (en) * | 2018-05-08 | 2018-10-26 | 华南理工大学 | A kind of large scale customer complaint data automatic classification method |
CN108710651B (en) * | 2018-05-08 | 2022-03-25 | 华南理工大学 | Automatic classification method for large-scale customer complaint data |
CN108920694B (en) * | 2018-07-13 | 2020-08-28 | 鼎富智能科技有限公司 | Short text multi-label classification method and device |
CN108920694A (en) * | 2018-07-13 | 2018-11-30 | 北京神州泰岳软件股份有限公司 | A kind of short text multi-tag classification method and device |
WO2020034126A1 (en) * | 2018-08-15 | 2020-02-20 | 深圳先进技术研究院 | Sample training method, classification method, identification method, device, medium, and system |
CN109359186B (en) * | 2018-10-25 | 2020-12-08 | 杭州时趣信息技术有限公司 | Method and device for determining address information and computer readable storage medium |
CN109359186A (en) * | 2018-10-25 | 2019-02-19 | 杭州时趣信息技术有限公司 | A kind of method, apparatus and computer readable storage medium of determining address information |
CN110489545A (en) * | 2019-07-09 | 2019-11-22 | 平安科技(深圳)有限公司 | File classification method and device, storage medium, computer equipment |
CN112396084A (en) * | 2019-08-19 | 2021-02-23 | ***通信有限公司研究院 | Data processing method, device, equipment and storage medium |
CN111339250A (en) * | 2020-02-20 | 2020-06-26 | 北京百度网讯科技有限公司 | Mining method of new category label, electronic equipment and computer readable medium |
CN111339250B (en) * | 2020-02-20 | 2023-08-18 | 北京百度网讯科技有限公司 | Mining method for new category labels, electronic equipment and computer readable medium |
US11755654B2 (en) | 2020-02-20 | 2023-09-12 | Beijing Baidu Netcom Science Technology Co., Ltd. | Category tag mining method, electronic device and non-transitory computer-readable storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20171020