CN103902570A - Text classification feature extraction method, classification method and device - Google Patents


Info

Publication number
CN103902570A
CN103902570A (application CN201210578378.0A; granted as CN103902570B)
Authority
CN
China
Prior art keywords
text
feature
words
feature words
vector
Prior art date
Legal status
Granted
Application number
CN201210578378.0A
Other languages
Chinese (zh)
Other versions
CN103902570B (en)
Inventor
李鑫 (Li Xin)
张延祥 (Zhang Yanxiang)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210578378.0A (granted as CN103902570B)
Publication of CN103902570A
Application granted
Publication of CN103902570B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a text classification feature extraction method, a classification method, and a device. The feature extraction method includes: obtaining a feature word set of the training set texts; determining a feature score for each feature word according to its relevance to the preset text categories; and recording the feature words whose scores exceed a preset score threshold, to obtain the text feature set of the training set texts. With this method, the number of feature words can be effectively reduced while feature words that express the text information are still obtained, so that when texts are classified the classification run time is shortened, computation time and space overhead are reduced, and computation cost is saved.

Description

Text classification feature extraction method, classification method, and device
Technical field
The present invention relates to the field of text classification, and in particular to a text classification feature extraction method, classification method, and device.
Background
With the rapid development of Internet technology, the quantity of network text has grown explosively, and effectively managing these texts is a current hot issue. Text classification, as a key technique for managing mass data, is widely used.
Current statistics-based text classification methods learn from already-classified texts and can then classify new example texts reasonably well. To classify a new example, the example text must first undergo word segmentation to obtain a word set containing a number of words, and classification is then performed on all the words in that set. The inventors found, when implementing the prior art, that when the example text is long and segmentation yields a large number of words, the classification performance of this approach is poor.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a text classification feature extraction method, classification method, and device that can improve classification performance.
To solve the above technical problem, an embodiment of the present invention provides a text classification feature extraction method, comprising:
obtaining a feature word set of the training set text;
determining a feature score for each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word;
recording the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
Obtaining the feature word set of the training set text comprises:
performing word segmentation on the training set text to obtain the word set of the training set text;
deleting the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
Deleting the stop words in the word set to obtain the feature word set comprises:
comparing each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
deleting, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
Determining the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word comprises:
determining the relevance of each feature word in the feature word set to each preset text category;
determining a length weight for each feature word according to its word length;
determining the feature score of each feature word according to its relevance and length weight.
Determining the feature score of each feature word according to its relevance and length weight comprises:
determining, according to the relevance of the feature word, the class discrimination ability of the feature word in each corresponding text category;
determining the sum of the class discrimination abilities of the feature word over all preset text categories;
determining the feature score of each feature word according to the class discrimination ability sum and the length weight.
When determining the relevance of each feature word in the feature word set to the preset text categories, the relevance is computed as:

R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|

where R_jk denotes the relevance of feature word t_k to text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in category C_j.
When determining the length weight of each feature word according to its word length, the length weight is computed as:

weight(len(t_k)) = log(e + len(t_k))

where e is a preset natural value and len(t_k) is the length of feature word t_k.
When determining, according to the relevance of each feature word, the class discrimination ability of the feature word in each corresponding text category, the class discrimination ability is computed as:

Diff_jk = min(|R_jk − R_ik|), i ≠ j

where Diff_jk denotes the class discrimination ability of feature word t_k in text category C_j, R_jk denotes the relevance of feature word t_k to text category C_j, and R_ik denotes the relevance of feature word t_k to text category C_i.

When determining the sum of the class discrimination abilities of a feature word over all preset text categories, the sum is computed as:

Diff_k = Σ_{j=1}^{n} Diff_jk

where Diff_k is the sum of the class discrimination abilities of feature word t_k over all preset text categories.

When determining the feature score of each feature word according to the class discrimination ability sum and the length weight, the feature score is computed as:

f(t_k) = Diff_k × weight(len(t_k))

where f(t_k) is the feature score of feature word t_k.
Correspondingly, an embodiment of the present invention also provides a text classification method, comprising:
obtaining the feature word set of each text in the training set, and merging and deduplicating them to form the feature word set of the training set;
determining the feature score of each feature word according to the relevance of each feature word in the training set's feature word set to the preset text categories and its word length;
recording the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set;
obtaining, according to the text feature set of the training set, the feature word set of each text in the test set;
performing a text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
generating a text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
Performing the text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set comprises:
assigning an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
determining, according to the text feature set of the training set, the weight of each feature word in the text feature set of each text in the training set, and determining the weight of each feature word in the feature word set of each text in the test set, the weighting algorithm including the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
generating vectors from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set, yielding the text vector sets of the training set and the test set.
Generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, comprises:
normalizing each text vector in the text vector set of the training set so that the weight of each feature item in each text vector is projected into a preset numeric range;
generating the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model including a support vector machine (SVM) classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
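A minimal sketch of the normalization sub-step, assuming L2 normalization as the projection into a fixed range; the patent only specifies "a preset numeric range", so the exact scheme is an assumption.

```python
import math

# Normalize a sparse {index: weight} text vector so that its weights fall
# into a fixed range before SVM training. L2 normalization is one common
# choice; the patent does not fix the scheme.
def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return vec  # an all-zero vector is left unchanged
    return {i: w / norm for i, w in vec.items()}

print(l2_normalize({0: 3.0, 1: 4.0}))  # {0: 0.6, 1: 0.8}
```

After normalization, the vectors would be fed to an SVM trainer (e.g. an off-the-shelf library) to produce the classification model.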
Alternatively, generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, comprises:
generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model including a naive Bayes classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
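For the naive Bayes alternative, a compact sketch with add-one smoothing is given below; the two-document training set, the category names, and the smoothing choice are all illustrative, not from the patent.

```python
import math
from collections import Counter

# Tiny multinomial naive Bayes: class prior times smoothed per-word
# likelihoods, compared in log space. Training data are illustrative.
train = [(["stock", "fund"], "finance"), (["game", "player"], "sports")]
vocab = {w for doc, _ in train for w in doc}

def predict(doc):
    best, best_lp = None, -math.inf
    for label in {l for _, l in train}:
        counts = Counter(w for d, l in train for w in d if l == label)
        total = sum(counts.values())
        prior = sum(1 for _, l in train if l == label) / len(train)
        lp = math.log(prior) + sum(
            math.log((counts[w] + 1) / (total + len(vocab)))  # add-one smoothing
            for w in doc
        )
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict(["stock"]))  # → finance
```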
Correspondingly, an embodiment of the present invention also provides a text classification feature extraction device, comprising:
an acquisition module, configured to obtain the feature word set of the training set text;
a determination module, configured to determine the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word;
a recording module, configured to record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
The acquisition module comprises:
a word segmentation unit, configured to perform word segmentation on the training set text to obtain the word set of the training set text;
a deletion unit, configured to delete the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
The deletion unit comprises:
a comparison subunit, configured to compare each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
a deletion subunit, configured to delete, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
The determination module comprises:
a first determining unit, configured to determine the relevance of each feature word in the feature word set to each preset text category;
a second determining unit, configured to determine the length weight of each feature word according to its word length;
a third determining unit, configured to determine the feature score of each feature word according to its relevance and length weight.
The third determining unit is specifically configured to determine, according to the relevance of a feature word, its class discrimination ability in each corresponding text category, determine the sum of its class discrimination abilities over all preset text categories, and determine the feature score of each feature word according to the class discrimination ability sum and the length weight.
Correspondingly, an embodiment of the present invention also provides a text classification device, comprising:
a feature extraction module, configured to obtain the feature word set of each text in the training set, merge and deduplicate them to form the feature word set of the training set, determine the feature score of each feature word according to the relevance of each feature word in the training set's feature word set to the preset text categories and its word length, and record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set;
an acquisition module, configured to obtain, according to the text feature set of the training set, the feature word set of each text in the test set;
a vector determination module, configured to perform the text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and in the test set, forming the text vector sets of the training set and the test set;
a classification module, configured to generate a text classification model according to the text vector set of the training set, and to classify each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
The vector determination module comprises:
an index assignment unit, configured to assign an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
a weight determining unit, configured to determine, according to the text feature set of the training set, the weight of each feature word in the text feature set of each text in the training set, and to determine the weight of each feature word in the feature word set of each text in the test set, the weighting algorithm including the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
a vector determining unit, configured to generate vectors from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set, yielding the text vector sets of the training set and the test set.
The classification module comprises:
a model generation unit, configured to normalize each text vector in the text vector set of the training set so that the weight of each feature item is projected into a preset numeric range, and to generate the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model including a support vector machine (SVM) classification model;
a first classification unit, configured to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
Alternatively, the classification module comprises:
a second classification unit, configured to generate the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model including a naive Bayes classification model, and to classify each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
Implementing the embodiments of the present invention has the following beneficial effects:
After word segmentation yields the word set, the embodiments of the present invention further perform feature extraction on the word set according to the relevance of each feature word to the text categories and the length of each feature word. The number of feature words can thus be effectively reduced while feature words that express the text information are still obtained, which shortens classification run time, reduces the time and space overhead of classification, and lowers classification cost.
Brief description of the drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a text classification feature extraction method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of another text classification feature extraction method according to an embodiment of the present invention;
Fig. 3 is a flow diagram of one concrete method of determining the feature score according to an embodiment of the present invention;
Fig. 4 is a flow diagram of a text classification method according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a text classification feature extraction device according to an embodiment of the present invention;
Fig. 6 is a structural diagram of one concrete form of the acquisition module in Fig. 5;
Fig. 7 is a structural diagram of one concrete form of the determination module in Fig. 5;
Fig. 8 is a structural diagram of a text classification device according to an embodiment of the present invention;
Fig. 9 is a structural diagram of one concrete form of the vector determination module in Fig. 8;
Fig. 10 is a structural diagram of one concrete form of the classification module in Fig. 8.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a flow diagram of a text classification feature extraction method according to an embodiment of the present invention. The method can be applied in various text application servers. The training set involved is a preset collection of multiple texts, referred to as the training set texts, whose categories are known. Feature extraction according to this embodiment is performed on the training set texts so that a classification model can be generated from the extraction result and used to classify the texts of an unknown test set. Specifically, the method comprises:
S101: obtain the feature word set of the training set text.
The feature word set comprises the words that best reflect the meaning the training set text is intended to express. Specifically, in this embodiment, obtaining the feature word set of the training set text may comprise: performing word segmentation on the training set text to obtain its word set; and deleting the stop words in the word set, including modal particles and/or personal pronouns, to obtain the feature word set.
The training set text is text whose category has already been labeled, serving as the reference for classifying test set texts of unknown categories; it may be, for example, microblog posts, press releases, or papers whose categories are known. Word segmentation decomposes each sentence of the training set text into words, converting the text into a set of words. Segmentation can be performed with existing segmentation methods, which are not repeated here.
Deleting stop words includes deleting punctuation marks and words without special significance such as modal particles and personal pronouns. These stop words may appear in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
S102: determine the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word.
The relevance of a feature word to a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the preset category, to the total number of documents in that category.
Meanwhile, in general, the shorter a word is, the less information it expresses, and a single character expresses even less. The longer a word is, the better it can reflect the text category, so word length can be introduced into the feature scoring of feature words.
Since the score is derived from the word length and the relevance, the word length can be used as a weight and multiplied with the relevance to obtain the feature score. Different feature words have different relevances and word lengths, so their feature scores also vary, allowing the following step S103 to delete the feature words with lower scores and keep those with higher scores, to obtain the text feature set.
S103: record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
After the feature score of each feature word has been determined, the feature words in the feature word set whose scores are below the preset score threshold are deleted, and the remaining feature words form the text feature set of the training set text.
After word segmentation yields the word set, this embodiment further performs feature extraction on the word set according to the relevance of each feature word to the text categories and the length of each feature word. The number of feature words can thus be effectively reduced while feature words that express the text information are still obtained, which shortens classification run time, reduces the time and space overhead of classification, and lowers classification cost.
Referring to Fig. 2, a flow diagram of another text classification feature extraction method according to an embodiment of the present invention. This method can likewise be applied in various text application servers, so that after extracting features with it the server generates a classification model and classifies test set texts of unknown type. Specifically, the method comprises:
S201: perform word segmentation on the training set text to obtain the word set of the training set text.
S202: delete the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
Deleting stop words includes deleting punctuation marks and words without special significance such as modal particles and personal pronouns. These stop words may appear in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
In this embodiment, S202 may specifically comprise: comparing each segmented word in the word set with the preset stop words in a preset stop-word dictionary; and deleting, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
The stop words in the stop-word dictionary are entered in advance by the user and include words without special significance such as various particles and personal pronouns. Devices such as the text server delete the corresponding words from the word set of the training set text by comparing them one by one, obtaining the feature word set, i.e. the first feature word set.
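As a sketch of this comparison-and-deletion step, assuming a tiny hand-entered stop-word dictionary; the entries below are illustrative particles and pronouns, not the patent's dictionary.

```python
# Stop-word filtering: drop every segmented token found in the preset
# stop-word dictionary. Dictionary entries here are illustrative.
STOP_WORDS = {"的", "了", "吗", "我", "你", "他"}

def filter_stop_words(tokens):
    """Return the feature word set: tokens absent from the stop dictionary."""
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_stop_words(["我", "喜欢", "自然", "语言", "处理", "了"]))
# → ['喜欢', '自然', '语言', '处理']
```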
S203: determine the relevance of each feature word in the feature word set to each preset text category.
The relevance of a feature word to a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the category, to the total number of documents.
In this embodiment, the relevance may be computed concretely as:

R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|

where R_jk denotes the relevance of feature word t_k to text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in category C_j.
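The formula can be read directly as a document-count ratio; the sketch below computes it for made-up documents (represented as token sets) in one category.

```python
# R_jk: fraction of documents in category C_j that contain feature word t_k.
def relevance(word, docs_in_category):
    containing = sum(1 for doc in docs_in_category if word in doc)
    return containing / len(docs_in_category)

category_docs = [  # documents of one category, as token sets (illustrative)
    {"stock", "market", "fund"},
    {"stock", "price"},
    {"game", "player"},
]
print(relevance("stock", category_docs))  # 2 of 3 documents contain "stock"
```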
S204: determine the length weight of each feature word according to its word length.
In general, the shorter a word is, the less information it expresses, and a single character expresses even less. The longer a word is, the better it can reflect the text category, so word length can be introduced into the feature scoring of feature words.
In this embodiment, the length weight may be computed concretely as:

weight(len(t_k)) = log(e + len(t_k))

where e is a preset natural value obtained from the user's classification experience, and len(t_k) is the length of feature word t_k.
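A one-function sketch of the length weight. Euler's number is used for the preset value e here, but the patent treats e as a value chosen from classification experience, so that choice is an assumption.

```python
import math

# weight(len(t_k)) = log(e + len(t_k)); longer words get larger weights.
E_PRESET = math.e  # preset value; the patent leaves it to experience

def length_weight(word):
    return math.log(E_PRESET + len(word))

print(length_weight("abc") > length_weight("a"))  # True: longer word, larger weight
```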
S205: determine the feature score of each feature word according to its relevance and length weight.
Since the score is derived from the word length and the relevance, the word length can be used as a weight and multiplied with the relevance to obtain the feature score. Different feature words have different relevances and word lengths, so their feature scores also vary, allowing the lower-scored feature words to be deleted and the higher-scored ones kept, yielding the text feature set, i.e. the second feature word set.
S206: record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
After the feature score of each feature word has been determined, the feature words in the feature word set whose scores are below the preset score threshold are deleted, and the remaining feature words form the text feature set of the training set text.
Further concrete, then refer to Fig. 3, be wherein a kind of concrete schematic flow sheet of determining method of the feature score value of the embodiment of the present invention; In embodiments of the present invention, the concrete of described feature score value determines that method comprises:
S301: according to the degree of correlation of each feature word, determine its class discrimination ability in each corresponding text category.
In this embodiment, the class discrimination ability is computed as:
Diff_jk = min(|R_jk - R_ik|), i ≠ j;
Wherein, Diff_jk denotes the class discrimination ability of feature word t_k on text category C_j, R_jk denotes the degree of correlation between t_k and text category C_j, and R_ik denotes the degree of correlation between t_k and text category C_i.
The class discrimination ability value characterizes the difference between a feature word's representative ability in one category and in the other categories; the larger the difference, the stronger the feature word's ability to distinguish that category from the others.
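A minimal sketch of this per-category discrimination value, with made-up relevance numbers:

```python
def discrimination(relevances, j):
    """Diff_jk: smallest absolute gap between category j's relevance
    for a word and that word's relevance in every other category."""
    return min(abs(relevances[j] - r)
               for i, r in enumerate(relevances) if i != j)

# Made-up relevance values of one word in three categories.
rels = [0.9, 0.2, 0.1]
d_first = discrimination(rels, 0)  # min(0.7, 0.8) = 0.7: strongly tied to category 0
```

Taking the minimum gap is conservative: a word only scores well for a category when it separates that category from the closest competitor, not just from the average.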
S302: determine the sum of the class discrimination abilities of each feature word over all preset text categories.
In this embodiment, the class discrimination ability sum is computed as:
Diff_k = Σ_{j=1..n} Diff_jk;
Wherein, Diff_k is the sum of the class discrimination abilities of feature word t_k over all n preset text categories.
S303: determine the feature score of each feature word according to its class discrimination ability sum and length weight.
In this embodiment, the feature score is computed as:
f(t_k) = Diff_k × weight(len(t_k));
Wherein, f(t_k) is the feature score of feature word t_k.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Moreover, by determining each feature word's class discrimination ability in each category from the degree of correlation, and then determining and screening the feature scores from the class discrimination ability sum and the length weight, the feature words characterizing the target text category information in the training-set texts can be extracted more accurately, further guaranteeing the accuracy of feature word extraction.
Below, the text classification method of the embodiment of the present invention is described in detail.
Referring to Fig. 4, which is a flow diagram of a text classification method according to an embodiment of the present invention: the method first extracts the feature words of the target text with the text classification feature extraction method described above, and then classifies according to those feature words. The training set involved is a preset collection of multiple texts, called training-set texts, whose categories are known; the feature extraction of the embodiment of the present invention is performed on the training-set texts so that a corresponding classification model can be generated from the extraction result to classify the texts of an unknown test set. The method specifically comprises:
S401: obtain the feature word set of each text in the training set, and merge them with de-duplication to form the feature word set of the training set;
The training set comprises multiple texts that have already been labelled with categories, serving as a reference for classifying test-set texts of unknown category. After the feature word set of each text in the training set is obtained, the per-text feature word sets are processed: duplicated words are removed so that only one copy of each is retained, forming the feature word set of the whole training set.
Obtaining the feature word set of each training-set text in S401 specifically comprises: performing word segmentation on the training-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, to obtain the feature word set. Feature word extraction is performed on each text in the training set to obtain the text feature set of the training set.
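A toy sketch of this S401 preprocessing; whitespace splitting stands in for a real Chinese word segmenter, and the English stop-word list below is hypothetical:

```python
# Hypothetical stop-word list; the patent's stop words are modal
# particles and personal pronouns removed after Chinese segmentation.
STOP_WORDS = {"the", "a", "of", "i", "you"}

def feature_words(text):
    """Segment (here: whitespace split) and drop stop words."""
    return {t for t in text.lower().split() if t not in STOP_WORDS}

# Merge the per-text sets, removing duplicates, to form the feature
# word set of the whole training set.
texts = ["the market rallied", "a team won the match"]
vocabulary = set().union(*(feature_words(t) for t in texts))
```

Because each per-text result is already a set, the union performs the de-duplication S401 calls for.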
S402: according to the degree of correlation between each feature word in the training set's feature word set and the preset text categories, and the word length of each feature word, determine the feature score of each feature word.
S403: record the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training set.
The specific formulas and procedures used to determine the feature scores of the feature words and finally obtain the text feature set of the training set can be the concrete extraction methods and formulas of the text classification feature extraction embodiments corresponding to Fig. 2 and Fig. 3 above, and are not repeated here.
S404: according to the text feature set of the training set, obtain the feature word set of each text in the test set;
Processing each test-set text in S404 can specifically comprise: performing word segmentation on the test-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, while also deleting, according to the feature words in the text feature set of the training set, any words in the test-set text's word set that do not exist in that text feature set, to obtain the feature word set. The test set comprises one or more texts whose categories need to be determined, called test-set texts; these can include microblog content, press releases, paper texts and other texts of unknown category.
S405: perform text vectorisation according to the text feature set of the training set and the feature word set of each text in the test set, obtaining the text vector of each text in the training set and of each text in the test set, and forming the text vector set of the training set and the text vector set of the test set.
A vector set is the set of vectors obtained by converting each feature in the text feature set of each training-set text, or in the feature word set of each test-set text, into a corresponding vector. In this embodiment, S405 can specifically comprise: assigning an index to each feature word in the training set's text feature set and in each test-set text's feature word set; determining, according to the training set's text feature set, the weight of each feature word in each training-set text's feature set, and determining the weight of each feature word in each test-set text's feature word set, wherein the weighting algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm; and generating vectors from the indexes and weights of the feature words to obtain the text vector of each text in the training set and the test set, forming the text vector sets of the training set and the test set.
Specifically, the index values of the feature words in the training set's text feature set are assigned per text in the training set, so that each training-set text obtains text classification features with assigned index values: among the indexed features of a given text, a feature word that does not occur in that text is assigned index value 0, while a feature word that does occur is assigned index value 1. The TF-IDF weighting algorithm or the like is then used to determine the weight of each indexed feature word in each training-set text, yielding the text vector of each training-set text. For the test set, after index values are assigned to each test-set text, the TF-IDF weighting algorithm can be applied directly to determine the weight of each feature word in the text, so as to finally obtain the text vector of each test-set text.
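The index assignment and TF-IDF weighting of S405 can be sketched as below, using the common raw-count TF and log(N/df) IDF variant; the patent does not fix a particular TF-IDF formulation, so this is an assumption:

```python
import math

def tfidf_vectors(docs, vocabulary):
    """Assign each vocabulary word an index, then weight each document's
    words by raw term frequency times log(N / document frequency)."""
    index = {w: i for i, w in enumerate(sorted(vocabulary))}
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(index)
        for w in set(doc) & vocabulary:
            vec[index[w]] = doc.count(w) * math.log(n / df[w])
        vectors.append(vec)
    return vectors, index

# Two toy documents as token lists, over a four-word vocabulary.
docs = [["market", "stock", "market"], ["team", "match"]]
vecs, index = tfidf_vectors(docs, {"market", "stock", "team", "match"})
```

Words outside the vocabulary are simply skipped, which mirrors S404's removal of words absent from the training set's text feature set.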
Of course, the text vectorisation of the corresponding sets in S405 can also be realised directly with existing techniques.
S406: generate a text classification model from the text vector set of the training set, and classify each text vector in the text vector set of the test set according to the generated model, obtaining the category of each text in the test set.
Generating the text classification model from the vector sets of the training-set texts in S406 can specifically comprise: normalising each text vector in the training set's text vector set so that the weight of each feature item in each text vector is projected into a preset numerical range; generating the text classification model according to the normalised text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model; and classifying each text vector in the test set's text vector set according to the generated model to obtain the category of each test-set text. Alternatively it comprises: generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a naive Bayes classification model; and classifying each text vector in the test set's text vector set according to the generated model to obtain the category of each test-set text.
Because the text classification model is generated from texts of one or more different types, the generated model can classify texts of unknown category; the text category of the test-set texts can thus be determined through the text classification model.
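As one concrete instance of the naive Bayes option in S406 (the SVM option would typically use an existing library), here is a minimal multinomial naive Bayes with Laplace smoothing on a made-up two-category corpus; all names and data are illustrative:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            self.counts[y].update(d)
        return self

    def predict(self, doc):
        total = sum(self.prior.values())
        best, best_lp = None, float("-inf")
        for y, ny in self.prior.items():
            lp = math.log(ny / total)  # log prior of category y
            denom = sum(self.counts[y].values()) + len(self.vocab)
            for w in doc:
                lp += math.log((self.counts[y][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

# Made-up two-category training documents (as token lists).
train = [["goal", "match", "team"], ["stock", "market", "shares"]]
model = NaiveBayes().fit(train, ["sports", "finance"])
label = model.predict(["team", "goal"])
```

Laplace smoothing (the +1 in the numerator, with the vocabulary size in the denominator) keeps unseen words from driving a category's probability to zero.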
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Below, the text classification feature extraction apparatus of the embodiment of the present invention is described in detail.
Referring to Fig. 5, which is a structural diagram of a text classification feature extraction apparatus according to an embodiment of the present invention: the text extraction apparatus described in the embodiment can be arranged in equipment such as a text server. The training set involved is a preset collection of multiple texts, called training-set texts, whose categories are known; the feature extraction of the embodiment of the present invention is performed on the training-set texts so that a corresponding classification model can be generated from the extraction result to classify the texts of an unknown test set. The apparatus of the embodiment of the present invention can comprise:
An acquisition module 11, for obtaining the feature word set of the training-set texts;
The feature word set comprises the words or characters that can best reflect the meaning the training-set texts are intended to express. Specifically, in the embodiment of the present invention, obtaining the feature word set of the training-set texts can comprise: performing word segmentation on the training-set texts to obtain their word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, to obtain the feature word set.
A determination module 12, for determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and the preset text categories, and the word length of each feature word.
A recording module 13, for recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training-set texts.
The degree of correlation between a feature word and a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the preset text category, to the total number of documents of that category.
Meanwhile, in general, the shorter a word is, the less information it expresses, and a single character expresses the least of all; conversely, the longer a word is, the better it can reflect the text category, so word length can be introduced to score the feature words.
Since both the word length and the degree of correlation have been obtained, the word length can be taken as a weight and multiplied with the degree of correlation to obtain the feature score. Different feature words differ in degree of correlation and word length, so their feature scores also differ; the recording module 13 then deletes the feature words with smaller scores and retains those with larger scores, obtaining the text feature set.
After the determination module 12 has determined the feature scores of all feature words, the recording module 13 deletes the feature words in the feature word set whose scores are below the preset score threshold, and the remaining feature words form the text feature set of the target text.
Further, referring to Fig. 6, which is a schematic diagram of one concrete structure of the acquisition module in Fig. 5, the acquisition module 11 can specifically be realised with the following units:
A word segmentation unit 111, for performing word segmentation on the training-set texts to obtain their word set.
A deletion unit 112, for deleting the stop words in the word set to obtain the feature word set, the stop words comprising modal particles and/or personal pronouns.
The training-set texts are texts that have already been labelled with categories; they can specifically be microblog content texts, press release texts, paper texts and the like whose categories are already known. Word segmentation decomposes each sentence in the training-set texts into words or characters, converting the texts into sets of words and characters. Segmentation can be performed with existing segmentation methods and is not repeated here.
Deleting stop words comprises deleting punctuation marks and words without special significance such as modal particles and personal pronouns. Such stop words may occur in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
More specifically, the deletion unit 112 can further comprise the following subunits:
A comparison subunit 1121, for comparing each word in the word set with the preset stop words in a preset stop-word dictionary.
A deletion subunit 1122, for deleting, according to the comparison result, the words in the word set that are identical to the preset stop words, obtaining the feature word set.
The stop words in the stop-word dictionary are entered by the user in advance and comprise words without special significance, including all kinds of auxiliary words, personal pronouns and the like. Equipment such as a text server deletes the corresponding words from the target text's word set by comparing them one by one, obtaining the feature word set, i.e. the first feature word set.
Further, referring to Fig. 7, which is a schematic diagram of one concrete structure of the determination module in Fig. 5, the determination module 12 can specifically comprise the following units:
A first determining unit 131, for determining the degree of correlation between each feature word in the feature word set and each preset text category.
A second determining unit 132, for determining the length weight of each feature word according to its word length.
A third determining unit 133, for determining the feature score of each feature word according to its degree of correlation and length weight.
The first determining unit 131 can determine the degree of correlation between a feature word and a text category as the ratio of the number of documents containing the feature word, among all texts known to belong to that category, to the total number of documents.
In the embodiment of the present invention, the degree of correlation can be computed as:
R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|;
Wherein, R_jk denotes the degree of correlation between feature word t_k and text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in text category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in text category C_j.
For the second determining unit 132: in general, the shorter a word is, the less information it expresses, and a single character expresses the least of all; conversely, the longer a word is, the better it can reflect the text category, so word length can be introduced to score the feature words.
In the embodiment of the present invention, the length weight can be computed as:
weight(len(t_k)) = log(e + len(t_k));
Wherein, e is a preset constant obtained by the user from classification experience, and len(t_k) is the length of feature word t_k.
Since both the word length and the degree of correlation have been obtained, the third determining unit 133 can, for example, take the word length as a weight and multiply it with the degree of correlation to obtain the feature score. Different feature words differ in degree of correlation and word length, so their feature scores also differ; the feature words with smaller scores can then be deleted according to the feature score and those with larger scores retained, obtaining the text feature set, i.e. the second feature word set.
Specifically, in the embodiment of the present invention, the third determining unit 133 can determine the feature score according to the following formulas.
First, according to the degree of correlation of each feature word, its class discrimination ability in each corresponding text category is determined. In this embodiment, the class discrimination ability is computed as:
Diff_jk = min(|R_jk - R_ik|), i ≠ j;
Wherein, Diff_jk denotes the class discrimination ability of feature word t_k on text category C_j, R_jk denotes the degree of correlation between t_k and text category C_j, and R_ik denotes the degree of correlation between t_k and text category C_i.
The class discrimination ability value characterizes the difference between a feature word's representative ability in one category and in the other categories; the larger the difference, the stronger the feature word's ability to distinguish that category from the others.
Secondly, the sum of the class discrimination abilities of the feature word over all preset text categories is determined. In this embodiment, the class discrimination ability sum is computed as:
Diff_k = Σ_{j=1..n} Diff_jk;
Wherein, Diff_k is the sum of the class discrimination abilities of feature word t_k over all n preset text categories.
Finally, the feature score of each feature word is determined from its class discrimination ability sum and length weight. In this embodiment, the feature score is computed as:
f(t_k) = Diff_k × weight(len(t_k));
Wherein, f(t_k) is the feature score of feature word t_k.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Moreover, by determining each feature word's class discrimination ability in each category from the degree of correlation, and then determining and screening the feature scores from the class discrimination ability sum and the length weight, the feature words characterizing the target text category information in the training-set texts can be extracted more accurately, further guaranteeing the accuracy of feature word extraction.
Referring now to Fig. 8, which is a structural diagram of a text classification apparatus according to an embodiment of the present invention: the text extraction apparatus described in the embodiment can be arranged in equipment such as a text server, so that after the features of a target text have been extracted according to the text classification feature extraction method, the classification of that target text can be completed. Specifically, the apparatus of the embodiment of the present invention can comprise: a feature extraction module 21, an acquisition module 22, a vector determination module 23 and a classification module 24.
The feature extraction module 21 is for obtaining the feature word set of each text in the training set, merging them with de-duplication to form the feature word set of the training set; determining the feature score of each feature word according to the degree of correlation between each feature word in the training set's feature word set and the preset text categories, and the word length of each feature word; and recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training set.
Specifically, the feature extraction module 21 can comprise the acquisition module 11 of the text classification feature extraction apparatus embodiment above to obtain the feature word set of each training-set text, remove the repeated words after merging, and then perform the determining and recording processing with the determination module 12 and the recording module 13, completing the obtaining of the text feature set of each text in the training set.
The acquisition module 22 is for obtaining the feature word set of each test-set text according to the text feature set of the training set.
The acquisition module 22 can obtain the feature word set of each test-set text specifically by: performing word segmentation on the test-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, while also deleting, according to the feature words in the text feature set of the training set, any words in the test-set text's word set that do not exist in that text feature set, to obtain the feature word set. The test set comprises one or more texts whose categories need to be determined, called test-set texts; these can include microblog content, press releases, paper texts and other texts of unknown category.
The vector determination module 23 is for performing text vectorisation according to the text feature set of the training set and the feature word set of each test-set text, obtaining the text vector of each text in the training set and of each text in the test set, and forming the text vector set of the training set and the text vector set of the test set;
A classification module 24, for generating a text classification model from the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, obtaining the category of each text in the test set.
Specifically, referring to Fig. 9, which is a schematic diagram of one concrete structure of the vector determination module in Fig. 8: the vector determination module 23 can use prior-art vectorisation to operate on the text classification feature vectors, and can specifically comprise:
An index assignment unit 231, for assigning an index to each feature word in the training set's text feature set and in the feature word set of each test-set text;
A weight determining unit 232, for determining, according to the training set's text feature set, the weight of each feature word in each training-set text's feature set, and determining the weight of each feature word in each test-set text's feature word set, wherein the weighting algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
A vector determining unit 233, for generating vectors from the indexes and weights of the feature words, obtaining the text vector of each text in the training set and the test set, and forming the text vector sets of the training set and the test set.
Further, referring to Figure 10, which is a schematic diagram of one concrete structure of the classification module in Fig. 8, the classification module 24 can comprise:
A model generation unit 241, for normalising each text vector in the training set's text vector set so that the weight of each feature item in each text vector is projected into a preset numerical range, and generating the text classification model according to the normalised text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
A first classification unit 242, for classifying each text vector in the test set's text vector set according to the generated model, obtaining the category of each test-set text.
Further, the classification module 24 can also comprise:
A second classification unit 243, for generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a naive Bayes classification model, and classifying each text vector in the test set's text vector set according to the generated model, obtaining the category of each test-set text.
The classification module 24 can comprise the above model generation unit 241, first classification unit 242 and second classification unit 243 simultaneously, so that the classification of the target text can be performed based on either the SVM classification model or the naive Bayes model as required. Of course, it can also comprise only the model generation unit 241 and the first classification unit 242, or only the second classification unit 243, so as to perform the classification of the target text based on the SVM classification model alone or the naive Bayes model alone.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
One of ordinary skill in the art will appreciate that all or part of the flows of the above embodiment methods can be accomplished by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot thereby limit the scope of rights of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (21)

1. A text classification feature extraction method, characterized by comprising:
obtaining the feature word set of training-set texts;
determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and preset text categories, and the word length of the feature word;
recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training-set texts.
2. The extraction method of claim 1, characterized in that obtaining the feature word set of the training-set texts comprises:
performing word segmentation on the training-set texts to obtain the word set of the training-set texts;
deleting the stop words in the word set to obtain the feature word set, the stop words in the word set comprising modal particles and/or personal pronouns.
3. The method of claim 2, characterized in that deleting the stop words in the word set to obtain the feature word set comprises:
comparing each word in the word set with the preset stop words in a preset stop-word dictionary;
deleting, according to the comparison result, the words in the word set that are identical to the preset stop words, obtaining the feature word set.
4. The method of any one of claims 1-3, characterized in that determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and the preset text categories, and the word length of the feature word, comprises:
determining the degree of correlation between each feature word in the feature word set and each preset text category;
determining the length weight of each feature word according to its word length;
determining the feature score of each feature word according to its degree of correlation and length weight.
5. The method of claim 4, characterized in that determining the feature score of each feature word according to its degree of correlation and length weight comprises:
determining, according to the degree of correlation of a feature word, the class discrimination ability of the feature word in each corresponding text category;
determining the sum of the class discrimination abilities of the feature word over all preset text categories;
determining the feature score of each feature word according to the sum of its class discrimination abilities and its length weight.
6. The method of claim 5, characterized in that, in determining the degree of correlation between each feature word in the feature word set and the preset text categories, the degree of correlation is computed by the formula:
R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|;
where R_jk denotes the degree of correlation between feature word t_k and text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in text category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in text category C_j.
7. The method of claim 4, characterized in that, in determining the length weight of each feature word according to its word length, the length weight is computed by the formula:
weight(len(t_k)) = log(e + len(t_k));
where e is the preset natural constant (the base of the natural logarithm) and len(t_k) is the length of feature word t_k.
8. The method of claim 7, characterized in that:
in determining, according to the degree of correlation of each feature word, the class discrimination ability of each feature word in the corresponding text categories, the class discrimination ability is computed by the formula:
Diff_jk = min_{i≠j}(|R_jk - R_ik|);
where Diff_jk denotes the class discrimination ability of feature word t_k in text category C_j, R_jk denotes the degree of correlation between feature word t_k and text category C_j, and R_ik denotes the degree of correlation between feature word t_k and text category C_i;
in determining the sum of the class discrimination abilities of the feature word over all preset text categories, the sum is computed by the formula:
Diff_k = Σ_{j=1..n} Diff_jk;
where Diff_k is the sum of the class discrimination abilities of feature word t_k over all preset text categories;
in determining the feature score of each feature word according to the sum of the class discrimination abilities and the length weight, the feature score is computed by the formula:
f(t_k) = Diff_k × weight(len(t_k));
where f(t_k) is the feature score of feature word t_k.
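Taken together, the formulas of claims 6-8 can be sketched over a toy corpus as follows. The function names and the two-category corpus are invented for illustration; only the three formulas themselves come from the claims.

```python
# Sketch of the scoring formulas in claims 6-8 over an invented toy corpus.
import math

def relevance(docs_by_cat, cat, term):
    """R_jk: fraction of documents in category `cat` that contain `term`."""
    docs = docs_by_cat[cat]
    return sum(term in doc for doc in docs) / len(docs)

def length_weight(term):
    """weight(len(t_k)) = log(e + len(t_k))."""
    return math.log(math.e + len(term))

def feature_score(docs_by_cat, term):
    """f(t_k) = Diff_k * weight(len(t_k)), where Diff_k sums the per-category
    discrimination abilities Diff_jk = min over i != j of |R_jk - R_ik|."""
    cats = list(docs_by_cat)
    R = {c: relevance(docs_by_cat, c, term) for c in cats}
    diff_k = sum(min(abs(R[j] - R[i]) for i in cats if i != j) for j in cats)
    return diff_k * length_weight(term)

corpus = {
    "sports": [{"ball", "team"}, {"ball", "score"}],
    "tech":   [{"chip", "code"}, {"code", "data"}],
}
# "ball" separates the two categories perfectly (R = 1.0 vs 0.0), so Diff_k = 2.0
print(feature_score(corpus, "ball"))
```

On this toy data the score equals 2 × log(e + 4), since "ball" has length 4 and maximal category discrimination.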
9. A text classification method, characterized in that it comprises:
separately obtaining the feature word set of each text in a training set, and merging and de-duplicating them to form the feature word set of the training set;
determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set of the training set and preset text categories, and the word length;
recording the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set;
obtaining, according to the text feature set of the training set, the feature word set of each text in a test set;
performing text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the text vector of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
generating a text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
10. The method of claim 9, characterized in that performing text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the test set, forming the text vector sets of the training set and the test set, comprises:
assigning an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
determining, according to the text feature set of the training set, the weight of each feature word of each text in the training set, and determining the weight of each feature word in the feature word set of each text in the test set, wherein the weight determination algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
generating a vector from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set respectively, yielding the text vector sets of the training set and the test set.
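The index assignment and TF-IDF weighting of claim 10 can be sketched as follows. This is a minimal hand-rolled version using the raw-count TF and log(N/df) IDF variant, one of several common TF-IDF formulations (the claim does not fix a specific one); the tokenized texts and vocabulary are invented toy data.

```python
# Sketch of claim 10: each feature word's position in `vocab` is its assigned
# index; each text becomes a vector of TF-IDF weights at those indices.
import math

def tfidf_vectors(token_lists, vocab):
    n = len(token_lists)
    # document frequency of each feature word across the text collection
    df = {w: sum(w in tokens for tokens in token_lists) for w in vocab}
    vectors = []
    for tokens in token_lists:
        vectors.append([
            tokens.count(w) * math.log(n / df[w]) if df[w] else 0.0
            for w in vocab  # vocab position = assigned feature word index
        ])
    return vectors

train_tokens = [["ball", "team", "ball"], ["chip", "code"]]
vocab = ["ball", "chip"]  # text feature set; list position is the index
print(tfidf_vectors(train_tokens, vocab))
```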
11. The method of claim 10, characterized in that generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set, comprises:
normalizing each text vector in the text vector set of the training set, so that the weight of each feature item in each text vector is projected into a preset numeric range;
generating the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
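The normalize-then-classify flow of claim 11 can be sketched as follows, assuming scikit-learn is available as a stand-in SVM implementation. The training vectors, labels, and the choice of L2 normalization are illustrative assumptions, not the patent's own parameters.

```python
# Sketch of claim 11, assuming scikit-learn: L2-normalize each training text
# vector so feature weights fall into a fixed numeric range, fit an SVM
# classification model, then classify the test vectors. All data is toy data.
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

X_train = [[2.0, 0.1], [0.0, 3.0], [4.0, 0.3], [0.2, 5.0]]
y_train = ["sports", "tech", "sports", "tech"]

X_norm = normalize(X_train)                 # each row scaled to unit length
model = LinearSVC().fit(X_norm, y_train)    # generate the classification model
X_test = normalize([[3.0, 0.2]])            # same projection for test vectors
print(model.predict(X_test))                # classify each test text vector
```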
12. The method of claim 10, characterized in that generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set, comprises:
generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a Naive Bayes classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
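The Naive Bayes variant of claim 12 differs from claim 11 only in the preset classification model. A sketch, again assuming scikit-learn and invented count vectors and labels:

```python
# Sketch of claim 12, assuming scikit-learn: the same pipeline shape as the
# SVM variant, but with multinomial Naive Bayes as the preset model.
from sklearn.naive_bayes import MultinomialNB

X_train = [[3, 0, 1], [0, 4, 1], [2, 1, 0], [0, 3, 2]]
y_train = ["sports", "tech", "sports", "tech"]

model = MultinomialNB().fit(X_train, y_train)  # text classification model
print(model.predict([[4, 0, 1]]))              # classify a test text vector
```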
13. A text classification feature extraction apparatus, characterized in that it comprises:
an acquisition module, configured to obtain the feature word set of training set texts;
a determination module, configured to determine the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and preset text categories, and the word length of each feature word;
a recording module, configured to record the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set texts.
14. The apparatus of claim 13, characterized in that the acquisition module comprises:
a word segmentation unit, configured to perform word segmentation on the training set texts to obtain the word set of the training set texts;
a deletion unit, configured to delete stop words from the word set to obtain the feature word set, the stop words in the word set comprising mood auxiliary words and/or personal pronouns.
15. The apparatus of claim 14, characterized in that the deletion unit comprises:
a comparison subunit, configured to compare each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
a deletion subunit, configured to delete, according to the comparison result, the segmented words in the word set that are identical to the preset stop words, to obtain the feature word set.
16. The apparatus of any one of claims 13-15, characterized in that the determination module comprises:
a first determining unit, configured to determine the degree of correlation between each feature word in the feature word set and each preset text category;
a second determining unit, configured to determine the length weight of each feature word according to its word length;
a third determining unit, configured to determine the feature score of each feature word according to its degree of correlation and length weight.
17. The apparatus of claim 16, characterized in that:
the third determining unit is specifically configured to: determine, according to the degree of correlation of a feature word, the class discrimination ability of the feature word in each corresponding text category; determine the sum of the class discrimination abilities of the feature word over all preset text categories; and determine the feature score of each feature word according to the sum of the class discrimination abilities and the length weight.
18. A text classification apparatus, characterized in that it comprises:
a feature extraction module, configured to separately obtain the feature word set of each text in a training set, merge and de-duplicate them to form the feature word set of the training set, determine the feature score of each feature word according to the degree of correlation between each feature word in the feature word set of the training set and preset text categories and the word length, and record the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set;
an acquisition module, configured to obtain, according to the text feature set of the training set, the feature word set of each text in a test set;
a vector determination module, configured to perform text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the text vector of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
a classification module, configured to generate a text classification model according to the text vector set of the training set, and classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
19. The apparatus of claim 18, characterized in that the vector determination module comprises:
an index assignment unit, configured to assign an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
a weight determination unit, configured to determine, according to the text feature set of the training set, the weight of each feature word of each text in the training set, and to determine the weight of each feature word in the feature word set of each text in the test set, wherein the weight determination algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
a vector determination unit, configured to generate a vector from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set respectively, yielding the text vector sets of the training set and the test set.
20. The apparatus of claim 19, characterized in that the classification module comprises:
a model generation unit, configured to normalize each text vector in the text vector set of the training set so that the weight of each feature item in each text vector is projected into a preset numeric range, and to generate the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
a first classification unit, configured to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
21. The apparatus of claim 19, characterized in that the classification module comprises:
a second classification unit, configured to generate the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a Naive Bayes classification model, and to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
CN201210578378.0A 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device Active CN103902570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210578378.0A CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210578378.0A CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Publications (2)

Publication Number Publication Date
CN103902570A true CN103902570A (en) 2014-07-02
CN103902570B CN103902570B (en) 2018-11-09

Family

ID=50993898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210578378.0A Active CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Country Status (1)

Country Link
CN (1) CN103902570B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101477566A (en) * 2009-01-19 2009-07-08 腾讯科技(深圳)有限公司 Method and apparatus used for putting candidate key words advertisement
CN101477544A (en) * 2009-01-12 2009-07-08 腾讯科技(深圳)有限公司 Rubbish text recognition method and system
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts
US20120159263A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Temporal rule-based feature definition and extraction


Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354716A (en) * 2015-07-17 2017-01-25 华为技术有限公司 Method and device for converting text
CN106354716B (en) * 2015-07-17 2020-06-02 华为技术有限公司 Method and apparatus for converting text
CN105354184B (en) * 2015-10-28 2018-04-20 甘肃智呈网络科技有限公司 A kind of vector space model using optimization realizes the method that document is classified automatically
CN105354184A (en) * 2015-10-28 2016-02-24 甘肃智呈网络科技有限公司 Method for using optimized vector space model to automatically classify document
CN105447750A (en) * 2015-11-17 2016-03-30 小米科技有限责任公司 Information identification method, apparatus, terminal and server
CN105447750B (en) * 2015-11-17 2022-06-03 小米科技有限责任公司 Information identification method and device, terminal and server
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device
CN106874295A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 A kind of method and device for determining service parameter
CN105574105A (en) * 2015-12-14 2016-05-11 北京锐安科技有限公司 Text classification model determining method
CN105574105B (en) * 2015-12-14 2019-05-28 北京锐安科技有限公司 A kind of determination method of textual classification model
CN107045503B (en) * 2016-02-05 2019-03-05 华为技术有限公司 A kind of method and device that feature set determines
US11461659B2 (en) 2016-02-05 2022-10-04 Huawei Technologies Co., Ltd. Feature set determining method and apparatus
WO2017133188A1 (en) * 2016-02-05 2017-08-10 华为技术有限公司 Method and device for determining feature set
CN107045503A (en) * 2016-02-05 2017-08-15 华为技术有限公司 The method and device that a kind of feature set is determined
CN105930358A (en) * 2016-04-08 2016-09-07 南方电网科学研究院有限责任公司 Case retrieval method and system based on relevance
CN105930358B (en) * 2016-04-08 2019-06-04 南方电网科学研究院有限责任公司 Case retrieval method and system based on relevance
CN105956031A (en) * 2016-04-25 2016-09-21 深圳市永兴元科技有限公司 Text classification method and apparatus
CN106067037A (en) * 2016-05-27 2016-11-02 大连楼兰科技股份有限公司 DTC identification and classification stage
CN106056154A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Fault code recognition and classification method
CN106528776A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text classification method and device
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN106682411B (en) * 2016-12-22 2019-04-16 浙江大学 A method of disease label is converted by physical examination diagnostic data
CN106709370A (en) * 2016-12-31 2017-05-24 北京明朝万达科技股份有限公司 Long word identification method and system based on text contents
CN106709370B (en) * 2016-12-31 2019-10-29 北京明朝万达科技股份有限公司 A kind of long word recognition method and system based on content of text
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extracting method, file classification method and device
CN106897428B (en) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method and text classification method and device
CN107798033B (en) * 2017-03-01 2021-07-02 中南大学 Case text classification method in public security field
CN107092679B (en) * 2017-04-21 2020-01-03 北京邮电大学 Feature word vector obtaining method and text classification method and device
CN107092679A (en) * 2017-04-21 2017-08-25 北京邮电大学 A kind of feature term vector preparation method, file classification method and device
CN107908783A (en) * 2017-12-07 2018-04-13 百度在线网络技术(北京)有限公司 Retrieve appraisal procedure, device, server and the storage medium of text relevant
CN108228869A (en) * 2018-01-15 2018-06-29 北京奇艺世纪科技有限公司 The method for building up and device of a kind of textual classification model
CN108491406A (en) * 2018-01-23 2018-09-04 深圳市阿西莫夫科技有限公司 Information classification approach, device, computer equipment and storage medium
CN108595542A (en) * 2018-04-08 2018-09-28 北京奇艺世纪科技有限公司 A kind of textual classification model generates, file classification method and device
CN108595542B (en) * 2018-04-08 2021-11-02 北京奇艺世纪科技有限公司 Text classification model generation method and device, and text classification method and device
CN108520740A (en) * 2018-04-13 2018-09-11 国家计算机网络与信息安全管理中心 Based on manifold audio content consistency analysis method and analysis system
CN108984518A (en) * 2018-06-11 2018-12-11 人民法院信息技术服务中心 A kind of file classification method towards judgement document
CN109063217B (en) * 2018-10-29 2020-11-03 广东电网有限责任公司广州供电局 Work order classification method and device in electric power marketing system and related equipment thereof
WO2021042516A1 (en) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Named-entity recognition method and device, and computer readable storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111708888A (en) * 2020-06-16 2020-09-25 腾讯科技(深圳)有限公司 Artificial intelligence based classification method, device, terminal and storage medium
CN111708888B (en) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Classification method, device, terminal and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN103902570B (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN103902570A (en) Text classification feature extraction method, classification method and device
Putri et al. Latent Dirichlet allocation (LDA) for sentiment analysis toward tourism review in Indonesia
CN104391835B (en) Feature Words system of selection and device in text
US9779085B2 (en) Multilingual embeddings for natural language processing
Hamouda et al. Sentiment analyzer for arabic comments system
Durant et al. Predicting the political sentiment of web log posts using supervised machine learning techniques coupled with feature selection
WO2011085562A1 (en) System and method for automatically extracting metadata from unstructured electronic documents
CN103593431A (en) Internet public opinion analyzing method and device
CN109165529B (en) Dark chain tampering detection method and device and computer readable storage medium
Abid et al. Spam SMS filtering based on text features and supervised machine learning techniques
CN110990676A (en) Social media hotspot topic extraction method and system
CN104462229A (en) Event classification method and device
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
Xia et al. Improving patient opinion mining through multi-step classification
Mekala et al. A Novel Document Representation Approach for Authorship Attribution.
Siddiqui et al. Quality Prediction of Wearable Apps in the Google Play Store.
CN107315807B (en) Talent recommendation method and device
CN109359274A (en) The method, device and equipment that the character string of a kind of pair of Mass production is identified
Guo Social network rumor recognition based on enhanced naive bayes
CN111611394B (en) Text classification method and device, electronic equipment and readable storage medium
Yousef et al. TopicsRanksDC: distance-based topic ranking applied on two-class data
KR101240330B1 (en) System and method for mutidimensional document classification
Moohebat et al. Linguistic feature classifying and tracing
Smith et al. Classification of text to subject using LDA
Sharif et al. A scoping review of topic modelling on online data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant