CN103902570A - Text classification feature extraction method, classification method and device - Google Patents


Info

Publication number
CN103902570A
CN103902570A (application CN201210578378.0A; granted as CN103902570B)
Authority
CN
China
Prior art keywords
text
feature
words
feature words
vector
Prior art date
Legal status
Granted
Application number
CN201210578378.0A
Other languages
Chinese (zh)
Other versions
CN103902570B (en)
Inventor
李鑫 (Li Xin)
张延祥 (Zhang Yanxiang)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201210578378.0A (granted as CN103902570B)
Publication of CN103902570A
Application granted
Publication of CN103902570B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a text classification feature extraction method, a classification method, and a device. The feature extraction method includes: obtaining a feature word set of the training set texts; determining a feature score for each feature word according to its relevance to the preset text categories; and recording the feature words whose scores exceed a preset score threshold, to obtain the text feature set of the training set texts. With this method, the number of feature words can be effectively reduced while feature words that express the text information are still obtained, so that when texts are classified the classification run time is shortened, computation time and space overhead are reduced, and computation cost is saved.

Description

Text classification feature extraction method, classification method, and device
Technical field
The present invention relates to the field of text classification, and in particular to a text classification feature extraction method, classification method, and device.
Background
With the rapid development of Internet technology, the quantity of network text has grown explosively, and effectively managing these texts is a current hot issue. Text classification, as a key technique for managing mass data, is widely used.
Current statistics-based text classification methods learn from already-classified texts and can then classify new example texts reasonably well. To classify a new example, the example text must first undergo word segmentation to obtain a word set containing a number of words, and classification is then performed on all the words in that set. The inventors found, when implementing the prior art, that when the example text is long and segmentation yields a large number of words, the classification performance of this approach is poor.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a text classification feature extraction method, classification method, and device that can improve classification performance.
To solve the above technical problem, an embodiment of the present invention provides a text classification feature extraction method, comprising:
obtaining a feature word set of the training set text;
determining a feature score for each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word;
recording the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
Obtaining the feature word set of the training set text comprises:
performing word segmentation on the training set text to obtain the word set of the training set text;
deleting the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
Deleting the stop words in the word set to obtain the feature word set comprises:
comparing each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
deleting, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
Determining the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word comprises:
determining the relevance of each feature word in the feature word set to each preset text category;
determining a length weight for each feature word according to its word length;
determining the feature score of each feature word according to its relevance and length weight.
Determining the feature score of each feature word according to its relevance and length weight comprises:
determining, according to the relevance of the feature word, the class discrimination ability of the feature word in each corresponding text category;
determining the sum of the class discrimination abilities of the feature word over all preset text categories;
determining the feature score of each feature word according to the class discrimination ability sum and the length weight.
When determining the relevance of each feature word in the feature word set to the preset text categories, the relevance is computed as:

R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|

where R_jk denotes the relevance of feature word t_k to text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in category C_j.
When determining the length weight of each feature word according to its word length, the length weight is computed as:

weight(len(t_k)) = log(e + len(t_k))

where e is a preset natural value and len(t_k) is the length of feature word t_k.
When determining, according to the relevance of each feature word, the class discrimination ability of the feature word in each corresponding text category, the class discrimination ability is computed as:

Diff_jk = min(|R_jk − R_ik|), i ≠ j

where Diff_jk denotes the class discrimination ability of feature word t_k in text category C_j, R_jk denotes the relevance of feature word t_k to text category C_j, and R_ik denotes the relevance of feature word t_k to text category C_i.

When determining the sum of the class discrimination abilities of a feature word over all preset text categories, the sum is computed as:

Diff_k = Σ_{j=1}^{n} Diff_jk

where Diff_k is the sum of the class discrimination abilities of feature word t_k over all preset text categories.

When determining the feature score of each feature word according to the class discrimination ability sum and the length weight, the feature score is computed as:

f(t_k) = Diff_k × weight(len(t_k))

where f(t_k) is the feature score of feature word t_k.
Correspondingly, an embodiment of the present invention also provides a text classification method, comprising:
obtaining the feature word set of each text in the training set, and merging and deduplicating them to form the feature word set of the training set;
determining the feature score of each feature word according to the relevance of each feature word in the training set's feature word set to the preset text categories and its word length;
recording the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set;
obtaining, according to the text feature set of the training set, the feature word set of each text in the test set;
performing a text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
generating a text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
Performing the text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set comprises:
assigning an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
determining, according to the text feature set of the training set, the weight of each feature word in the text feature set of each text in the training set, and determining the weight of each feature word in the feature word set of each text in the test set, the weighting algorithm including the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
generating vectors from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set, yielding the text vector sets of the training set and the test set.
Generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, comprises:
normalizing each text vector in the text vector set of the training set so that the weight of each feature item in each text vector is projected into a preset numeric range;
generating the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model including a support vector machine (SVM) classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
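A minimal sketch of the normalization sub-step, assuming L2 normalization as the projection into a fixed range; the patent only specifies "a preset numeric range", so the exact scheme is an assumption.

```python
import math

# Normalize a sparse {index: weight} text vector so that its weights fall
# into a fixed range before SVM training. L2 normalization is one common
# choice; the patent does not fix the scheme.
def l2_normalize(vec):
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0:
        return vec  # an all-zero vector is left unchanged
    return {i: w / norm for i, w in vec.items()}

print(l2_normalize({0: 3.0, 1: 4.0}))  # {0: 0.6, 1: 0.8}
```

After normalization, the vectors would be fed to an SVM trainer (e.g. an off-the-shelf library) to produce the classification model.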
Alternatively, generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, comprises:
generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model including a naive Bayes classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
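For the naive Bayes alternative, a compact sketch with add-one smoothing is given below; the two-document training set, the category names, and the smoothing choice are all illustrative, not from the patent.

```python
import math
from collections import Counter

# Tiny multinomial naive Bayes: class prior times smoothed per-word
# likelihoods, compared in log space. Training data are illustrative.
train = [(["stock", "fund"], "finance"), (["game", "player"], "sports")]
vocab = {w for doc, _ in train for w in doc}

def predict(doc):
    best, best_lp = None, -math.inf
    for label in {l for _, l in train}:
        counts = Counter(w for d, l in train for w in d if l == label)
        total = sum(counts.values())
        prior = sum(1 for _, l in train if l == label) / len(train)
        lp = math.log(prior) + sum(
            math.log((counts[w] + 1) / (total + len(vocab)))  # add-one smoothing
            for w in doc
        )
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict(["stock"]))  # → finance
```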
Correspondingly, an embodiment of the present invention also provides a text classification feature extraction device, comprising:
an acquisition module, configured to obtain the feature word set of the training set text;
a determination module, configured to determine the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word;
a recording module, configured to record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
The acquisition module comprises:
a word segmentation unit, configured to perform word segmentation on the training set text to obtain the word set of the training set text;
a deletion unit, configured to delete the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
The deletion unit comprises:
a comparison subunit, configured to compare each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
a deletion subunit, configured to delete, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
The determination module comprises:
a first determining unit, configured to determine the relevance of each feature word in the feature word set to each preset text category;
a second determining unit, configured to determine the length weight of each feature word according to its word length;
a third determining unit, configured to determine the feature score of each feature word according to its relevance and length weight.
The third determining unit is specifically configured to determine, according to the relevance of a feature word, its class discrimination ability in each corresponding text category, determine the sum of its class discrimination abilities over all preset text categories, and determine the feature score of each feature word according to the class discrimination ability sum and the length weight.
Correspondingly, an embodiment of the present invention also provides a text classification device, comprising:
a feature extraction module, configured to obtain the feature word set of each text in the training set, merge and deduplicate them to form the feature word set of the training set, determine the feature score of each feature word according to the relevance of each feature word in the training set's feature word set to the preset text categories and its word length, and record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set;
an acquisition module, configured to obtain, according to the text feature set of the training set, the feature word set of each text in the test set;
a vector determination module, configured to perform the text vectorization operation according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and in the test set, forming the text vector sets of the training set and the test set;
a classification module, configured to generate a text classification model according to the text vector set of the training set, and to classify each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
The vector determination module comprises:
an index assignment unit, configured to assign an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
a weight determining unit, configured to determine, according to the text feature set of the training set, the weight of each feature word in the text feature set of each text in the training set, and to determine the weight of each feature word in the feature word set of each text in the test set, the weighting algorithm including the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
a vector determining unit, configured to generate vectors from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set, yielding the text vector sets of the training set and the test set.
The classification module comprises:
a model generation unit, configured to normalize each text vector in the text vector set of the training set so that the weight of each feature item is projected into a preset numeric range, and to generate the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model including a support vector machine (SVM) classification model;
a first classification unit, configured to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
Alternatively, the classification module comprises:
a second classification unit, configured to generate the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model including a naive Bayes classification model, and to classify each text vector in the text vector set of the test set according to the generated model, to obtain the category of each text in the test set.
Implementing the embodiments of the present invention has the following beneficial effects:
After word segmentation yields the word set, the embodiments of the present invention further perform feature extraction on the word set according to the relevance of each feature word to the text categories and the length of each feature word. The number of feature words can thus be effectively reduced while feature words that express the text information are still obtained, which shortens classification run time, reduces the time and space overhead of classification, and lowers classification cost.
Brief description of the drawings
To illustrate the embodiments of the present invention or the prior-art technical solutions more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a text classification feature extraction method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of another text classification feature extraction method according to an embodiment of the present invention;
Fig. 3 is a flow diagram of one concrete method of determining the feature score according to an embodiment of the present invention;
Fig. 4 is a flow diagram of a text classification method according to an embodiment of the present invention;
Fig. 5 is a structural diagram of a text classification feature extraction device according to an embodiment of the present invention;
Fig. 6 is a structural diagram of one concrete form of the acquisition module in Fig. 5;
Fig. 7 is a structural diagram of one concrete form of the determination module in Fig. 5;
Fig. 8 is a structural diagram of a text classification device according to an embodiment of the present invention;
Fig. 9 is a structural diagram of one concrete form of the vector determination module in Fig. 8;
Fig. 10 is a structural diagram of one concrete form of the classification module in Fig. 8.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, a flow diagram of a text classification feature extraction method according to an embodiment of the present invention. The method can be applied in various text application servers. The training set involved is a preset collection of multiple texts, referred to as the training set texts, whose categories are known. Feature extraction according to this embodiment is performed on the training set texts so that a classification model can be generated from the extraction result and used to classify the texts of an unknown test set. Specifically, the method comprises:
S101: obtain the feature word set of the training set text.
The feature word set comprises the words that best reflect the meaning the training set text is intended to express. Specifically, in this embodiment, obtaining the feature word set of the training set text may comprise: performing word segmentation on the training set text to obtain its word set; and deleting the stop words in the word set, including modal particles and/or personal pronouns, to obtain the feature word set.
The training set text is text whose category has already been labeled, serving as the reference for classifying test set texts of unknown categories; it may be, for example, microblog posts, press releases, or papers whose categories are known. Word segmentation decomposes each sentence of the training set text into words, converting the text into a set of words. Segmentation can be performed with existing segmentation methods, which are not repeated here.
Deleting stop words includes deleting punctuation marks and words without special significance such as modal particles and personal pronouns. These stop words may appear in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
S102: determine the feature score of each feature word according to the relevance of each feature word in the feature word set to the preset text categories and the word length of the feature word.
The relevance of a feature word to a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the preset category, to the total number of documents in that category.
Meanwhile, in general, the shorter a word is, the less information it expresses, and a single character expresses even less. The longer a word is, the better it can reflect the text category, so word length can be introduced into the feature scoring of feature words.
Since the score is derived from the word length and the relevance, the word length can be used as a weight and multiplied with the relevance to obtain the feature score. Different feature words have different relevances and word lengths, so their feature scores also vary, allowing the following step S103 to delete the feature words with lower scores and keep those with higher scores, to obtain the text feature set.
S103: record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
After the feature score of each feature word has been determined, the feature words in the feature word set whose scores are below the preset score threshold are deleted, and the remaining feature words form the text feature set of the training set text.
After word segmentation yields the word set, this embodiment further performs feature extraction on the word set according to the relevance of each feature word to the text categories and the length of each feature word. The number of feature words can thus be effectively reduced while feature words that express the text information are still obtained, which shortens classification run time, reduces the time and space overhead of classification, and lowers classification cost.
Referring to Fig. 2, a flow diagram of another text classification feature extraction method according to an embodiment of the present invention. This method can likewise be applied in various text application servers, so that after extracting features with it the server generates a classification model and classifies test set texts of unknown type. Specifically, the method comprises:
S201: perform word segmentation on the training set text to obtain the word set of the training set text.
S202: delete the stop words in the word set to obtain the feature word set, the stop words including modal particles and/or personal pronouns.
Deleting stop words includes deleting punctuation marks and words without special significance such as modal particles and personal pronouns. These stop words may appear in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
In this embodiment, S202 may specifically comprise: comparing each segmented word in the word set with the preset stop words in a preset stop-word dictionary; and deleting, according to the comparison result, the words in the word set that match a preset stop word, to obtain the feature word set.
The stop words in the stop-word dictionary are entered in advance by the user and include words without special significance such as various particles and personal pronouns. Devices such as the text server delete the corresponding words from the word set of the training set text by comparing them one by one, obtaining the feature word set, i.e. the first feature word set.
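As a sketch of this comparison-and-deletion step, assuming a tiny hand-entered stop-word dictionary; the entries below are illustrative particles and pronouns, not the patent's dictionary.

```python
# Stop-word filtering: drop every segmented token found in the preset
# stop-word dictionary. Dictionary entries here are illustrative.
STOP_WORDS = {"的", "了", "吗", "我", "你", "他"}

def filter_stop_words(tokens):
    """Return the feature word set: tokens absent from the stop dictionary."""
    return [t for t in tokens if t not in STOP_WORDS]

print(filter_stop_words(["我", "喜欢", "自然", "语言", "处理", "了"]))
# → ['喜欢', '自然', '语言', '处理']
```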
S203: determine the relevance of each feature word in the feature word set to each preset text category.
The relevance of a feature word to a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the category, to the total number of documents.
In this embodiment, the relevance may be computed concretely as:

R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|

where R_jk denotes the relevance of feature word t_k to text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in category C_j.
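The formula can be read directly as a document-count ratio; the sketch below computes it for made-up documents (represented as token sets) in one category.

```python
# R_jk: fraction of documents in category C_j that contain feature word t_k.
def relevance(word, docs_in_category):
    containing = sum(1 for doc in docs_in_category if word in doc)
    return containing / len(docs_in_category)

category_docs = [  # documents of one category, as token sets (illustrative)
    {"stock", "market", "fund"},
    {"stock", "price"},
    {"game", "player"},
]
print(relevance("stock", category_docs))  # 2 of 3 documents contain "stock"
```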
S204: determine the length weight of each feature word according to its word length.
In general, the shorter a word is, the less information it expresses, and a single character expresses even less. The longer a word is, the better it can reflect the text category, so word length can be introduced into the feature scoring of feature words.
In this embodiment, the length weight may be computed concretely as:

weight(len(t_k)) = log(e + len(t_k))

where e is a preset natural value obtained from the user's classification experience, and len(t_k) is the length of feature word t_k.
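A one-function sketch of the length weight. Euler's number is used for the preset value e here, but the patent treats e as a value chosen from classification experience, so that choice is an assumption.

```python
import math

# weight(len(t_k)) = log(e + len(t_k)); longer words get larger weights.
E_PRESET = math.e  # preset value; the patent leaves it to experience

def length_weight(word):
    return math.log(E_PRESET + len(word))

print(length_weight("abc") > length_weight("a"))  # True: longer word, larger weight
```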
S205: determine the feature score of each feature word according to its relevance and length weight.
Since the score is derived from the word length and the relevance, the word length can be used as a weight and multiplied with the relevance to obtain the feature score. Different feature words have different relevances and word lengths, so their feature scores also vary, allowing the lower-scored feature words to be deleted and the higher-scored ones kept, yielding the text feature set, i.e. the second feature word set.
S206: record the feature words whose feature scores exceed a preset score threshold, to obtain the text feature set of the training set text.
After the feature score of each feature word has been determined, the feature words in the feature word set whose scores are below the preset score threshold are deleted, and the remaining feature words form the text feature set of the training set text.
Further concrete, then refer to Fig. 3, be wherein a kind of concrete schematic flow sheet of determining method of the feature score value of the embodiment of the present invention; In embodiments of the present invention, the concrete of described feature score value determines that method comprises:
S301: according to the degree of correlation of each feature word, determine its class discrimination ability in each corresponding text category.
In this embodiment, the class discrimination ability is computed as:
Diff_jk = min(|R_jk - R_ik|), i ≠ j;
Wherein, Diff_jk denotes the class discrimination ability of feature word t_k on text category C_j, R_jk denotes the degree of correlation between t_k and text category C_j, and R_ik denotes the degree of correlation between t_k and text category C_i.
The class discrimination ability value characterizes the difference between a feature word's representative ability in one category and in the other categories; the larger the difference, the stronger the feature word's ability to distinguish that category from the others.
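A minimal sketch of this per-category discrimination value, with made-up relevance numbers:

```python
def discrimination(relevances, j):
    """Diff_jk: smallest absolute gap between category j's relevance
    for a word and that word's relevance in every other category."""
    return min(abs(relevances[j] - r)
               for i, r in enumerate(relevances) if i != j)

# Made-up relevance values of one word in three categories.
rels = [0.9, 0.2, 0.1]
d_first = discrimination(rels, 0)  # min(0.7, 0.8) = 0.7: strongly tied to category 0
```

Taking the minimum gap is conservative: a word only scores well for a category when it separates that category from the closest competitor, not just from the average.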
S302: determine the sum of the class discrimination abilities of each feature word over all preset text categories.
In this embodiment, the class discrimination ability sum is computed as:
Diff_k = Σ_{j=1..n} Diff_jk;
Wherein, Diff_k is the sum of the class discrimination abilities of feature word t_k over all n preset text categories.
S303: determine the feature score of each feature word according to its class discrimination ability sum and length weight.
In this embodiment, the feature score is computed as:
f(t_k) = Diff_k × weight(len(t_k));
Wherein, f(t_k) is the feature score of feature word t_k.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Moreover, by determining each feature word's class discrimination ability in each category from the degree of correlation, and then determining and screening the feature scores from the class discrimination ability sum and the length weight, the feature words characterizing the target text category information in the training-set texts can be extracted more accurately, further guaranteeing the accuracy of feature word extraction.
Below, the text classification method of the embodiment of the present invention is described in detail.
Referring to Fig. 4, which is a flow diagram of a text classification method according to an embodiment of the present invention: the method first extracts the feature words of the target text with the text classification feature extraction method described above, and then classifies according to those feature words. The training set involved is a preset collection of multiple texts, called training-set texts, whose categories are known; the feature extraction of the embodiment of the present invention is performed on the training-set texts so that a corresponding classification model can be generated from the extraction result to classify the texts of an unknown test set. The method specifically comprises:
S401: obtain the feature word set of each text in the training set, and merge them with de-duplication to form the feature word set of the training set;
The training set comprises multiple texts that have already been labelled with categories, serving as a reference for classifying test-set texts of unknown category. After the feature word set of each text in the training set is obtained, the per-text feature word sets are processed: duplicated words are removed so that only one copy of each is retained, forming the feature word set of the whole training set.
Obtaining the feature word set of each training-set text in S401 specifically comprises: performing word segmentation on the training-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, to obtain the feature word set. Feature word extraction is performed on each text in the training set to obtain the text feature set of the training set.
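A toy sketch of this S401 preprocessing; whitespace splitting stands in for a real Chinese word segmenter, and the English stop-word list below is hypothetical:

```python
# Hypothetical stop-word list; the patent's stop words are modal
# particles and personal pronouns removed after Chinese segmentation.
STOP_WORDS = {"the", "a", "of", "i", "you"}

def feature_words(text):
    """Segment (here: whitespace split) and drop stop words."""
    return {t for t in text.lower().split() if t not in STOP_WORDS}

# Merge the per-text sets, removing duplicates, to form the feature
# word set of the whole training set.
texts = ["the market rallied", "a team won the match"]
vocabulary = set().union(*(feature_words(t) for t in texts))
```

Because each per-text result is already a set, the union performs the de-duplication S401 calls for.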
S402: according to the degree of correlation between each feature word in the training set's feature word set and the preset text categories, and the word length of each feature word, determine the feature score of each feature word.
S403: record the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training set.
The specific formulas and procedures used to determine the feature scores of the feature words and finally obtain the text feature set of the training set can be the concrete extraction methods and formulas of the text classification feature extraction embodiments corresponding to Fig. 2 and Fig. 3 above, and are not repeated here.
S404: according to the text feature set of the training set, obtain the feature word set of each text in the test set;
Processing each test-set text in S404 can specifically comprise: performing word segmentation on the test-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, while also deleting, according to the feature words in the text feature set of the training set, any words in the test-set text's word set that do not exist in that text feature set, to obtain the feature word set. The test set comprises one or more texts whose categories need to be determined, called test-set texts; these can include microblog content, press releases, paper texts and other texts of unknown category.
S405: perform text vectorisation according to the text feature set of the training set and the feature word set of each text in the test set, obtaining the text vector of each text in the training set and of each text in the test set, and forming the text vector set of the training set and the text vector set of the test set.
A vector set is the set of vectors obtained by converting each feature in the text feature set of each training-set text, or in the feature word set of each test-set text, into a corresponding vector. In this embodiment, S405 can specifically comprise: assigning an index to each feature word in the training set's text feature set and in each test-set text's feature word set; determining, according to the training set's text feature set, the weight of each feature word in each training-set text's feature set, and determining the weight of each feature word in each test-set text's feature word set, wherein the weighting algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm; and generating vectors from the indexes and weights of the feature words to obtain the text vector of each text in the training set and the test set, forming the text vector sets of the training set and the test set.
Specifically, the index values of the feature words in the training set's text feature set are assigned per text in the training set, so that each training-set text obtains text classification features with assigned index values: among the indexed features of a given text, a feature word that does not occur in that text is assigned index value 0, while a feature word that does occur is assigned index value 1. The TF-IDF weighting algorithm or the like is then used to determine the weight of each indexed feature word in each training-set text, yielding the text vector of each training-set text. For the test set, after index values are assigned to each test-set text, the TF-IDF weighting algorithm can be applied directly to determine the weight of each feature word in the text, so as to finally obtain the text vector of each test-set text.
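The index assignment and TF-IDF weighting of S405 can be sketched as below, using the common raw-count TF and log(N/df) IDF variant; the patent does not fix a particular TF-IDF formulation, so this is an assumption:

```python
import math

def tfidf_vectors(docs, vocabulary):
    """Assign each vocabulary word an index, then weight each document's
    words by raw term frequency times log(N / document frequency)."""
    index = {w: i for i, w in enumerate(sorted(vocabulary))}
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocabulary}
    vectors = []
    for doc in docs:
        vec = [0.0] * len(index)
        for w in set(doc) & vocabulary:
            vec[index[w]] = doc.count(w) * math.log(n / df[w])
        vectors.append(vec)
    return vectors, index

# Two toy documents as token lists, over a four-word vocabulary.
docs = [["market", "stock", "market"], ["team", "match"]]
vecs, index = tfidf_vectors(docs, {"market", "stock", "team", "match"})
```

Words outside the vocabulary are simply skipped, which mirrors S404's removal of words absent from the training set's text feature set.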
Of course, the text vectorisation of the corresponding sets in S405 can also be realised directly with existing techniques.
S406: generate a text classification model from the text vector set of the training set, and classify each text vector in the text vector set of the test set according to the generated model, obtaining the category of each text in the test set.
Generating the text classification model from the vector sets of the training-set texts in S406 can specifically comprise: normalising each text vector in the training set's text vector set so that the weight of each feature item in each text vector is projected into a preset numerical range; generating the text classification model according to the normalised text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model; and classifying each text vector in the test set's text vector set according to the generated model to obtain the category of each test-set text. Alternatively it comprises: generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a naive Bayes classification model; and classifying each text vector in the test set's text vector set according to the generated model to obtain the category of each test-set text.
Because the text classification model is generated from texts of one or more different types, the generated model can classify texts of unknown category; the text category of the test-set texts can thus be determined through the text classification model.
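As one concrete instance of the naive Bayes option in S406 (the SVM option would typically use an existing library), here is a minimal multinomial naive Bayes with Laplace smoothing on a made-up two-category corpus; all names and data are illustrative:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocab = {w for d in docs for w in d}
        self.prior = Counter(labels)
        self.counts = defaultdict(Counter)
        for d, y in zip(docs, labels):
            self.counts[y].update(d)
        return self

    def predict(self, doc):
        total = sum(self.prior.values())
        best, best_lp = None, float("-inf")
        for y, ny in self.prior.items():
            lp = math.log(ny / total)  # log prior of category y
            denom = sum(self.counts[y].values()) + len(self.vocab)
            for w in doc:
                lp += math.log((self.counts[y][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best

# Made-up two-category training documents (as token lists).
train = [["goal", "match", "team"], ["stock", "market", "shares"]]
model = NaiveBayes().fit(train, ["sports", "finance"])
label = model.predict(["team", "goal"])
```

Laplace smoothing (the +1 in the numerator, with the vocabulary size in the denominator) keeps unseen words from driving a category's probability to zero.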
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Below, the text classification feature extraction apparatus of the embodiment of the present invention is described in detail.
Referring to Fig. 5, which is a structural diagram of a text classification feature extraction apparatus according to an embodiment of the present invention: the text extraction apparatus described in the embodiment can be arranged in equipment such as a text server. The training set involved is a preset collection of multiple texts, called training-set texts, whose categories are known; the feature extraction of the embodiment of the present invention is performed on the training-set texts so that a corresponding classification model can be generated from the extraction result to classify the texts of an unknown test set. The apparatus of the embodiment of the present invention can comprise:
An acquisition module 11, for obtaining the feature word set of the training-set texts;
The feature word set comprises the words or characters that can best reflect the meaning the training-set texts are intended to express. Specifically, in the embodiment of the present invention, obtaining the feature word set of the training-set texts can comprise: performing word segmentation on the training-set texts to obtain their word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, to obtain the feature word set.
A determination module 12, for determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and the preset text categories, and the word length of each feature word.
A recording module 13, for recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training-set texts.
The degree of correlation between a feature word and a text category can be obtained as the ratio of the number of documents containing the feature word, among all texts known to belong to the preset text category, to the total number of documents of that category.
Meanwhile, in general, the shorter a word is, the less information it expresses, and a single character expresses the least of all; conversely, the longer a word is, the better it can reflect the text category, so word length can be introduced to score the feature words.
Since both the word length and the degree of correlation have been obtained, the word length can be taken as a weight and multiplied with the degree of correlation to obtain the feature score. Different feature words differ in degree of correlation and word length, so their feature scores also differ; the recording module 13 then deletes the feature words with smaller scores and retains those with larger scores, obtaining the text feature set.
After the determination module 12 has determined the feature scores of all feature words, the recording module 13 deletes the feature words in the feature word set whose scores are below the preset score threshold, and the remaining feature words form the text feature set of the target text.
Further, referring to Fig. 6, which is a schematic diagram of one concrete structure of the acquisition module in Fig. 5, the acquisition module 11 can specifically be realised with the following units:
A word segmentation unit 111, for performing word segmentation on the training-set texts to obtain their word set.
A deletion unit 112, for deleting the stop words in the word set to obtain the feature word set, the stop words comprising modal particles and/or personal pronouns.
The training-set texts are texts that have already been labelled with categories; they can specifically be microblog content texts, press release texts, paper texts and the like whose categories are already known. Word segmentation decomposes each sentence in the training-set texts into words or characters, converting the texts into sets of words and characters. Segmentation can be performed with existing segmentation methods and is not repeated here.
Deleting stop words comprises deleting punctuation marks and words without special significance such as modal particles and personal pronouns. Such stop words may occur in any text, so their ability to represent a text is weak and they cannot represent its topic; they therefore need to be deleted.
More specifically, the deletion unit 112 can further comprise the following subunits:
A comparison subunit 1121, for comparing each word in the word set with the preset stop words in a preset stop-word dictionary.
A deletion subunit 1122, for deleting, according to the comparison result, the words in the word set that are identical to the preset stop words, obtaining the feature word set.
The stop words in the stop-word dictionary are entered by the user in advance and comprise words without special significance, including all kinds of auxiliary words, personal pronouns and the like. Equipment such as a text server deletes the corresponding words from the target text's word set by comparing them one by one, obtaining the feature word set, i.e. the first feature word set.
Further, referring to Fig. 7, which is a schematic diagram of one concrete structure of the determination module in Fig. 5, the determination module 12 can specifically comprise the following units:
A first determining unit 131, for determining the degree of correlation between each feature word in the feature word set and each preset text category.
A second determining unit 132, for determining the length weight of each feature word according to its word length.
A third determining unit 133, for determining the feature score of each feature word according to its degree of correlation and length weight.
The first determining unit 131 can determine the degree of correlation between a feature word and a text category as the ratio of the number of documents containing the feature word, among all texts known to belong to that category, to the total number of documents.
In the embodiment of the present invention, the degree of correlation can be computed as:
R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|;
Wherein, R_jk denotes the degree of correlation between feature word t_k and text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in text category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in text category C_j.
For the second determining unit 132: in general, the shorter a word is, the less information it expresses, and a single character expresses the least of all; conversely, the longer a word is, the better it can reflect the text category, so word length can be introduced to score the feature words.
In the embodiment of the present invention, the length weight can be computed as:
weight(len(t_k)) = log(e + len(t_k));
Wherein, e is a preset constant obtained by the user from classification experience, and len(t_k) is the length of feature word t_k.
Since both the word length and the degree of correlation have been obtained, the third determining unit 133 can, for example, take the word length as a weight and multiply it with the degree of correlation to obtain the feature score. Different feature words differ in degree of correlation and word length, so their feature scores also differ; the feature words with smaller scores can then be deleted according to the feature score and those with larger scores retained, obtaining the text feature set, i.e. the second feature word set.
Specifically, in the embodiment of the present invention, the third determining unit 133 can determine the feature score according to the following formulas.
First, according to the degree of correlation of each feature word, its class discrimination ability in each corresponding text category is determined. In this embodiment, the class discrimination ability is computed as:
Diff_jk = min(|R_jk - R_ik|), i ≠ j;
Wherein, Diff_jk denotes the class discrimination ability of feature word t_k on text category C_j, R_jk denotes the degree of correlation between t_k and text category C_j, and R_ik denotes the degree of correlation between t_k and text category C_i.
The class discrimination ability value characterizes the difference between a feature word's representative ability in one category and in the other categories; the larger the difference, the stronger the feature word's ability to distinguish that category from the others.
Secondly, the sum of the class discrimination abilities of the feature word over all preset text categories is determined. In this embodiment, the class discrimination ability sum is computed as:
Diff_k = Σ_{j=1..n} Diff_jk;
Wherein, Diff_k is the sum of the class discrimination abilities of feature word t_k over all n preset text categories.
Finally, the feature score of each feature word is determined from its class discrimination ability sum and length weight. In this embodiment, the feature score is computed as:
f(t_k) = Diff_k × weight(len(t_k));
Wherein, f(t_k) is the feature score of feature word t_k.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
Moreover, by determining each feature word's class discrimination ability in each category from the degree of correlation, and then determining and screening the feature scores from the class discrimination ability sum and the length weight, the feature words characterizing the target text category information in the training-set texts can be extracted more accurately, further guaranteeing the accuracy of feature word extraction.
Referring now to Fig. 8, which is a structural diagram of a text classification apparatus according to an embodiment of the present invention: the text extraction apparatus described in the embodiment can be arranged in equipment such as a text server, so that after the features of a target text have been extracted according to the text classification feature extraction method, the classification of that target text can be completed. Specifically, the apparatus of the embodiment of the present invention can comprise: a feature extraction module 21, an acquisition module 22, a vector determination module 23 and a classification module 24.
The feature extraction module 21 is for obtaining the feature word set of each text in the training set, merging them with de-duplication to form the feature word set of the training set; determining the feature score of each feature word according to the degree of correlation between each feature word in the training set's feature word set and the preset text categories, and the word length of each feature word; and recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training set.
Specifically, the feature extraction module 21 can comprise the acquisition module 11 of the text classification feature extraction apparatus embodiment above to obtain the feature word set of each training-set text, remove the repeated words after merging, and then perform the determining and recording processing with the determination module 12 and the recording module 13, completing the obtaining of the text feature set of each text in the training set.
The acquisition module 22 is for obtaining the feature word set of each test-set text according to the text feature set of the training set.
The acquisition module 22 can obtain the feature word set of each test-set text specifically by: performing word segmentation on the test-set text to obtain its word set; and deleting the stop words in the word set, the stop words comprising modal particles and/or personal pronouns, while also deleting, according to the feature words in the text feature set of the training set, any words in the test-set text's word set that do not exist in that text feature set, to obtain the feature word set. The test set comprises one or more texts whose categories need to be determined, called test-set texts; these can include microblog content, press releases, paper texts and other texts of unknown category.
The vector determination module 23 is for performing text vectorisation according to the text feature set of the training set and the feature word set of each test-set text, obtaining the text vector of each text in the training set and of each text in the test set, and forming the text vector set of the training set and the text vector set of the test set;
A classification module 24, for generating a text classification model from the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated model, obtaining the category of each text in the test set.
Specifically, referring to Fig. 9, which is a schematic diagram of one concrete structure of the vector determination module in Fig. 8: the vector determination module 23 can use prior-art vectorisation to operate on the text classification feature vectors, and can specifically comprise:
An index assignment unit 231, for assigning an index to each feature word in the training set's text feature set and in the feature word set of each test-set text;
A weight determining unit 232, for determining, according to the training set's text feature set, the weight of each feature word in each training-set text's feature set, and determining the weight of each feature word in each test-set text's feature word set, wherein the weighting algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
A vector determining unit 233, for generating vectors from the indexes and weights of the feature words, obtaining the text vector of each text in the training set and the test set, and forming the text vector sets of the training set and the test set.
Further, referring to Figure 10, which is a schematic diagram of one concrete structure of the classification module in Fig. 8, the classification module 24 can comprise:
A model generation unit 241, for normalising each text vector in the training set's text vector set so that the weight of each feature item in each text vector is projected into a preset numerical range, and generating the text classification model according to the normalised text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
A first classification unit 242, for classifying each text vector in the test set's text vector set according to the generated model, obtaining the category of each test-set text.
Further, the classification module 24 can also comprise:
A second classification unit 243, for generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a naive Bayes classification model, and classifying each text vector in the test set's text vector set according to the generated model, obtaining the category of each test-set text.
The classification module 24 can comprise the above model generation unit 241, first classification unit 242 and second classification unit 243 simultaneously, so that the classification of the target text can be performed based on either the SVM classification model or the naive Bayes model as required. Of course, it can also comprise only the model generation unit 241 and the first classification unit 242, or only the second classification unit 243, so as to perform the classification of the target text based on the SVM classification model alone or the naive Bayes model alone.
After word segmentation yields the word set, the embodiment of the present invention further performs feature extraction on it according to the degree of correlation between each feature word in the word set and the text categories, and the length of each feature word. This effectively reduces the number of feature words while still obtaining feature words able to express the text's information, which makes subsequent classification more convenient, shortens the classification run time, reduces the time and space overhead of classification processing, and saves classification cost.
One of ordinary skill in the art will appreciate that all or part of the flows of the above embodiment methods can be accomplished by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM) or the like.
The above disclosure is only a preferred embodiment of the present invention and certainly cannot thereby limit the scope of rights of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (21)

1. A text classification feature extraction method, characterized by comprising:
obtaining the feature word set of training-set texts;
determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and preset text categories, and the word length of the feature word;
recording the feature words whose feature scores are higher than a preset score threshold, obtaining the text feature set of the training-set texts.
2. The extraction method of claim 1, characterized in that obtaining the feature word set of the training-set texts comprises:
performing word segmentation on the training-set texts to obtain the word set of the training-set texts;
deleting the stop words in the word set to obtain the feature word set, the stop words in the word set comprising modal particles and/or personal pronouns.
3. The method of claim 2, characterized in that deleting the stop words in the word set to obtain the feature word set comprises:
comparing each word in the word set with the preset stop words in a preset stop-word dictionary;
deleting, according to the comparison result, the words in the word set that are identical to the preset stop words, obtaining the feature word set.
4. The method of any one of claims 1-3, characterized in that determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and the preset text categories, and the word length of the feature word, comprises:
determining the degree of correlation between each feature word in the feature word set and each preset text category;
determining the length weight of each feature word according to its word length;
determining the feature score of each feature word according to its degree of correlation and length weight.
5. The method of claim 4, characterized in that determining the feature score of each feature word according to its degree of correlation and length weight comprises:
determining, according to the degree of correlation of a feature word, the class discrimination ability of the feature word in each corresponding text category;
determining the sum of the class discrimination abilities of the feature word over all preset text categories;
determining the feature score of each feature word according to the sum of its class discrimination abilities and its length weight.
6. The method of claim 5, characterized in that, in determining the degree of correlation between each feature word in the feature word set and the preset text categories, the degree of correlation is computed by the formula:
R_jk = |{i : t_k ∈ d_i, d_i ∈ C_j}| / |C_j|;
where R_jk denotes the degree of correlation between feature word t_k and text category C_j, |{i : t_k ∈ d_i, d_i ∈ C_j}| denotes the number of documents in text category C_j that contain feature word t_k, and |C_j| denotes the total number of documents in text category C_j.
7. The method of claim 4, characterized in that, in determining the length weight of each feature word according to its word length, the length weight is computed by the formula:
weight(len(t_k)) = log(e + len(t_k));
where e is the preset natural constant (the base of the natural logarithm) and len(t_k) is the length of feature word t_k.
8. The method of claim 7, characterized in that:
in determining, according to the degree of correlation of each feature word, the class discrimination ability of each feature word in the corresponding text categories, the class discrimination ability is computed by the formula:
Diff_jk = min_{i≠j}(|R_jk - R_ik|);
where Diff_jk denotes the class discrimination ability of feature word t_k in text category C_j, R_jk denotes the degree of correlation between feature word t_k and text category C_j, and R_ik denotes the degree of correlation between feature word t_k and text category C_i;
in determining the sum of the class discrimination abilities of the feature word over all preset text categories, the sum is computed by the formula:
Diff_k = Σ_{j=1..n} Diff_jk;
where Diff_k is the sum of the class discrimination abilities of feature word t_k over all preset text categories;
in determining the feature score of each feature word according to the sum of the class discrimination abilities and the length weight, the feature score is computed by the formula:
f(t_k) = Diff_k × weight(len(t_k));
where f(t_k) is the feature score of feature word t_k.
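Taken together, the formulas of claims 6-8 can be sketched over a toy corpus as follows. The function names and the two-category corpus are invented for illustration; only the three formulas themselves come from the claims.

```python
# Sketch of the scoring formulas in claims 6-8 over an invented toy corpus.
import math

def relevance(docs_by_cat, cat, term):
    """R_jk: fraction of documents in category `cat` that contain `term`."""
    docs = docs_by_cat[cat]
    return sum(term in doc for doc in docs) / len(docs)

def length_weight(term):
    """weight(len(t_k)) = log(e + len(t_k))."""
    return math.log(math.e + len(term))

def feature_score(docs_by_cat, term):
    """f(t_k) = Diff_k * weight(len(t_k)), where Diff_k sums the per-category
    discrimination abilities Diff_jk = min over i != j of |R_jk - R_ik|."""
    cats = list(docs_by_cat)
    R = {c: relevance(docs_by_cat, c, term) for c in cats}
    diff_k = sum(min(abs(R[j] - R[i]) for i in cats if i != j) for j in cats)
    return diff_k * length_weight(term)

corpus = {
    "sports": [{"ball", "team"}, {"ball", "score"}],
    "tech":   [{"chip", "code"}, {"code", "data"}],
}
# "ball" separates the two categories perfectly (R = 1.0 vs 0.0), so Diff_k = 2.0
print(feature_score(corpus, "ball"))
```

On this toy data the score equals 2 × log(e + 4), since "ball" has length 4 and maximal category discrimination.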
9. A text classification method, characterized in that it comprises:
separately obtaining the feature word set of each text in a training set, and merging and de-duplicating them to form the feature word set of the training set;
determining the feature score of each feature word according to the degree of correlation between each feature word in the feature word set of the training set and preset text categories, and the word length;
recording the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set;
obtaining, according to the text feature set of the training set, the feature word set of each text in a test set;
performing text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the text vector of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
generating a text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
10. The method of claim 9, characterized in that performing text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the test set, forming the text vector sets of the training set and the test set, comprises:
assigning an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
determining, according to the text feature set of the training set, the weight of each feature word of each text in the training set, and determining the weight of each feature word in the feature word set of each text in the test set, wherein the weight determination algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
generating a vector from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set respectively, yielding the text vector sets of the training set and the test set.
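The index assignment and TF-IDF weighting of claim 10 can be sketched as follows. This is a minimal hand-rolled version using the raw-count TF and log(N/df) IDF variant, one of several common TF-IDF formulations (the claim does not fix a specific one); the tokenized texts and vocabulary are invented toy data.

```python
# Sketch of claim 10: each feature word's position in `vocab` is its assigned
# index; each text becomes a vector of TF-IDF weights at those indices.
import math

def tfidf_vectors(token_lists, vocab):
    n = len(token_lists)
    # document frequency of each feature word across the text collection
    df = {w: sum(w in tokens for tokens in token_lists) for w in vocab}
    vectors = []
    for tokens in token_lists:
        vectors.append([
            tokens.count(w) * math.log(n / df[w]) if df[w] else 0.0
            for w in vocab  # vocab position = assigned feature word index
        ])
    return vectors

train_tokens = [["ball", "team", "ball"], ["chip", "code"]]
vocab = ["ball", "chip"]  # text feature set; list position is the index
print(tfidf_vectors(train_tokens, vocab))
```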
11. The method of claim 10, characterized in that generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set, comprises:
normalizing each text vector in the text vector set of the training set, so that the weight of each feature item in each text vector is projected into a preset numeric range;
generating the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
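The normalize-then-classify flow of claim 11 can be sketched as follows, assuming scikit-learn is available as a stand-in SVM implementation. The training vectors, labels, and the choice of L2 normalization are illustrative assumptions, not the patent's own parameters.

```python
# Sketch of claim 11, assuming scikit-learn: L2-normalize each training text
# vector so feature weights fall into a fixed numeric range, fit an SVM
# classification model, then classify the test vectors. All data is toy data.
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

X_train = [[2.0, 0.1], [0.0, 3.0], [4.0, 0.3], [0.2, 5.0]]
y_train = ["sports", "tech", "sports", "tech"]

X_norm = normalize(X_train)                 # each row scaled to unit length
model = LinearSVC().fit(X_norm, y_train)    # generate the classification model
X_test = normalize([[3.0, 0.2]])            # same projection for test vectors
print(model.predict(X_test))                # classify each test text vector
```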
12. The method of claim 10, characterized in that generating the text classification model according to the text vector set of the training set, and classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set, comprises:
generating the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a Naive Bayes classification model;
classifying each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
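The Naive Bayes variant of claim 12 differs from claim 11 only in the preset classification model. A sketch, again assuming scikit-learn and invented count vectors and labels:

```python
# Sketch of claim 12, assuming scikit-learn: the same pipeline shape as the
# SVM variant, but with multinomial Naive Bayes as the preset model.
from sklearn.naive_bayes import MultinomialNB

X_train = [[3, 0, 1], [0, 4, 1], [2, 1, 0], [0, 3, 2]]
y_train = ["sports", "tech", "sports", "tech"]

model = MultinomialNB().fit(X_train, y_train)  # text classification model
print(model.predict([[4, 0, 1]]))              # classify a test text vector
```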
13. A text classification feature extraction apparatus, characterized in that it comprises:
an acquisition module, configured to obtain the feature word set of training set texts;
a determination module, configured to determine the feature score of each feature word according to the degree of correlation between each feature word in the feature word set and preset text categories, and the word length of each feature word;
a recording module, configured to record the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set texts.
14. The apparatus of claim 13, characterized in that the acquisition module comprises:
a word segmentation unit, configured to perform word segmentation on the training set texts to obtain the word set of the training set texts;
a deletion unit, configured to delete stop words from the word set to obtain the feature word set, the stop words in the word set comprising mood auxiliary words and/or personal pronouns.
15. The apparatus of claim 14, characterized in that the deletion unit comprises:
a comparison subunit, configured to compare each segmented word in the word set with the preset stop words in a preset stop-word dictionary;
a deletion subunit, configured to delete, according to the comparison result, the segmented words in the word set that are identical to the preset stop words, to obtain the feature word set.
16. The apparatus of any one of claims 13-15, characterized in that the determination module comprises:
a first determining unit, configured to determine the degree of correlation between each feature word in the feature word set and each preset text category;
a second determining unit, configured to determine the length weight of each feature word according to its word length;
a third determining unit, configured to determine the feature score of each feature word according to its degree of correlation and length weight.
17. The apparatus of claim 16, characterized in that:
the third determining unit is specifically configured to: determine, according to the degree of correlation of a feature word, the class discrimination ability of the feature word in each corresponding text category; determine the sum of the class discrimination abilities of the feature word over all preset text categories; and determine the feature score of each feature word according to the sum of the class discrimination abilities and the length weight.
18. A text classification apparatus, characterized in that it comprises:
a feature extraction module, configured to separately obtain the feature word set of each text in a training set, merge and de-duplicate them to form the feature word set of the training set, determine the feature score of each feature word according to the degree of correlation between each feature word in the feature word set of the training set and preset text categories and the word length, and record the feature words whose feature scores are higher than a preset score threshold, to obtain the text feature set of the training set;
an acquisition module, configured to obtain, according to the text feature set of the training set, the feature word set of each text in a test set;
a vector determination module, configured to perform text vectorization according to the text feature set of the training set and the feature word set of each text in the test set, to obtain the text vector of each text in the training set and the text vector of each text in the test set, forming the text vector set of the training set and the text vector set of the test set;
a classification module, configured to generate a text classification model according to the text vector set of the training set, and classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
19. The apparatus of claim 18, characterized in that the vector determination module comprises:
an index assignment unit, configured to assign an index to each feature word in the text feature set of the training set and in the feature word set of each text in the test set;
a weight determination unit, configured to determine, according to the text feature set of the training set, the weight of each feature word of each text in the training set, and to determine the weight of each feature word in the feature word set of each text in the test set, wherein the weight determination algorithm comprises the term frequency-inverse document frequency (TF-IDF) weighting algorithm;
a vector determination unit, configured to generate a vector from the index and weight of each feature word, to obtain the text vector of each text in the training set and the test set respectively, yielding the text vector sets of the training set and the test set.
20. The apparatus of claim 19, characterized in that the classification module comprises:
a model generation unit, configured to normalize each text vector in the text vector set of the training set so that the weight of each feature item in each text vector is projected into a preset numeric range, and to generate the text classification model according to the normalized text vector set of the training set and a preset classification model, the preset classification model comprising a support vector machine (SVM) classification model;
a first classification unit, configured to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
21. The apparatus of claim 19, characterized in that the classification module comprises:
a second classification unit, configured to generate the text classification model according to the text vector set of the training set and a preset classification model, the preset classification model comprising a Naive Bayes classification model, and to classify each text vector in the text vector set of the test set according to the generated text classification model, to obtain the category of each text in the test set.
CN201210578378.0A 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device Active CN103902570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210578378.0A CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210578378.0A CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Publications (2)

Publication Number Publication Date
CN103902570A true CN103902570A (en) 2014-07-02
CN103902570B CN103902570B (en) 2018-11-09

Family

ID=50993898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210578378.0A Active CN103902570B (en) 2012-12-27 2012-12-27 A kind of text classification feature extracting method, sorting technique and device

Country Status (1)

Country Link
CN (1) CN103902570B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101290626A (en) * 2008-06-12 2008-10-22 昆明理工大学 Text categorization feature selection and weight computation method based on field knowledge
CN101477566A (en) * 2009-01-19 2009-07-08 腾讯科技(深圳)有限公司 Method and apparatus used for putting candidate key words advertisement
CN101477544A (en) * 2009-01-12 2009-07-08 腾讯科技(深圳)有限公司 Rubbish text recognition method and system
CN102289522A (en) * 2011-09-19 2011-12-21 北京金和软件股份有限公司 Method of intelligently classifying texts
US20120159263A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Temporal rule-based feature definition and extraction


Cited By (43)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354716A (en) * 2015-07-17 2017-01-25 华为技术有限公司 Method and device for converting text
CN106354716B (en) * 2015-07-17 2020-06-02 华为技术有限公司 Method and apparatus for converting text
CN105354184B (en) * 2015-10-28 2018-04-20 甘肃智呈网络科技有限公司 A kind of vector space model using optimization realizes the method that document is classified automatically
CN105354184A (en) * 2015-10-28 2016-02-24 甘肃智呈网络科技有限公司 Method for using optimized vector space model to automatically classify document
CN105447750A (en) * 2015-11-17 2016-03-30 小米科技有限责任公司 Information identification method, apparatus, terminal and server
CN105447750B (en) * 2015-11-17 2022-06-03 小米科技有限责任公司 Information identification method and device, terminal and server
CN105389307A (en) * 2015-12-02 2016-03-09 上海智臻智能网络科技股份有限公司 Statement intention category identification method and apparatus
CN105512104A (en) * 2015-12-02 2016-04-20 上海智臻智能网络科技股份有限公司 Dictionary dimension reducing method and device and information classifying method and device
CN106874295A (en) * 2015-12-11 2017-06-20 腾讯科技(深圳)有限公司 A kind of method and device for determining service parameter
CN105574105A (en) * 2015-12-14 2016-05-11 北京锐安科技有限公司 Text classification model determining method
CN105574105B (en) * 2015-12-14 2019-05-28 北京锐安科技有限公司 A kind of determination method of textual classification model
CN107045503B (en) * 2016-02-05 2019-03-05 华为技术有限公司 A kind of method and device that feature set determines
US11461659B2 (en) 2016-02-05 2022-10-04 Huawei Technologies Co., Ltd. Feature set determining method and apparatus
WO2017133188A1 (en) * 2016-02-05 2017-08-10 华为技术有限公司 Method and device for determining feature set
CN107045503A (en) * 2016-02-05 2017-08-15 华为技术有限公司 The method and device that a kind of feature set is determined
CN105930358A (en) * 2016-04-08 2016-09-07 南方电网科学研究院有限责任公司 Case retrieval method and system based on relevance
CN105930358B (en) * 2016-04-08 2019-06-04 南方电网科学研究院有限责任公司 Case retrieval method and system based on relevance
CN105956031A (en) * 2016-04-25 2016-09-21 深圳市永兴元科技有限公司 Text classification method and apparatus
CN106067037A (en) * 2016-05-27 2016-11-02 大连楼兰科技股份有限公司 DTC identification and classification stage
CN106056154A (en) * 2016-05-27 2016-10-26 大连楼兰科技股份有限公司 Fault code recognition and classification method
CN106528776A (en) * 2016-11-07 2017-03-22 上海智臻智能网络科技股份有限公司 Text classification method and device
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN106682411B (en) * 2016-12-22 2019-04-16 浙江大学 A method of disease label is converted by physical examination diagnostic data
CN106709370A (en) * 2016-12-31 2017-05-24 北京明朝万达科技股份有限公司 Long word identification method and system based on text contents
CN106709370B (en) * 2016-12-31 2019-10-29 北京明朝万达科技股份有限公司 A kind of long word recognition method and system based on content of text
CN106897428A (en) * 2017-02-27 2017-06-27 腾讯科技(深圳)有限公司 Text classification feature extracting method, file classification method and device
CN106897428B (en) * 2017-02-27 2022-08-09 腾讯科技(深圳)有限公司 Text classification feature extraction method and text classification method and device
CN107798033B (en) * 2017-03-01 2021-07-02 中南大学 Case text classification method in public security field
CN107092679B (en) * 2017-04-21 2020-01-03 北京邮电大学 Feature word vector obtaining method and text classification method and device
CN107092679A (en) * 2017-04-21 2017-08-25 北京邮电大学 A kind of feature term vector preparation method, file classification method and device
CN107908783A (en) * 2017-12-07 2018-04-13 百度在线网络技术(北京)有限公司 Retrieve appraisal procedure, device, server and the storage medium of text relevant
CN108228869A (en) * 2018-01-15 2018-06-29 北京奇艺世纪科技有限公司 The method for building up and device of a kind of textual classification model
CN108491406A (en) * 2018-01-23 2018-09-04 深圳市阿西莫夫科技有限公司 Information classification approach, device, computer equipment and storage medium
CN108595542A (en) * 2018-04-08 2018-09-28 北京奇艺世纪科技有限公司 A kind of textual classification model generates, file classification method and device
CN108595542B (en) * 2018-04-08 2021-11-02 北京奇艺世纪科技有限公司 Text classification model generation method and device, and text classification method and device
CN108520740A (en) * 2018-04-13 2018-09-11 国家计算机网络与信息安全管理中心 Based on manifold audio content consistency analysis method and analysis system
CN108984518A (en) * 2018-06-11 2018-12-11 人民法院信息技术服务中心 A kind of file classification method towards judgement document
CN109063217B (en) * 2018-10-29 2020-11-03 广东电网有限责任公司广州供电局 Work order classification method and device in electric power marketing system and related equipment thereof
WO2021042516A1 (en) * 2019-09-02 2021-03-11 平安科技(深圳)有限公司 Named-entity recognition method and device, and computer readable storage medium
CN112906386A (en) * 2019-12-03 2021-06-04 深圳无域科技技术有限公司 Method and device for determining text features
CN112906386B (en) * 2019-12-03 2023-08-11 深圳无域科技技术有限公司 Method and device for determining text characteristics
CN111708888A (en) * 2020-06-16 2020-09-25 腾讯科技(深圳)有限公司 Artificial intelligence based classification method, device, terminal and storage medium
CN111708888B (en) * 2020-06-16 2023-10-24 腾讯科技(深圳)有限公司 Classification method, device, terminal and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN103902570B (en) 2018-11-09

Similar Documents

Publication Publication Date Title
CN103902570A (en) Text classification feature extraction method, classification method and device
Putri et al. Latent Dirichlet allocation (LDA) for sentiment analysis toward tourism review in Indonesia
CN104391835B (en) Feature Words system of selection and device in text
US9779085B2 (en) Multilingual embeddings for natural language processing
Hamouda et al. Sentiment analyzer for arabic comments system
Durant et al. Predicting the political sentiment of web log posts using supervised machine learning techniques coupled with feature selection
WO2011085562A1 (en) System and method for automatically extracting metadata from unstructured electronic documents
CN103593431A (en) Internet public opinion analyzing method and device
CN109165529B (en) Dark chain tampering detection method and device and computer readable storage medium
Abid et al. Spam SMS filtering based on text features and supervised machine learning techniques
CN110990676A (en) Social media hotspot topic extraction method and system
CN104462229A (en) Event classification method and device
CN103268346A (en) Semi-supervised classification method and semi-supervised classification system
Xia et al. Improving patient opinion mining through multi-step classification
Mekala et al. A Novel Document Representation Approach for Authorship Attribution.
Siddiqui et al. Quality Prediction of Wearable Apps in the Google Play Store.
CN107315807B (en) Talent recommendation method and device
CN109359274A (en) The method, device and equipment that the character string of a kind of pair of Mass production is identified
Guo Social network rumor recognition based on enhanced naive bayes
CN111611394B (en) Text classification method and device, electronic equipment and readable storage medium
Yousef et al. TopicsRanksDC: distance-based topic ranking applied on two-class data
KR101240330B1 (en) System and method for mutidimensional document classification
Moohebat et al. Linguistic feature classifying and tracing
Smith et al. Classification of text to subject using LDA
Sharif et al. A scoping review of topic modelling on online data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant