CN106372640A - Character frequency text classification method - Google Patents


Info

Publication number
CN106372640A
CN106372640A
Authority
CN
China
Prior art keywords
text
word
vocabulary
corpus
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610698064.2A
Other languages
Chinese (zh)
Inventor
谭军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201610698064.2A priority Critical patent/CN106372640A/en
Publication of CN106372640A publication Critical patent/CN106372640A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a character frequency text classification method. The method comprises the following steps: preprocessing an input text; performing Chinese character segmentation on the processed text; forming a corpus; removing stop words from the corpus; forming a vocabulary-text matrix; training the sample with a classifier; and calculating the recall rate of the character frequencies, where the recall rate equals the number of correctly classified texts divided by the number of actual texts. The classification method has the following characteristic: character frequencies perform much better than word frequencies, and with random forest (RF), an artificial neural network (NNET), and the combined classifiers Bagging and Boosting, the recall rates even all reach 100%. This demonstrates that in commodity descriptions, character frequencies are more discriminative features than word frequencies.

Description

A character frequency text classification method
Technical field
The present invention relates to the field of text classification, and in particular to a character frequency text classification method.
Background technology
According to experimental results in traditional automatic text categorization, words are generally considered better feature items than characters and phrases. In commodity description classification, however, the situation differs because of the following exclusive characteristics:
1) Commodity description texts are typically rather short, so feature words are few and very sparse; information such as word frequency and word co-occurrence frequency cannot be fully exploited, and it is difficult to capture the relatedness between feature words in depth.
2) Commodity descriptions contain many abbreviations, substitute words, colloquialisms, and other feature words that are hard to distinguish, and their form of expression tends to be colloquial, so many word segmentation algorithms do not perform well on them.
The Chinese character is the most basic linguistic unit in Chinese, while the word is the smallest linguistic unit carrying semantics. In commodity descriptions, using words as feature items inevitably faces a complex word segmentation problem.
Content of the invention
The present invention proposes a character frequency text classification method that uses Chinese characters as feature items: the segmentation problem for characters is simpler than the segmentation problem for words, and the number of distinct Chinese characters is much smaller than the number of words, so the efficiency of feature extraction can be improved.
To achieve this goal, the technical scheme of the invention is as follows:
A character frequency text classification method, comprising the following:
Preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the vocabulary-text matrix (the sample) with a classifier; and calculate the recall rate of the character frequencies as:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$
The process of forming the vocabulary-text matrix is:
In the R environment, the TermDocumentMatrix function of the "tm" package is used to form the vocabulary-text matrix; the matrix is built according to the vector space model. The vector space model represents the information of a text with a vector, so that each text becomes a point in feature space; a text collection then forms a matrix in the vector space model, i.e., a set of points in feature space;
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of the feature item (i.e., its character frequency).
The feature-item (i.e., character frequency) weights in the model are obtained through the tf-idf weighting method;
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}.$$
Preferably, preprocessing the input text means: in the R environment, loading the raw data file, removing digits, symbols, and letters, and then writing the result to an output file.
Preferably, Chinese character segmentation of the processed text means: after importing the codecs library, writing the text output by preprocessing to a new file in the form of one character followed by an English comma.
Preferably, forming the corpus means: in the R environment, using the VCorpus function of the "tm" package to form the corpus.
Preferably, removing the stop words in the corpus means: in the R environment, using the tm_map function of the "tm" package to delete stop words, where the stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary and compiling them into a stop-word list.
If the semantic relations between linguistic units are not considered, the two feature extraction methods are similar from a statistical point of view. The experiments below show that in commodity description classification, character frequency is clearly better than word frequency as the feature item.
Brief description
Fig. 1 is a flow chart of the character frequency text classification method of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawing, but embodiments of the present invention are not limited thereto.
A character frequency text classification method comprises the following: preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the sample (i.e., the vocabulary-text matrix) with a classifier; and calculate the recall rate of the character frequencies.
The process of the method is now described in detail:
The input texts serve as the raw data: 90 samples in total, divided into three classes, namely clothing, books, and cosmetics. The data has two columns: the first column is the commodity category and the second column is the commodity description, as in Table 1 below:
Table 1
Preprocess the input text: in the R environment, load the raw data file, remove digits, symbols, and letters, and then write the result to an output file.
The result is shown in Table 2:
Table 2
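The patent performs this cleaning step in R; purely as an illustration, an equivalent Python sketch that keeps only Chinese characters (dropping digits, symbols, and Latin letters) could look like this. The function name and regular expression are assumptions, not the inventor's code.

```python
import re

def preprocess(text: str) -> str:
    """Keep only CJK Unified Ideographs; drop digits, symbols, and letters."""
    return "".join(re.findall(r"[\u4e00-\u9fff]", text))

# A commodity description with brand names, digits, and punctuation:
print(preprocess("Nike男鞋2016新款, 跑步鞋!"))  # -> 男鞋新款跑步鞋
```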
Perform Chinese character segmentation on the processed text. Because Python is very convenient for Chinese text processing, the text segmentation part is carried out in the Python environment: after importing the codecs library, the file output by the previous step is written to a new file in the form of one character followed by an English comma.
The result is shown in Table 3:
Table 3
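The character-segmentation step above (each character followed by an English comma, written to a new file) can be sketched in Python as follows; the helper names and file paths are illustrative assumptions, not the inventor's script.

```python
import codecs

def join_chars(line: str) -> str:
    """Turn one line of text into comma-separated single characters."""
    return ",".join(line.strip())

def segment_file(in_path: str, out_path: str) -> None:
    """Rewrite each line of the preprocessed file in character-comma form."""
    with codecs.open(in_path, "r", encoding="utf-8") as fin, \
         codecs.open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(join_chars(line) + "\n")

print(join_chars("男鞋新款"))  # -> 男,鞋,新,款
```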
Form the corpus: in the R environment, use the VCorpus function of the "tm" package to form the corpus.
Remove stop words: in the R environment, use the tm_map function of the "tm" package to delete stop words.
The stop-word list is composed as follows: the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words are downloaded from the online Xinhua dictionary and compiled into a stop-word list, as in Table 4.
Table 4
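The patent removes stop words with tm_map in R; a character-level equivalent in Python might look like the following sketch, where the sample stop-word set is a made-up stand-in for the list compiled from the online Xinhua dictionary.

```python
def remove_stopwords(chars, stopwords):
    """Drop any character that appears in the stop-word set."""
    return [c for c in chars if c not in stopwords]

stopwords = {"的", "了", "个"}  # illustrative function words only
print(remove_stopwords(list("新款的跑步鞋"), stopwords))
# -> ['新', '款', '跑', '步', '鞋']
```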
Form the vocabulary-text matrix: in the R environment, use the TermDocumentMatrix function of the "tm" package to form the vocabulary-text matrix. The matrix is built according to the vector space model (VSM), which represents the information of a text with a vector so that each text becomes a point in feature space; the text collection then forms a matrix in the vector space model, i.e., a set of points in feature space.
The vocabulary-text matrix (term-document matrix) is shown in Table 5:
Table 5
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of that item.
The weights in the model are obtained by the tf-idf weighting method.
Tf-idf is a statistical method for assessing how important a term is to a document in a collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but at the same time decreases in proportion to its frequency across the corpus. tf denotes the frequency with which a term occurs in document d. The main idea of idf is: the fewer the documents that contain term t, the larger the idf, which indicates that term t has good class discrimination ability.
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}$$
This method expresses well the importance of a keyword to a certain class of articles and is therefore widely adopted.
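The tf-idf weighting with cosine normalization described above can be sketched in plain Python as follows (a direct transcription of the formulas, not the patent's R code; documents are assumed to be lists of characters):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of character lists. Returns one dict per document
    mapping each character to its cosine-normalized tf-idf weight."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()                       # document frequency of each character
    for c in counts:
        df.update(c.keys())
    weights = []
    for c in counts:
        w = {}
        for t, k in c.items():
            tf = 1 + math.log(k)         # k > 0 by construction
            w[t] = tf * math.log(n / df[t])
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})
    return weights

w = tfidf_matrix([list("男鞋跑步"), list("女鞋凉鞋"), list("小说图书")])
```

Characters that occur in every document receive weight zero (idf = 0), while rare characters dominate the normalized vector.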
Train the classifiers: in the R environment, the functions in the "e1071" and "RTextTools" packages are used to train seven classifiers: naive Bayes, bagging, boosting, artificial neural network, random forest, support vector machine, and decision tree. Because the sample is small, all 90 samples are used as both training set and test set.
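The patent trains its seven classifiers through R's "e1071" and "RTextTools" packages. Purely as a stand-in illustration of training a classifier on character-count features, here is a tiny multinomial naive Bayes in standard-library Python (one of the seven classifier types named above, but not the patent's implementation; the toy data and labels are made up):

```python
import math
from collections import Counter, defaultdict

class CharNaiveBayes:
    """Multinomial naive Bayes over character counts, add-one smoothing."""
    def fit(self, docs, labels):
        self.vocab = {c for d in docs for c in d}
        self.prior = Counter(labels)                 # class frequencies
        self.counts = defaultdict(Counter)           # class -> char counts
        for d, y in zip(docs, labels):
            self.counts[y].update(d)
        self.total = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, doc):
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y in self.prior:
            lp = math.log(self.prior[y])
            for ch in doc:
                lp += math.log((self.counts[y][ch] + 1) / (self.total[y] + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

docs = [list("男鞋跑步鞋"), list("连衣裙女装"), list("小说图书"), list("历史图书")]
labels = ["clothes", "clothes", "books", "books"]
clf = CharNaiveBayes().fit(docs, labels)
print(clf.predict(list("图书")))  # -> books
```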
Calculate the recall rate: the recall rate reflects the accuracy of the classification. Its formula is:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$

The calculation results are shown in Table 6:
Table 6
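The recall formula above (correctly classified texts divided by actual texts, computed per class) takes only a few lines of Python; the class labels here are illustrative.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """recall = correctly classified texts / actual texts, for each class."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / actual[c] for c in actual}

y_true = ["clothes", "books", "makeup", "clothes"]
y_pred = ["clothes", "books", "clothes", "clothes"]
print(recall_per_class(y_true, y_pred))
# -> {'clothes': 1.0, 'books': 1.0, 'makeup': 0.0}
```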
Comparison with word frequency
Word segmentation
In the R environment, the jiebaR package is used to perform word segmentation on the samples. The segmentation algorithm combines a hidden Markov model with the maximum-probability model.
The word segmentation result is as follows:
The result after removing stop words is as follows:
Result comparison
The recall rates using word frequency are as follows:
The recall rates using character frequency are as follows:
From the results of this experiment: even when word frequency is used as the feature, the classification results of random forest (RF) and of the combined classifiers bagging and boosting are still fairly good. The contrast makes clear, however, that for most classification algorithms the effect of character frequency is much better than that of word frequency; with random forest (RF), artificial neural network (NNET), and the combined classifiers bagging and boosting, the recall rates even all reach 100%. This proves that in commodity descriptions, character frequency is a more discriminative feature than word frequency.
The embodiments of the invention described above do not limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (5)

1. A character frequency text classification method, characterized by comprising the following:
Preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the vocabulary-text matrix with a classifier; and calculate the recall rate of the character frequencies as:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$
The process of forming the vocabulary-text matrix is:
In the R environment, the TermDocumentMatrix function of the "tm" package is used to form the vocabulary-text matrix; the matrix is built according to the vector space model, which represents the information of a text with a vector so that each text becomes a point in feature space, and a text collection forms a matrix in the vector space model, i.e., a set of points in feature space;
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of the feature item.
The feature-item weights in the model are obtained through the tf-idf weighting method;
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}.$$
2. The method according to claim 1, characterized in that preprocessing the input text means: in the R environment, loading the raw data file, removing digits, symbols, and letters, and then writing the result to an output file.
3. The method according to claim 1, characterized in that Chinese character segmentation of the processed text means: after importing the codecs library, writing the text output by preprocessing to a new file in the form of one character followed by an English comma.
4. The method according to claim 1, characterized in that forming the corpus means: in the R environment, using the VCorpus function of the "tm" package to form the corpus.
5. The method according to claim 1, characterized in that removing the stop words in the corpus means: in the R environment, using the tm_map function of the "tm" package to delete stop words, where the stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary and compiling them into a stop-word list.
CN201610698064.2A 2016-08-19 2016-08-19 Character frequency text classification method Pending CN106372640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610698064.2A CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610698064.2A CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Publications (1)

Publication Number Publication Date
CN106372640A true CN106372640A (en) 2017-02-01

Family

ID=57878340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610698064.2A Pending CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Country Status (1)

Country Link
CN (1) CN106372640A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107609113A (en) * 2017-09-13 2018-01-19 北京科技大学 A kind of Automatic document classification method
CN109241276A (en) * 2018-07-11 2019-01-18 河海大学 Word's kinds method, speech creativeness evaluation method and system in text
CN109712680A (en) * 2019-01-24 2019-05-03 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 specification
CN109840281A (en) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 A kind of self study intelligent decision method based on random forests algorithm
CN110427959A (en) * 2019-06-14 2019-11-08 合肥工业大学 Complain classification method, system and the storage medium of text
CN115455987A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机***有限公司 Text topic classification method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机***有限公司 Text topic classification method and system
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
梅君 (Mei Jun): "中文文本分类的研究与应用" (Research and application of Chinese text classification), China Master's Theses Full-text Database, Information Science and Technology Series *
王梦云 et al. (Wang Mengyun et al.): "基于字频向量的中文文本自动分类***" (Automatic Chinese text classification based on character-frequency vectors), 《情报学报》 (Journal of the China Society for Scientific and Technical Information) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107491541B (en) * 2017-08-24 2021-03-02 北京丁牛科技有限公司 Text classification method and device
CN107609113A (en) * 2017-09-13 2018-01-19 北京科技大学 A kind of Automatic document classification method
CN109241276A (en) * 2018-07-11 2019-01-18 河海大学 Word's kinds method, speech creativeness evaluation method and system in text
CN109241276B (en) * 2018-07-11 2022-03-08 河海大学 Word classification method in text, and speech creativity evaluation method and system
CN109712680A (en) * 2019-01-24 2019-05-03 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 specification
CN109712680B (en) * 2019-01-24 2021-02-09 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 standard
CN109840281A (en) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 A kind of self study intelligent decision method based on random forests algorithm
CN110427959A (en) * 2019-06-14 2019-11-08 合肥工业大学 Complain classification method, system and the storage medium of text
CN115455987A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN106372640A (en) Character frequency text classification method
CN108509629B (en) Text emotion analysis method based on emotion dictionary and support vector machine
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN112287684B (en) Short text auditing method and device for fusion variant word recognition
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN109241530A (en) A kind of more classification methods of Chinese text based on N-gram vector sum convolutional neural networks
CN107229610A (en) The analysis method and device of a kind of affection data
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN109271517B (en) IG TF-IDF text feature vector generation and text classification method
Kruengkrai et al. Language identification based on string kernels
CN102567308A (en) Information processing feature extracting method
CN106599054A (en) Method and system for title classification and push
WO2014022172A2 (en) Information classification based on product recognition
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
US11960521B2 (en) Text classification system based on feature selection and method thereof
CN110705247A (en) Based on x2-C text similarity calculation method
CN115080973B (en) Malicious code detection method and system based on multi-mode feature fusion
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN105912720B (en) A kind of text data analysis method of emotion involved in computer
Huda et al. A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer
CN106503153A (en) A kind of computer version taxonomic hierarchies, system and its file classification method
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN113095723A (en) Coupon recommendation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170201

RJ01 Rejection of invention patent application after publication