CN106372640A - Character frequency text classification method - Google Patents
Character frequency text classification method
- Publication number
- CN106372640A CN106372640A CN201610698064.2A CN201610698064A CN106372640A CN 106372640 A CN106372640 A CN 106372640A CN 201610698064 A CN201610698064 A CN 201610698064A CN 106372640 A CN106372640 A CN 106372640A
- Authority
- CN
- China
- Prior art keywords
- text
- word
- vocabulary
- corpus
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/158—Segmentation of character regions using character size, text spacings or pitch estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention provides a character-frequency text classification method. The method comprises the following steps: preprocessing an input text; segmenting the processed text into Chinese characters; forming a corpus; removing stop words from the corpus; forming a term-document matrix; training classifiers on the samples; and calculating the recall rate of the character-frequency features according to the formula: recall rate = number of correctly classified texts / number of actual texts. The classification method has the following characteristic: character frequency performs much better than word frequency as a feature; with random forest (RF), an artificial neural network (NNET), and the combined Bagging and Boosting classifier algorithms, the recall rates all reach 100%. This demonstrates that, for commodity descriptions, character frequency is a better feature than word frequency.
Description
Technical field
The present invention relates to the field of text classification, and more particularly to a character-frequency text classification method.
Background art
According to experimental results, in traditional automatic text categorization it is generally believed that words are better feature items than characters and phrases. In commodity-description classification, however, the situation differs because of the domain's particular characteristics. For example:
1) Commodity descriptions are usually quite short, so feature words are few and very sparse; information such as word frequency and word co-occurrence frequency cannot be fully exploited, and it is difficult to capture the relatedness between feature words in depth.
2) Commodity descriptions contain many abbreviations, substitute words, colloquialisms, and other feature words that are hard to distinguish, and their style tends to be informal, so many word-segmentation algorithms do not perform well.
The Chinese character is the most basic linguistic unit in Chinese, while the word is the smallest linguistic unit that carries semantics. Using words as feature items in commodity descriptions inevitably runs into the complex word-segmentation problem.
Content of the invention
The present invention proposes a character-frequency text classification method that uses Chinese characters as feature items. Because splitting text into characters is simpler than segmenting it into words, and the number of distinct Chinese characters is much smaller than the number of words, feature-extraction efficiency can be improved.
To achieve this goal, the technical scheme of the present invention is as follows:
A character-frequency text classification method, comprising the following: preprocess the input text; segment the processed text into Chinese characters; form a corpus; remove the stop words from the corpus; form a term-document matrix; train classifiers on the term-document matrix (the samples); and calculate the recall rate of the character-frequency features, where the recall rate is calculated as:
recall rate = number of correctly classified texts / number of actual texts
The process of forming the term-document matrix is as follows:
Under the R environment, the TermDocumentMatrix function in the "tm" package is used to form the term-document matrix. The matrix is built according to the vector space model, which represents the information of a text as a vector so that the text becomes a point in feature space; the text collection then forms a matrix in the vector space model, i.e., a set of points in feature space.
word_i is a feature item in the vector space model, and w_ij is the weight of the feature item (i.e., the character frequency).
The feature-item (character-frequency) weight values in the model are obtained by the tf-idf weighting method.
The tf-idf weight formula is:
tfidf(t_k, d_j) = tf(t_k, d_j) × log(|Tr| / #Tr(t_k))
where tf(t_k, d_j) denotes the frequency with which keyword t_k appears in document d_j, |Tr| is the total number of documents in the data collection, and #Tr(t_k) is the number of documents containing keyword t_k;
here tf(t_k, d_j) = #(t_k, d_j), the number of times keyword t_k appears in document d_j.
Finally, cosine normalization gives the final weight value:
w_kj = tfidf(t_k, d_j) / sqrt(Σ_s tfidf(t_s, d_j)²)
Preferably, preprocessing the input text means: under the R environment, load the raw data file, remove digits, symbols, and letters, and then write the result to a new output file.
Preferably, segmenting the processed text into Chinese characters means: after importing the codecs library, write the preprocessed output text to a new file in the form of one character followed by an English comma.
Preferably, forming the corpus means: under the R environment, use the VCorpus function in the "tm" package to form the corpus.
Preferably, removing the stop words from the corpus means: under the R environment, use the tm_map function in the "tm" package to delete the stop words. The stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary.
If the semantic relations between linguistic units are not considered, the two feature-extraction methods are similar from a statistical point of view. The experiments below show that, in commodity-description classification, character frequency as the feature item is clearly better than word frequency.
Brief description of the drawings
Fig. 1 is a diagram of the implementation process of the character-frequency text classification method of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawings, but the embodiments of the present invention are not limited thereto.
A character-frequency text classification method comprises the following: preprocess the input text; segment the processed text into Chinese characters; form a corpus; remove the stop words from the corpus; form a term-document matrix; train classifiers on the samples (the sample here refers to the term-document matrix); and calculate the recall rate of the character-frequency features.
The process of the method is now described in detail:
The input texts serve as the raw data: 90 samples in total, divided into three classes: clothing, books, and cosmetics. The data have two columns: the first column is the commodity category and the second column is the commodity description, as shown in Table 1 below:
Table 1
Preprocess the input text: under the R environment, load the raw data file, remove digits, symbols, and letters, and then write the result to a new output file.
# The result is shown in Table 2:
Table 2
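The patent performs this cleaning step in R; the following Python sketch (an illustrative assumption, not the original code) shows the same idea of stripping digits, symbols, and Latin letters so that only Chinese characters remain:

```python
import re

def preprocess(text: str) -> str:
    # Keep only CJK unified ideographs; digits, punctuation, whitespace,
    # and Latin letters are all removed, as the patent's preprocessing does.
    return "".join(re.findall(r"[\u4e00-\u9fff]", text))

print(preprocess("Nike运动鞋, size 42!"))  # -> 运动鞋
```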
Segment the processed text into Chinese characters. Because Python is very convenient for Chinese text processing, the text segmentation part is carried out in the Python environment: import the codecs library, then write the file output by the previous step to a new file in the form of one character followed by an English comma.
# The result is shown in Table 3:
Table 3
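The character-segmentation step writes one character plus an English comma; a minimal Python sketch of that format (the helper name is hypothetical):

```python
def segment_chars(text: str) -> str:
    # Join the individual Chinese characters of a cleaned text with
    # English commas, matching the file format described in the patent.
    return ",".join(text)

print(segment_chars("运动鞋"))  # -> 运,动,鞋
```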
Form the corpus: under the R environment, use the VCorpus function in the "tm" package to form the corpus.
Remove the stop words: under the R environment, use the tm_map function in the "tm" package to delete the stop words. The stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary, as shown in Table 4.
Table 4
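Removing stop words amounts to filtering the character stream against the stop list; a Python sketch with a tiny illustrative stop set (the patent's full list comes from the online Xinhua dictionary):

```python
def remove_stopwords(chars, stopset):
    # Drop any character that appears in the stop list (interjections,
    # measure words, conjunctions, pronouns, auxiliary words).
    return [c for c in chars if c not in stopset]

stopset = {"的", "了", "啊"}  # tiny illustrative stop set, not the patent's full list
print(remove_stopwords(list("很好的鞋子了"), stopset))  # -> ['很', '好', '鞋', '子']
```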
Form the term-document matrix: under the R environment, use the TermDocumentMatrix function in the "tm" package to form the term-document matrix. The matrix is built according to the vector space model (VSM), which represents the information of a text as a vector so that the text becomes a point in feature space; the text collection then forms a matrix in the vector space model, i.e., a set of points in feature space.
The term-document matrix is shown in Table 5:
Table 5
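The term-document matrix of Table 5 can be sketched in plain Python as a character-by-document count matrix (the patent builds it with tm's TermDocumentMatrix in R; this stdlib version is an illustrative stand-in):

```python
from collections import Counter

def term_document_matrix(docs):
    # Rows are characters, columns are documents; entry [i][j] counts how
    # often character i occurs in document j, as in tm's TermDocumentMatrix.
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[ch] for c in counts] for ch in vocab]

vocab, tdm = term_document_matrix(["运动鞋", "童装运动"])
row = dict(zip(vocab, tdm))
print(row["鞋"])  # -> [1, 0]: '鞋' occurs once in the first document only
```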
word_i is a feature item in the vector space model, and w_ij is the weight of the item.
The weight values in the model are obtained by the tf-idf weighting method.
Tf-idf is a statistical method for assessing how important a term is to a document in a document collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. tf denotes the frequency with which the term appears in document d. The main idea of idf is: the fewer the documents that contain term t, the larger the idf, and the better the class-discrimination ability of term t.
The tf-idf weight formula is:
tfidf(t_k, d_j) = tf(t_k, d_j) × log(|Tr| / #Tr(t_k))
where tf(t_k, d_j) denotes the frequency with which keyword t_k appears in document d_j, |Tr| is the total number of documents in the data collection, and #Tr(t_k) is the number of documents containing keyword t_k;
here tf(t_k, d_j) = #(t_k, d_j), the number of times keyword t_k appears in document d_j.
Finally, cosine normalization gives the final weight value:
w_kj = tfidf(t_k, d_j) / sqrt(Σ_s tfidf(t_s, d_j)²)
This method expresses well how important a keyword is to a certain class of articles, and it is therefore widely adopted.
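The tf-idf weighting with cosine normalization described above can be sketched as follows (raw counts as tf and the natural logarithm are assumptions; the patent computes this inside R's tm package):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    # tfidf(t, d) = tf(t, d) * log(|Tr| / #Tr(t)), followed by cosine
    # normalization: each document vector is divided by its Euclidean norm.
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter(t for c in counts for t in c)  # #Tr(t): documents containing t
    weights = []
    for c in counts:
        w = {t: c[t] * math.log(n / df[t]) for t in c}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})
    return weights

w = tfidf_weights(["运动鞋", "运动装", "图书"])
# In the first document, the rarer character '鞋' outweighs the shared '运'.
```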
Train the classifiers: in the R environment, the functions in the "e1071" and "RTextTools" packages are used to train seven classifiers: naive Bayes, bagging, boosting, artificial neural network, random forest, support vector machine, and decision tree. Because the sample is small, all 90 samples are used as both the training set and the test set.
Calculate the recall rate: the recall rate reflects the accuracy of the classification; its formula is: recall rate = number of correctly classified texts / number of actual texts.
# The calculated results are shown in Table 6:
Table 6
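The recall computation itself is a one-liner; a sketch matching the patent's formula (correctly classified texts divided by actual texts):

```python
def recall_rate(predicted, actual):
    # Recall as defined in the patent: correctly classified / actual texts.
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

print(recall_rate(["服装", "图书", "美妆", "图书"],
                  ["服装", "图书", "美妆", "服装"]))  # -> 0.75
```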
Word-frequency comparison
Word segmentation
Under the R environment, the jiebaR package is used to segment the samples into words. The segmentation algorithm combines a hidden Markov model with maximum probability.
# The word segmentation results are as follows:
# After removing stop words:
Comparison of results
# The character-frequency recall rates are as follows:
# The word-frequency recall rates are as follows:
The results of this experiment show that, although random forest (RF) and the combined Bagging and Boosting classifier algorithms still classify fairly well when word frequency is used as the feature, the comparison makes it clear that for most classification algorithms character frequency performs much better than word frequency; with random forest (RF), artificial neural network (NNET), and the combined Bagging and Boosting classifier algorithms, the recall rates all reach 100%. This proves that, for commodity descriptions, character frequency is a better feature than word frequency.
The embodiments of the invention described above do not limit the scope of the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.
Claims (5)
1. A character-frequency text classification method, characterized by comprising the following:
preprocess the input text; segment the processed text into Chinese characters; form a corpus; remove the stop words from the corpus; form a term-document matrix; train classifiers on the term-document matrix; and calculate the recall rate of the character-frequency features, where the recall rate is calculated as:
recall rate = number of correctly classified texts / number of actual texts;
the process of forming the term-document matrix is as follows:
under the R environment, the TermDocumentMatrix function in the "tm" package is used to form the term-document matrix; the matrix is built according to the vector space model, which represents the information of a text as a vector so that the text becomes a point in feature space; the text collection then forms a matrix in the vector space model, i.e., a set of points in feature space;
word_i is a feature item in the vector space model, and w_ij is the weight of the feature item;
the feature-item weight values in the model are obtained by the tf-idf weighting method;
the tf-idf weight formula is:
tfidf(t_k, d_j) = tf(t_k, d_j) × log(|Tr| / #Tr(t_k))
where tf(t_k, d_j) denotes the frequency with which keyword t_k appears in document d_j, |Tr| is the total number of documents in the data collection, and #Tr(t_k) is the number of documents containing keyword t_k;
here tf(t_k, d_j) = #(t_k, d_j), the number of times keyword t_k appears in document d_j;
finally, cosine normalization gives the final weight value:
w_kj = tfidf(t_k, d_j) / sqrt(Σ_s tfidf(t_s, d_j)²)
2. The method according to claim 1, characterized in that preprocessing the input text means: under the R environment, load the raw data file, remove digits, symbols, and letters, and then write the result to a new output file.
3. The method according to claim 1, characterized in that segmenting the processed text into Chinese characters means: after importing the codecs library, write the preprocessed output text to a new file in the form of one character followed by an English comma.
4. The method according to claim 1, characterized in that forming the corpus means: under the R environment, use the VCorpus function in the "tm" package to form the corpus.
5. The method according to claim 1, characterized in that removing the stop words from the corpus means: under the R environment, use the tm_map function in the "tm" package to delete the stop words, where the stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610698064.2A CN106372640A (en) | 2016-08-19 | 2016-08-19 | Character frequency text classification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106372640A true CN106372640A (en) | 2017-02-01 |
Family
ID=57878340
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610698064.2A Pending CN106372640A (en) | 2016-08-19 | 2016-08-19 | Character frequency text classification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106372640A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491541A (en) * | 2017-08-24 | 2017-12-19 | 北京丁牛科技有限公司 | File classification method and device |
CN107609113A (en) * | 2017-09-13 | 2018-01-19 | 北京科技大学 | A kind of Automatic document classification method |
CN109241276A (en) * | 2018-07-11 | 2019-01-18 | 河海大学 | Word's kinds method, speech creativeness evaluation method and system in text |
CN109712680A (en) * | 2019-01-24 | 2019-05-03 | 易保互联医疗信息科技(北京)有限公司 | Medical data generation method and system based on HL7 specification |
CN109840281A (en) * | 2019-02-27 | 2019-06-04 | 浪潮软件集团有限公司 | A kind of self study intelligent decision method based on random forests algorithm |
CN110427959A (en) * | 2019-06-14 | 2019-11-08 | 合肥工业大学 | Complain classification method, system and the storage medium of text |
CN115455987A (en) * | 2022-11-14 | 2022-12-09 | 合肥高维数据技术有限公司 | Character grouping method based on word frequency and word frequency, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101021838A (en) * | 2007-03-02 | 2007-08-22 | 华为技术有限公司 | Text handling method and system |
CN102194013A (en) * | 2011-06-23 | 2011-09-21 | 上海毕佳数据有限公司 | Domain-knowledge-based short text classification method and text classification system |
CN104391835A (en) * | 2014-09-30 | 2015-03-04 | 中南大学 | Method and device for selecting feature words in texts |
CN105045913A (en) * | 2015-08-14 | 2015-11-11 | 北京工业大学 | Text classification method based on WordNet and latent semantic analysis |
CN105045812A (en) * | 2015-06-18 | 2015-11-11 | 上海高欣计算机***有限公司 | Text topic classification method and system |
Non-Patent Citations (2)
Title |
---|
MEI JUN: "Research and Application of Chinese Text Classification", China Master's Theses Full-text Database, Information Science and Technology *
WANG MENGYUN et al.: "An Automatic Chinese Text Categorization System Based on Character Frequency Vectors", Journal of the China Society for Scientific and Technical Information *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107491541A (en) * | 2017-08-24 | 2017-12-19 | 北京丁牛科技有限公司 | File classification method and device |
CN107491541B (en) * | 2017-08-24 | 2021-03-02 | 北京丁牛科技有限公司 | Text classification method and device |
CN107609113A (en) * | 2017-09-13 | 2018-01-19 | 北京科技大学 | A kind of Automatic document classification method |
CN109241276A (en) * | 2018-07-11 | 2019-01-18 | 河海大学 | Word's kinds method, speech creativeness evaluation method and system in text |
CN109241276B (en) * | 2018-07-11 | 2022-03-08 | 河海大学 | Word classification method in text, and speech creativity evaluation method and system |
CN109712680A (en) * | 2019-01-24 | 2019-05-03 | 易保互联医疗信息科技(北京)有限公司 | Medical data generation method and system based on HL7 specification |
CN109712680B (en) * | 2019-01-24 | 2021-02-09 | 易保互联医疗信息科技(北京)有限公司 | Medical data generation method and system based on HL7 standard |
CN109840281A (en) * | 2019-02-27 | 2019-06-04 | 浪潮软件集团有限公司 | A kind of self study intelligent decision method based on random forests algorithm |
CN110427959A (en) * | 2019-06-14 | 2019-11-08 | 合肥工业大学 | Complain classification method, system and the storage medium of text |
CN115455987A (en) * | 2022-11-14 | 2022-12-09 | 合肥高维数据技术有限公司 | Character grouping method based on word frequency and word frequency, storage medium and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106372640A (en) | Character frequency text classification method | |
CN108509629B (en) | Text emotion analysis method based on emotion dictionary and support vector machine | |
CN103207913B (en) | The acquisition methods of commercial fine granularity semantic relation and system | |
CN112287684B (en) | Short text auditing method and device for fusion variant word recognition | |
CN111125349A (en) | Graph model text abstract generation method based on word frequency and semantics | |
CN109241530A (en) | A kind of more classification methods of Chinese text based on N-gram vector sum convolutional neural networks | |
CN107229610A (en) | The analysis method and device of a kind of affection data | |
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN103995876A (en) | Text classification method based on chi square statistics and SMO algorithm | |
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN109271517B (en) | IG TF-IDF text feature vector generation and text classification method | |
Kruengkrai et al. | Language identification based on string kernels | |
CN102567308A (en) | Information processing feature extracting method | |
CN106599054A (en) | Method and system for title classification and push | |
WO2014022172A2 (en) | Information classification based on product recognition | |
CN109670014A (en) | A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning | |
US11960521B2 (en) | Text classification system based on feature selection and method thereof | |
CN110705247A (en) | Based on x2-C text similarity calculation method | |
CN115080973B (en) | Malicious code detection method and system based on multi-mode feature fusion | |
CN112527958A (en) | User behavior tendency identification method, device, equipment and storage medium | |
CN105912720B (en) | A kind of text data analysis method of emotion involved in computer | |
Huda et al. | A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer | |
CN106503153A (en) | A kind of computer version taxonomic hierarchies, system and its file classification method | |
CN111241271B (en) | Text emotion classification method and device and electronic equipment | |
CN113095723A (en) | Coupon recommendation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170201 |