CN106372640A - Character frequency text classification method - Google Patents


Info

Publication number
CN106372640A
CN106372640A
Authority
CN
China
Prior art keywords
text
word
vocabulary
corpus
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610698064.2A
Other languages
Chinese (zh)
Inventor
谭军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201610698064.2A priority Critical patent/CN106372640A/en
Publication of CN106372640A publication Critical patent/CN106372640A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a character frequency text classification method. The method comprises the following steps: preprocessing an input text; performing Chinese character segmentation on the processed text; forming a corpus; removing stop words from the corpus; forming a vocabulary-text matrix; training the sample with a classifier; and calculating the recall rate of the character frequencies, where the recall rate equals the number of correctly classified texts divided by the number of actual texts. The classification method has the following characteristic: character frequencies perform much better than word frequencies, and with random forest (RF), an artificial neural network (NNET), and the combined classifiers Bagging and Boosting, the recall rates even all reach 100%. This demonstrates that in commodity descriptions, character frequencies are more discriminative features than word frequencies.

Description

A character frequency text classification method
Technical field
The present invention relates to the field of text classification, and in particular to a character frequency text classification method.
Background technology
According to experimental results in traditional automatic text categorization, words are generally considered better feature items than characters and phrases. In commodity description classification, however, the situation differs because of the following exclusive characteristics:
1) Commodity description texts are typically rather short, so feature words are few and very sparse; information such as word frequency and word co-occurrence frequency cannot be fully exploited, and it is difficult to capture the relatedness between feature words in depth.
2) Commodity descriptions contain many abbreviations, substitute words, colloquialisms, and other feature words that are hard to distinguish, and their form of expression tends to be colloquial, so many word segmentation algorithms do not perform well on them.
The Chinese character is the most basic linguistic unit in Chinese, while the word is the smallest linguistic unit carrying semantics. In commodity descriptions, using words as feature items inevitably faces a complex word segmentation problem.
Content of the invention
The present invention proposes a character frequency text classification method that uses Chinese characters as feature items: the segmentation problem for characters is simpler than the segmentation problem for words, and the number of distinct Chinese characters is much smaller than the number of words, so the efficiency of feature extraction can be improved.
To achieve this goal, the technical scheme of the invention is as follows:
A character frequency text classification method, comprising the following:
Preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the vocabulary-text matrix (the sample) with a classifier; and calculate the recall rate of the character frequencies as:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$
The process of forming the vocabulary-text matrix is:
In the R environment, the TermDocumentMatrix function of the "tm" package is used to form the vocabulary-text matrix; the matrix is built according to the vector space model. The vector space model represents the information of a text with a vector, so that each text becomes a point in feature space; a text collection then forms a matrix in the vector space model, i.e., a set of points in feature space;
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of the feature item (i.e., its character frequency).
The feature-item (i.e., character frequency) weights in the model are obtained through the tf-idf weighting method;
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}.$$
Preferably, preprocessing the input text means: in the R environment, loading the raw data file, removing digits, symbols, and letters, and then writing the result to an output file.
Preferably, Chinese character segmentation of the processed text means: after importing the codecs library, writing the text output by preprocessing to a new file in the form of one character followed by an English comma.
Preferably, forming the corpus means: in the R environment, using the VCorpus function of the "tm" package to form the corpus.
Preferably, removing the stop words in the corpus means: in the R environment, using the tm_map function of the "tm" package to delete stop words, where the stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary and compiling them into a stop-word list.
If the semantic relations between linguistic units are not considered, the two feature extraction methods are similar from a statistical point of view. The experiments below show that in commodity description classification, character frequency is clearly better than word frequency as the feature item.
Brief description
Fig. 1 is a flow chart of the character frequency text classification method of the present invention.
Specific embodiment
The present invention is further described below with reference to the accompanying drawing, but embodiments of the present invention are not limited thereto.
A character frequency text classification method comprises the following: preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the sample (i.e., the vocabulary-text matrix) with a classifier; and calculate the recall rate of the character frequencies.
The process of the method is now described in detail:
The input texts serve as the raw data: 90 samples in total, divided into three classes, namely clothing, books, and cosmetics. The data has two columns: the first column is the commodity category and the second column is the commodity description, as in Table 1 below:
Table 1
Preprocess the input text: in the R environment, load the raw data file, remove digits, symbols, and letters, and then write the result to an output file.
The result is shown in Table 2:
Table 2
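The patent performs this cleaning step in R; purely as an illustration, an equivalent Python sketch that keeps only Chinese characters (dropping digits, symbols, and Latin letters) could look like this. The function name and regular expression are assumptions, not the inventor's code.

```python
import re

def preprocess(text: str) -> str:
    """Keep only CJK Unified Ideographs; drop digits, symbols, and letters."""
    return "".join(re.findall(r"[\u4e00-\u9fff]", text))

# A commodity description with brand names, digits, and punctuation:
print(preprocess("Nike男鞋2016新款, 跑步鞋!"))  # -> 男鞋新款跑步鞋
```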
Perform Chinese character segmentation on the processed text. Because Python is very convenient for Chinese text processing, the text segmentation part is carried out in the Python environment: after importing the codecs library, the file output by the previous step is written to a new file in the form of one character followed by an English comma.
The result is shown in Table 3:
Table 3
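The character-segmentation step above (each character followed by an English comma, written to a new file) can be sketched in Python as follows; the helper names and file paths are illustrative assumptions, not the inventor's script.

```python
import codecs

def join_chars(line: str) -> str:
    """Turn one line of text into comma-separated single characters."""
    return ",".join(line.strip())

def segment_file(in_path: str, out_path: str) -> None:
    """Rewrite each line of the preprocessed file in character-comma form."""
    with codecs.open(in_path, "r", encoding="utf-8") as fin, \
         codecs.open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(join_chars(line) + "\n")

print(join_chars("男鞋新款"))  # -> 男,鞋,新,款
```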
Form the corpus: in the R environment, use the VCorpus function of the "tm" package to form the corpus.
Remove stop words: in the R environment, use the tm_map function of the "tm" package to delete stop words.
The stop-word list is composed as follows: the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words are downloaded from the online Xinhua dictionary and compiled into a stop-word list, as in Table 4.
Table 4
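The patent removes stop words with tm_map in R; a character-level equivalent in Python might look like the following sketch, where the sample stop-word set is a made-up stand-in for the list compiled from the online Xinhua dictionary.

```python
def remove_stopwords(chars, stopwords):
    """Drop any character that appears in the stop-word set."""
    return [c for c in chars if c not in stopwords]

stopwords = {"的", "了", "个"}  # illustrative function words only
print(remove_stopwords(list("新款的跑步鞋"), stopwords))
# -> ['新', '款', '跑', '步', '鞋']
```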
Form the vocabulary-text matrix: in the R environment, use the TermDocumentMatrix function of the "tm" package to form the vocabulary-text matrix. The matrix is built according to the vector space model (VSM), which represents the information of a text with a vector so that each text becomes a point in feature space; the text collection then forms a matrix in the vector space model, i.e., a set of points in feature space.
The vocabulary-text matrix (term-document matrix) is shown in Table 5:
Table 5
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of that item.
The weights in the model are obtained by the tf-idf weighting method.
Tf-idf is a statistical method for assessing how important a term is to a document in a collection or corpus. The importance of a term increases in proportion to the number of times it appears in the document, but at the same time decreases in proportion to its frequency across the corpus. tf denotes the frequency with which a term occurs in document d. The main idea of idf is: the fewer the documents that contain term t, the larger the idf, which indicates that term t has good class discrimination ability.
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}$$
This method expresses well the importance of a keyword to a certain class of articles and is therefore widely adopted.
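The tf-idf weighting with cosine normalization described above can be sketched in plain Python as follows (a direct transcription of the formulas, not the patent's R code; documents are assumed to be lists of characters):

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """docs: list of character lists. Returns one dict per document
    mapping each character to its cosine-normalized tf-idf weight."""
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter()                       # document frequency of each character
    for c in counts:
        df.update(c.keys())
    weights = []
    for c in counts:
        w = {}
        for t, k in c.items():
            tf = 1 + math.log(k)         # k > 0 by construction
            w[t] = tf * math.log(n / df[t])
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weights.append({t: v / norm for t, v in w.items()})
    return weights

w = tfidf_matrix([list("男鞋跑步"), list("女鞋凉鞋"), list("小说图书")])
```

Characters that occur in every document receive weight zero (idf = 0), while rare characters dominate the normalized vector.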
Train the classifiers: in the R environment, the functions in the "e1071" and "RTextTools" packages are used to train seven classifiers: naive Bayes, bagging, boosting, artificial neural network, random forest, support vector machine, and decision tree. Because the sample is small, all 90 samples are used as both training set and test set.
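The patent trains its seven classifiers through R's "e1071" and "RTextTools" packages. Purely as a stand-in illustration of training a classifier on character-count features, here is a tiny multinomial naive Bayes in standard-library Python (one of the seven classifier types named above, but not the patent's implementation; the toy data and labels are made up):

```python
import math
from collections import Counter, defaultdict

class CharNaiveBayes:
    """Multinomial naive Bayes over character counts, add-one smoothing."""
    def fit(self, docs, labels):
        self.vocab = {c for d in docs for c in d}
        self.prior = Counter(labels)                 # class frequencies
        self.counts = defaultdict(Counter)           # class -> char counts
        for d, y in zip(docs, labels):
            self.counts[y].update(d)
        self.total = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, doc):
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y in self.prior:
            lp = math.log(self.prior[y])
            for ch in doc:
                lp += math.log((self.counts[y][ch] + 1) / (self.total[y] + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

docs = [list("男鞋跑步鞋"), list("连衣裙女装"), list("小说图书"), list("历史图书")]
labels = ["clothes", "clothes", "books", "books"]
clf = CharNaiveBayes().fit(docs, labels)
print(clf.predict(list("图书")))  # -> books
```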
Calculate the recall rate: the recall rate reflects the accuracy of the classification. Its formula is:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$

The calculation results are shown in Table 6:
Table 6
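The recall formula above (correctly classified texts divided by actual texts, computed per class) takes only a few lines of Python; the class labels here are illustrative.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """recall = correctly classified texts / actual texts, for each class."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / actual[c] for c in actual}

y_true = ["clothes", "books", "makeup", "clothes"]
y_pred = ["clothes", "books", "clothes", "clothes"]
print(recall_per_class(y_true, y_pred))
# -> {'clothes': 1.0, 'books': 1.0, 'makeup': 0.0}
```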
Comparison with word frequency
Word segmentation
In the R environment, the jiebaR package is used to perform word segmentation on the samples. The segmentation algorithm combines a hidden Markov model with the maximum-probability model.
The word segmentation result is as follows:
The result after removing stop words is as follows:
Result comparison
The recall rates using word frequency are as follows:
The recall rates using character frequency are as follows:
From the results of this experiment: even when word frequency is used as the feature, the classification results of random forest (RF) and of the combined classifiers bagging and boosting are still fairly good. The contrast makes clear, however, that for most classification algorithms the effect of character frequency is much better than that of word frequency; with random forest (RF), artificial neural network (NNET), and the combined classifiers bagging and boosting, the recall rates even all reach 100%. This proves that in commodity descriptions, character frequency is a more discriminative feature than word frequency.
The embodiments of the invention described above do not limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within the protection scope of the claims of the present invention.

Claims (5)

1. A character frequency text classification method, characterized by comprising the following:
Preprocess the input text; perform Chinese character segmentation on the processed text; form a corpus; remove the stop words in the corpus; form a vocabulary-text matrix; train the vocabulary-text matrix with a classifier; and calculate the recall rate of the character frequencies as:

$$\text{recall} = \frac{\text{number of correctly classified texts}}{\text{number of actual texts}}$$
The process of forming the vocabulary-text matrix is:
In the R environment, the TermDocumentMatrix function of the "tm" package is used to form the vocabulary-text matrix; the matrix is built according to the vector space model, which represents the information of a text with a vector so that each text becomes a point in feature space, and a text collection forms a matrix in the vector space model, i.e., a set of points in feature space;
$word_i$ is a feature item in the vector space model, and $w_{ij}$ is the weight of the feature item.
The feature-item weights in the model are obtained through the tf-idf weighting method;
The tf-idf weighting formula is:

$$\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log\frac{|T_r|}{\#T_r(t_k)}$$

where $\mathrm{tf}(t_k, d_j)$ denotes the frequency with which keyword $t_k$ appears in document $d_j$, $|T_r|$ is the total number of documents in the collection, and $\#T_r(t_k)$ is the number of documents containing keyword $t_k$;

with

$$\mathrm{tf}(t_k, d_j) = \begin{cases} 1 + \log\#(t_k, d_j) & \text{if } \#(t_k, d_j) > 0 \\ 0 & \text{otherwise} \end{cases}$$

where $\#(t_k, d_j)$ is the number of times keyword $t_k$ appears in document $d_j$;

Finally, cosine normalization gives the final weight:

$$w_{kj} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{s=1}^{|T|}\mathrm{tfidf}(t_s, d_j)^2}}.$$
2. The method according to claim 1, characterized in that preprocessing the input text means: in the R environment, loading the raw data file, removing digits, symbols, and letters, and then writing the result to an output file.
3. The method according to claim 1, characterized in that Chinese character segmentation of the processed text means: after importing the codecs library, writing the text output by preprocessing to a new file in the form of one character followed by an English comma.
4. The method according to claim 1, characterized in that forming the corpus means: in the R environment, using the VCorpus function of the "tm" package to form the corpus.
5. The method according to claim 1, characterized in that removing the stop words in the corpus means: in the R environment, using the tm_map function of the "tm" package to delete stop words, where the stop-word list is composed by downloading the complete sets of interjections, measure words, conjunctions, pronouns, and auxiliary words from the online Xinhua dictionary and compiling them into a stop-word list.
CN201610698064.2A 2016-08-19 2016-08-19 Character frequency text classification method Pending CN106372640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610698064.2A CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610698064.2A CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Publications (1)

Publication Number Publication Date
CN106372640A true CN106372640A (en) 2017-02-01

Family

ID=57878340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610698064.2A Pending CN106372640A (en) 2016-08-19 2016-08-19 Character frequency text classification method

Country Status (1)

Country Link
CN (1) CN106372640A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107609113A (en) * 2017-09-13 2018-01-19 北京科技大学 A kind of Automatic document classification method
CN109241276A (en) * 2018-07-11 2019-01-18 河海大学 Word's kinds method, speech creativeness evaluation method and system in text
CN109712680A (en) * 2019-01-24 2019-05-03 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 specification
CN109840281A (en) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 A kind of self study intelligent decision method based on random forests algorithm
CN110427959A (en) * 2019-06-14 2019-11-08 合肥工业大学 Complain classification method, system and the storage medium of text
CN115455987A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机***有限公司 Text topic classification method and system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101021838A (en) * 2007-03-02 2007-08-22 华为技术有限公司 Text handling method and system
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN104391835A (en) * 2014-09-30 2015-03-04 中南大学 Method and device for selecting feature words in texts
CN105045812A (en) * 2015-06-18 2015-11-11 上海高欣计算机***有限公司 Text topic classification method and system
CN105045913A (en) * 2015-08-14 2015-11-11 北京工业大学 Text classification method based on WordNet and latent semantic analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
梅君 (Mei Jun): "中文文本分类的研究与应用" (Research and application of Chinese text classification), China Master's Theses Full-text Database, Information Science and Technology Series *
王梦云 et al. (Wang Mengyun et al.): "基于字频向量的中文文本自动分类***" (Automatic Chinese text classification based on character-frequency vectors), 《情报学报》 (Journal of the China Society for Scientific and Technical Information) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491541A (en) * 2017-08-24 2017-12-19 北京丁牛科技有限公司 File classification method and device
CN107491541B (en) * 2017-08-24 2021-03-02 北京丁牛科技有限公司 Text classification method and device
CN107609113A (en) * 2017-09-13 2018-01-19 北京科技大学 A kind of Automatic document classification method
CN109241276A (en) * 2018-07-11 2019-01-18 河海大学 Word's kinds method, speech creativeness evaluation method and system in text
CN109241276B (en) * 2018-07-11 2022-03-08 河海大学 Word classification method in text, and speech creativity evaluation method and system
CN109712680A (en) * 2019-01-24 2019-05-03 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 specification
CN109712680B (en) * 2019-01-24 2021-02-09 易保互联医疗信息科技(北京)有限公司 Medical data generation method and system based on HL7 standard
CN109840281A (en) * 2019-02-27 2019-06-04 浪潮软件集团有限公司 A kind of self study intelligent decision method based on random forests algorithm
CN110427959A (en) * 2019-06-14 2019-11-08 合肥工业大学 Complain classification method, system and the storage medium of text
CN115455987A (en) * 2022-11-14 2022-12-09 合肥高维数据技术有限公司 Character grouping method based on word frequency and word frequency, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN106372640A (en) Character frequency text classification method
CN108509629B (en) Text emotion analysis method based on emotion dictionary and support vector machine
CN103207913B (en) The acquisition methods of commercial fine granularity semantic relation and system
CN112287684B (en) Short text auditing method and device for fusion variant word recognition
CN111125349A (en) Graph model text abstract generation method based on word frequency and semantics
CN109241530A (en) A kind of more classification methods of Chinese text based on N-gram vector sum convolutional neural networks
CN107229610A (en) The analysis method and device of a kind of affection data
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN109271517B (en) IG TF-IDF text feature vector generation and text classification method
Kruengkrai et al. Language identification based on string kernels
CN102567308A (en) Information processing feature extracting method
CN106599054A (en) Method and system for title classification and push
WO2014022172A2 (en) Information classification based on product recognition
CN109670014A (en) A kind of Authors of Science Articles name disambiguation method of rule-based matching and machine learning
US11960521B2 (en) Text classification system based on feature selection and method thereof
CN110705247A (en) Based on x2-C text similarity calculation method
CN115080973B (en) Malicious code detection method and system based on multi-mode feature fusion
CN112527958A (en) User behavior tendency identification method, device, equipment and storage medium
CN105912720B (en) A kind of text data analysis method of emotion involved in computer
Huda et al. A multi-label classification on topics of quranic verses (english translation) using backpropagation neural network with stochastic gradient descent and adam optimizer
CN106503153A (en) A kind of computer version taxonomic hierarchies, system and its file classification method
CN111241271B (en) Text emotion classification method and device and electronic equipment
CN113095723A (en) Coupon recommendation method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170201

RJ01 Rejection of invention patent application after publication