CN106570076A - Computer text classification system - Google Patents

Computer text classification system

Info

Publication number
CN106570076A
CN106570076A
Authority
CN
China
Prior art keywords
word
text
classification
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610905152.5A
Other languages
Chinese (zh)
Inventor
何正娣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen University
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University filed Critical Shenzhen University
Priority to CN201610905152.5A priority Critical patent/CN106570076A/en
Publication of CN106570076A publication Critical patent/CN106570076A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a computer text classification system. The system comprises a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module. Based on information theory, the system further refines the classification process, so that the function of each module in the classification system is made clear and both classification efficiency and classification processing rate are ensured; the added effect improvement module improves the correctness of classification.

Description

Computer text classification system
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a text classification system.
Background technology
Text classification serves as the technical foundation of fields such as information filtering, information retrieval, text databases, digital libraries, and mail classification, and is therefore widely applied. The development and spread of networks has greatly facilitated our access to information, but the sheer volume of information also makes it difficult to process: required information cannot be obtained quickly, and unwanted information arrives alongside it. Information filtering technology can be used to solve these problems. The essence of information filtering is a classification problem: it can be used both to filter out information a user dislikes and to filter in information the user is interested in. Existing text classification systems process text too inefficiently, classify poorly, and have too high an error rate.
The main object of the present invention is to provide a text classification system with high efficiency, a high classification processing rate, and high accuracy.
The content of the invention
In view of this, the technical problem to be solved by the present invention is to provide a method of selecting text classification features, in order to solve the previously insurmountable problems set forth above.
To achieve the above effect, the technical scheme of the present invention is as follows: a computer text classification system comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module;
The text preprocessing module mainly preprocesses text: it first uses word segmentation software to break the input text apart, removes punctuation marks and spaces, and divides the text into a word set; the word set is then further processed to remove meaningless words, forming a simplified word set;
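The preprocessing step above can be sketched in Python (an illustrative sketch only: the function name, the regular-expression tokenizer, and the stop-word list are hypothetical stand-ins; the patent's system uses dedicated word segmentation software, e.g. for Chinese text):

```python
import re

# Hypothetical stop-word list standing in for the "meaningless words"
# that the module removes; a real system would use a curated list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is"}

def preprocess(text):
    """Break text into words, drop punctuation and whitespace, remove
    stop words, and return the simplified word set."""
    # Keeping only alphanumeric runs removes punctuation marks and spaces.
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOP_WORDS}
```

Returning a set rather than a list matches the patent's "word set" terminology: duplicates are collapsed, and only membership matters for the later feature selection stage.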
The text feature extraction module is mainly based on a feature selection method. It first generates feature word subsets from the simplified word set; generating a feature word subset is a continuous search process, and the search uses a branch-and-bound algorithm. Each generated feature word subset is then evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, which is compared with a stopping threshold: if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, producing new feature word subsets through evaluation and filtering. The frequency with which each feature word occurs is calculated using a mutual information method and, combining the occurrence frequencies of the feature words, a mapping table between feature words and frequencies is obtained;
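The mutual-information scoring that this module relies on can be sketched as follows (an illustrative Python sketch computing the mutual information between a word's presence and the class labels; the function name and the presence/absence approximation are assumptions, not the patented algorithm):

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information between the event "word occurs in the document"
    and the document's class label, estimated from counts."""
    n = len(docs)
    joint = Counter()                      # (word_present, label) counts
    for doc, label in zip(docs, labels):
        joint[(word in doc, label)] += 1
    p_word = Counter(word in doc for doc in docs)
    p_label = Counter(labels)
    mi = 0.0
    for (present, label), c in joint.items():
        p_xy = c / n
        p_x = p_word[present] / n
        p_y = p_label[label] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi
```

A word that appears only in one class scores high (it carries much category information), while a word distributed independently of the labels scores zero, which matches the discrimination-degree discussion in embodiment one.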
The text training module processes the mapping table between feature words and frequencies: it randomly selects other texts, calculates the inverse document frequency, and uses the inverse document frequency as input to compute the weight value of each feature word through a trained classifier, thereby obtaining a term weight matrix;
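A minimal sketch of the inverse-document-frequency and term-weight-matrix computation (TF-IDF weighting is assumed here as the concrete scheme; the patent names only inverse document frequency and a trained classifier, so the function names and the smoothing constant are illustrative):

```python
import math

def idf(term, corpus):
    """Inverse document frequency of a term over a reference corpus of
    word sets; the +1 in the denominator avoids division by zero."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

def weight_matrix(features, docs, corpus):
    """Term weight matrix: one row per document (a token list), one
    column per feature word, entries are term frequency times IDF."""
    return [[doc.count(t) * idf(t, corpus) for t in features]
            for doc in docs]
```

The "randomly selected other texts" of the patent play the role of `corpus` here: they supply the document frequencies against which each feature word's rarity is measured.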
The classification module uses the term weight matrix and an SVM classification algorithm to set classification intervals and classify the words, obtaining a word category vector set in which the words of one category belong to the same vector;
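The SVM classification step can be illustrated with a toy linear classifier trained on the hinge loss by sub-gradient descent (a sketch only; the function and its parameters are hypothetical, and a real system would use a dedicated SVM implementation):

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Toy linear SVM: sub-gradient descent on the L2-regularized hinge
    loss. X is a list of feature vectors, y a list of +1/-1 labels."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:
                # Inside the margin: step toward the sample, with decay.
                w = [wj + lr * (yi * xj - lam * wj)
                     for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correctly classified with margin: regularization only.
                w = [wj - lr * lam * wj for wj in w]

    def predict(x):
        return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
    return predict
```

The decision boundary plays the role of the "classification interval" of the module: items on one side of the learned hyperplane are assigned to one category vector, items on the other side to the other.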
The text category marking module is used to mark the word category vector set: it builds a category mark table for the words, uses special symbol values in the table as the mark values of the word categories, and adds the mark values to the word category vector set to obtain a marked word category vector set;
The effect improvement module performs error statistics on the marked word category vector set. The statistics process is a random sampling process: the vectors are sorted according to the distribution law of the words, with the top 30% of the ranking treated as the emphasis region for sampling. The classification effect of the extracted samples is tested and adjusted using a redundancy parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and control returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
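The error-statistics and adjustment loop of the effect improvement module can be sketched as follows (illustrative Python; the sample structure, the classify and adjust_threshold callbacks, and the accuracy floor are all assumed interfaces, not specified by the patent):

```python
def improvement_loop(samples, classify, adjust_threshold,
                     max_adjustments=5, accuracy_floor=0.9):
    """Sketch of the effect improvement module: sort samples by word
    frequency, test classification accuracy on the top-30% emphasis
    region, and keep adjusting until the adjustment count stays in a
    'safe' range.  Returns the number of adjustments performed."""
    adjustments = 0
    while adjustments < max_adjustments:
        ranked = sorted(samples, key=lambda s: s["frequency"],
                        reverse=True)
        focus = ranked[: max(1, int(len(ranked) * 0.3))]  # emphasis region
        correct = sum(classify(s["vector"]) == s["label"] for s in focus)
        if correct / len(focus) >= accuracy_floor:
            return adjustments          # effect acceptable, stop here
        adjust_threshold()              # tweak the extraction threshold
        adjustments += 1
    return adjustments                  # too many adjustments: redo features
```

When the returned count hits `max_adjustments`, the caller would fall back to the text feature extraction module with a modified threshold, as the patent describes.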
Description of the drawings
Fig. 1 is a structural diagram of the text classification system.
Specific embodiment
In order to make the technical problems to be solved, the technical schemes, and the beneficial effects of the present invention clearer, the present invention is described in detail below with reference to the drawings and embodiments. It should be noted that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it; any product that can realize the described functions is an equivalent improvement and falls within the protection scope of the present invention. The specific methods are as follows:
Embodiment one: the conventional approach to text classification is feature selection. Useful features have strong class discrimination ability and occur concentrated in one class or a few classes. Therefore, when certain words appear in a text, it can be determined with great confidence which class or classes the text belongs to. Different features express the category of a text with different strength, and feature selection is precisely the task of selecting the features with high discriminative power for the text's category. This ability is called the text category discrimination degree, that is, a feature's ability to distinguish categories.
The text category discrimination degree of a feature is the amount of category information the feature contains: the more text category information it carries, the greater its discrimination degree, and conversely the smaller. Intuitively, it is the ability to judge a document's category by whether the feature occurs in the document. Specifically, if whether a feature appears in a document has no effect at all on judging the document's category, then the feature is meaningless for text classification, its category discrimination ability is very small, and feature selection should assign it a very small evaluation value.
The greater a feature's text category discrimination degree, the more category information it carries; whether a feature has a large class discrimination degree therefore determines how useful it is for classification. For example, a feature word that appears only in the medical category has maximal category discrimination. However, if such a feature is a rare word that seldom occurs in text, selecting it as a feature means it never appears in other texts, i.e. the feature vectors of all other texts are zero in that dimension. This leads to the sparse-matrix problem frequently encountered in text classification and hence to overfitting; the phenomenon arises from over-dependence on the training set. Therefore, when selecting text features, high-frequency words should be chosen as far as possible, i.e. the generality of the feature should be considered; this is called the generalization ability of the feature. Clearly, features with large category discrimination and strong generalization ability are the most useful for classification, while features with small discrimination and weak generalization ability are the least useful. Features with large class discrimination but weak generalization ability are more useful in high dimensions, because having enough features avoids the data sparsity problem and guarantees recall, while more discriminative features improve classification accuracy. Features with small class discrimination but strong generalization ability are more useful in low dimensions, because they still provide coverage under data sparsity even though their ability to distinguish categories is not strong.
Embodiment two: finding the right redundancy parameter value can optimize the text classification effect. For the whole text classification system, the algorithm considers different redundancy parameter values and estimates the classification effect index of each value on the training set. Using cross-validation, estimates of the mean and variance of the evaluation index are obtained, together with the statistical significance between two systems. The optimal redundancy parameter value is the one that achieves the highest statistical significance relative to A = 0 (i.e. information gain feature selection). For each redundancy parameter, cross-validation is carried out on the text classification system using the training set: the training set is randomly divided into several parts, one part is selected as the validation test set, and the rest serve as the validation training set. The system composed of feature selection and the classification algorithm is trained on the validation training set, the text classification effect is then measured on the validation test set, and the evaluation index is computed. The parts are cycled proportionally through the roles of training set and test set to repeat the experiment. For example, in four-fold cross-validation the training set is divided equally into four parts: the first time, subsets 1, 2, and 3 form the validation training set and subset 4 is the validation test set; the next time, subsets 1, 2, and 4 form the training set and subset 3 is the test set.
For different redundancy parameter values, the mean and variance of the experimental effect index are examined. From the experiments in this work, four-fold cross-validation is sufficient: using more folds does not noticeably strengthen the effect, and although more folds give better estimates of the mean and variance, they also increase training-time consumption. Considering two redundancy parameters, cross-validation is carried out on each, yielding one group of effect-index data apiece, and a significance test is defined to decide whether one redundancy parameter gives a better effect than the other. Starting from an initial value, the algorithm tests different redundancy parameters until an optimum is found: when a redundancy parameter value has the highest statistical significance, it is taken as the optimum.
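The four-fold cross-validation procedure described above can be sketched as follows (illustrative Python; the fold construction and the train_and_score callback are assumptions about interfaces the patent leaves unspecified):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly split n sample indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(samples, k, train_and_score):
    """Each fold serves once as the validation test set while the
    remaining folds train; returns the mean and variance of the
    per-fold effect index."""
    folds = k_fold_indices(len(samples), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train = [samples[j]
                 for f, fold in enumerate(folds) if f != i
                 for j in fold]
        test = [samples[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    mean = sum(scores) / k
    var = sum((s - mean) ** 2 for s in scores) / k
    return mean, var
```

Comparing two redundancy parameters then amounts to running `cross_validate` once per parameter and applying a significance test to the two (mean, variance) pairs.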
From the angle of information theory, the feature selection process is exactly the process of selecting, from the candidate feature set, the feature subset with the maximal mutual information with the class labels. On this basis, approximate algorithms for four mutual-information feature selection methods are compared. Because all text classification algorithms except information gain require a redundancy parameter, a wrapper-style algorithm is needed to search for the best redundancy parameter value. The comparison shows that, relative to using the fixed value 0.4 for the redundancy parameter, there is improvement to varying degrees in all cases, and in some cases the improvement is obvious. This further demonstrates that the redundancy parameter and the number of selected features are related to the classification algorithm used.
Embodiment three: according to evaluation experiments on existing text classification algorithms, Rocchio, KNN, and SVM stand out. KNN and SVM achieve higher classification accuracy, while the Rocchio method has the lowest time and space complexity; this work therefore mainly analyzes and compares these three methods.
The Rocchio classifier is a method based on the vector space model and minimum distance. Its greatest feature is a good feedback function: the classification vector space can be corrected according to its formula. The method was first proposed by Hull in 1994 and has been widely applied since. The Rocchio formula is:
where W′_jc is the weight of the center vector of class C, β is the number of positive examples in the training sample, and γ is the number of counter-examples in the training sample.
The vector distance measurement formula is:
The Rocchio method's principle is simple and its computation rapid. The calculation procedure is: represent each text as a high-dimensional vector in the vector space; give positive weights to the vectors of positive examples in the training set and negative weights to the vectors of counter-examples; and add and average them to compute the center of each category. For a text belonging to the test set, its similarity to each class center is calculated and the text is assigned to the category with the greatest similarity. From this calculation process, the Rocchio classifier can reach good classification precision for category distributions whose between-class distances are large and within-class distances small; for category distributions that do not achieve such a "good distribution", the Rocchio classifier performs worse. But because its computation is simple and rapid, the method is commonly used in applications with high demands on classification time, and it has become the standard against which other classification techniques are compared. The whole implementation and evaluation procedure of this classifier can be expressed as follows:
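The patent's Rocchio formulas were rendered as images and are not reproduced in this text, so the sketch below illustrates only the principle described: class centers computed by averaging class vectors and nearest-centroid assignment by cosine similarity (illustrative Python; all names are hypothetical):

```python
import math

def centroid(vectors):
    """Class center vector: component-wise mean of the class's texts."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n
            for d in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity, a common choice of vector distance measure."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rocchio_classify(doc, class_vectors):
    """Assign the document to the class with the most similar center."""
    centroids = {c: centroid(vs) for c, vs in class_vectors.items()}
    return max(centroids, key=lambda c: cosine(doc, centroids[c]))
```

This sketch omits the negative-example term of the full Rocchio feedback formula; it captures only the simple, fast center-and-compare behavior that makes the method attractive when classification time matters.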
The class-center calculation formula is:
where n_i is the number of texts in class L_i, and D_ij is the j-th text vector of category L_i. After the category of a text has been determined, the system finds related texts within the limits of the text library and recommends them to the user. The system represents texts in the vector space using the features remaining after dimensionality reduction and weights the feature items with TF-IDF. In order to reduce the amount of indexing and matching computation, the 20 feature dimensions with the highest TF-IDF values are used to represent document D:
D = [z_1, TFIDF(z_1), z_2, TFIDF(z_2), ..., z_20, TFIDF(z_20)]
A text classification task can be regarded as filling in a {0, 1} table whose abscissa is the set of categories and whose ordinate is the set of documents: each document and each category correspond to a number, with 0 meaning the document does not belong to the category and 1 meaning that it does. In order to reduce the uncertainty in the experiments, this classification process needs to be stated more precisely.
Automatic text classification has two typical testing methods: the train-and-test method and k-fold cross-evaluation. The train-and-test method is the classical evaluation method: it divides the original training collection T into a training set and a test set, carries out feature selection and classifier training with the training set, and tests the classifier with the test set. The k-fold cross-evaluation method divides the original training collection into k parts {T_1, T_2, ..., T_k}, performs k tests, and finally takes their average as the final result.
T_train = T − T_i, T_test = T_i, i = 1, 2, ..., k
k-fold cross-evaluation is usually used when the original training collection is very small; the aim is to make full use of the initial samples for training. The strictest and most accurate cross-evaluation method is the leave-one-out (LOO) method: suppose there are m samples; each time, one sample serves as the test sample and the remaining samples serve as training samples, and the average of the m tests is taken as the final result.
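The leave-one-out evaluation can be sketched as follows (illustrative Python; train_and_test is an assumed callback that trains on the given samples and returns a per-run effect index for the single held-out sample):

```python
def leave_one_out(samples, train_and_test):
    """Leave-one-out evaluation: each of the m samples is the test
    sample exactly once; the mean of the m runs is the final result."""
    m = len(samples)
    total = 0.0
    for i in range(m):
        train = samples[:i] + samples[i + 1:]
        total += train_and_test(train, samples[i])
    return total / m
```

With m samples this trains m classifiers, which is why LOO is reserved for very small training collections despite being the strictest evaluation.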
The above embodiments describe the present invention in detail. It should be noted that the specific embodiments described herein serve only to explain the present invention and are not intended to limit it; any product that can realize the described functions is an equivalent improvement and is included within the protection scope of the present invention.
The beneficial effects of the invention are: based on information theory, the present invention further refines the classification process, clarifies the function of each module in the computer text classification system, and guarantees the efficiency and processing rate of classification; an effect improvement module is added, improving the correctness of classification.

Claims (1)

1. A computer text classification system, characterized by comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module, and an effect improvement module;
the text preprocessing module mainly preprocesses input text in the computing section of the computer: it first breaks the input text apart using word segmentation software, removes punctuation marks and spaces, and divides the text into a word set; the word set is then further processed to remove meaningless words, forming a simplified word set;
the text feature extraction module is mainly based on a feature selection method: it first generates feature word subsets from the simplified word set, the generation of a feature word subset being a continuous search process that uses a branch-and-bound search algorithm; each generated feature word subset is evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, and the evaluation value is compared with a stopping threshold; if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, producing new feature word subsets through evaluation and filtering, each new feature word subset being composed of feature words; the frequency with which each feature word occurs is calculated using a mutual information method and, combining the occurrence frequencies of the feature words, a mapping table between the feature words and their occurrence frequencies is obtained;
the text training module processes the mapping table between the feature words and their occurrence frequencies: it randomly selects other texts, calculates the inverse document frequency, and uses the calculated inverse document frequency as input to compute the weight value of each feature word through a trained classifier, thereby obtaining a term weight matrix;
the classification module uses the term weight matrix and an SVM classification algorithm to set classification intervals and classify the words in the simplified word set, obtaining a word category vector set in which the words of one category belong to the same vector in the word category vector set;
the text category marking module is used to mark the word category vector set: it builds a category mark table for the words, uses special symbol values in the category mark table as the mark values of the word categories, and adds the mark values of the word categories to the word category vector set to obtain a marked word category vector set;
the effect improvement module performs error statistics on the marked word category vector set, the statistics process being a random sampling process: it first extracts marked word category vectors from the marked word category vector set, sorting them according to the distribution law of the words, with the top 30% of the ranking as the emphasis region for extraction; the classification effect of the extracted samples is tested and adjusted using a redundancy parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and control returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
CN201610905152.5A 2016-10-11 2016-10-11 Computer text classification system Pending CN106570076A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610905152.5A CN106570076A (en) 2016-10-11 2016-10-11 Computer text classification system


Publications (1)

Publication Number Publication Date
CN106570076A true CN106570076A (en) 2017-04-19

Family

ID=60414153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610905152.5A Pending CN106570076A (en) 2016-10-11 2016-10-11 Computer text classification system

Country Status (1)

Country Link
CN (1) CN106570076A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081667A (en) * 2011-01-23 2011-06-01 浙江大学 Chinese text classification method based on Base64 coding
CN102332012A (en) * 2011-09-13 2012-01-25 南方报业传媒集团 Chinese text sorting method based on correlation study between sorts
CN103116637A (en) * 2013-02-08 2013-05-22 无锡南理工科技发展有限公司 Text sentiment classification method facing Chinese Web comments
CN103995876A (en) * 2014-05-26 2014-08-20 上海大学 Text classification method based on chi square statistics and SMO algorithm


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503153A (en) * 2016-10-21 2017-03-15 江苏理工学院 Computer text classification system and text classification method thereof
CN106503153B (en) * 2016-10-21 2019-05-10 江苏理工学院 Computer text classification system
CN107194617A (en) * 2017-07-06 2017-09-22 北京航空航天大学 App software engineer soft-skill classification system and method
CN107579816A (en) * 2017-09-06 2018-01-12 中国科学院半导体研究所 Password dictionary generation method based on recurrent neural network
CN107579816B (en) * 2017-09-06 2020-05-19 中国科学院半导体研究所 Method for generating password dictionary based on recurrent neural network
CN108388914A (en) * 2018-02-26 2018-08-10 中译语通科技股份有限公司 Classifier construction method based on semantic computation, and classifier
CN108388914B (en) * 2018-02-26 2022-04-01 中译语通科技股份有限公司 Classifier construction method based on semantic calculation and classifier
CN110969006A (en) * 2019-12-02 2020-04-07 支付宝(杭州)信息技术有限公司 Training method and system of text sequencing model
CN110969006B (en) * 2019-12-02 2023-03-21 支付宝(杭州)信息技术有限公司 Training method and system of text sequencing model
CN112364629A (en) * 2020-11-27 2021-02-12 苏州大学 Text classification system and method based on redundancy-removing mutual information feature selection

Similar Documents

Publication Publication Date Title
CN106570076A (en) Computer text classification system
CN107291723A (en) Method and apparatus for web page text classification, and method and apparatus for web page text recognition
CN103632168B (en) Classifier integration method for machine learning
CN107798033B (en) Case text classification method in public security field
CN105138653B (en) Topic recommendation method based on typicality and difficulty, and recommendation apparatus thereof
CN105975518B (en) Expected cross-entropy feature selection text classification system and method based on information entropy
CN104391835A (en) Method and device for selecting feature words in texts
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN110222744A (en) Naive Bayes classification model improvement method based on attribute weighting
CN106156372B (en) A kind of classification method and device of internet site
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN109933670A (en) Text classification method calculating semantic distance based on a combination matrix
CN102622373A (en) Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm
CN104239485A (en) Statistical machine learning-based internet hidden link detection method
CN105760493A (en) Automatic work order classification method for electricity marketing service hot spot 95598
CN105260437A (en) Text classification feature selection method and application thereof to biomedical text classification
CN107273505A (en) Supervised cross-modal hashing retrieval method based on a nonparametric Bayesian model
CN111680225B (en) WeChat financial message analysis method and system based on machine learning
CN103617435A (en) Image sorting method and system for active learning
CN114707571B (en) Credit data anomaly detection method based on enhanced isolation forest
CN110516074A (en) Website theme classification method and device based on deep learning
CN102129568A (en) Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier
CN106815605B (en) Data classification method and equipment based on machine learning
CN103268346B (en) Semisupervised classification method and system
CN109783633A (en) Data analysis service procedural model recommended method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20170419)