CN106570076A - Computer text classification system - Google Patents
Computer text classification system
- Publication number
- CN106570076A (application CN201610905152.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- classification
- module
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a computer text classification system. The system comprises a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module and an effect improvement module. Based on information theory, the system further refines the classification process, so that the function of each module in the classification system is clearly defined and both the classification efficiency and the classification throughput are ensured; an effect improvement module is added, which improves the accuracy of classification.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a text classification system.
Background technology
Text classification serves as the technical foundation of fields such as information filtering, information retrieval, text databases, digital libraries and mail classification, and therefore has a wide range of applications. The development and popularization of networks has greatly facilitated our access to information, but the sheer volume of information also makes it difficult to process: the required information cannot be obtained quickly, and unwanted information is retrieved alongside it. Information filtering technology can be used to solve these problems. The essence of information filtering is a classification problem: it can be used both to filter out information the user dislikes and to pick out information the user is interested in. Existing text classification systems process text too inefficiently, classify poorly, and have an error rate that is too high.
The main object of the present invention is to provide a text classification system with high efficiency, high classification throughput and high accuracy.
The content of the invention
In view of this, the technical problem to be solved by the present invention is to provide a text classification system, and in particular its feature selection method, to solve the problems set forth above.
To achieve the above effect, the technical scheme of the invention is as follows: a computer text classification system, comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module and an effect improvement module;
The text preprocessing module preprocesses the input text: it first segments the text using word segmentation software, removes punctuation marks and spaces, and divides the text into a word set; it then processes the word set further, removing meaningless words to form a simplified word set;
The text feature extraction module is based on a feature selection method. It first generates feature word subsets from the simplified word set; generating a feature word subset is an iterative search process, and the search uses a branch-and-bound search algorithm. Each generated feature word subset is then evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, and the evaluation value is compared with a stopping threshold: if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, with evaluation and filtering producing new feature word subsets. The frequency of occurrence of each feature word is then computed with the mutual information method, the feature word frequencies are aggregated, and a mapping table between feature words and frequencies is obtained;
The text training module processes the mapping table between feature words and frequencies. It randomly selects other texts, computes the inverse document frequency, and, taking the inverse document frequency as input, computes the weight value of each feature word through the trained classifier, thereby obtaining a term weight matrix;
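The weighting step can be sketched as follows; the patent only names the inverse document frequency as the input to the weight computation, so the common tf * log(N/df) product is assumed here in place of the unspecified trained classifier, and the toy corpus is invented.

```python
import math

def term_weight_matrix(docs, vocab):
    """Build a TF-IDF style term weight matrix.

    docs: list of token lists; vocab: ordered feature words. The
    tf * log(N / df) weighting is an assumption for illustration.
    """
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    matrix = []
    for d in docs:
        row = []
        for w in vocab:
            tf = d.count(w) / max(len(d), 1)            # term frequency
            idf = math.log(n / df[w]) if df[w] else 0.0  # inverse document frequency
            row.append(tf * idf)
        matrix.append(row)
    return matrix

docs = [["ball", "goal", "ball"], ["vote", "law"], ["law", "court"]]
vocab = ["ball", "law"]
m = term_weight_matrix(docs, vocab)
```

Each row of the matrix is one document's weight vector over the feature words, which is the shape the classification module consumes next.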
According to the term weight matrix, the classification module sets the classification intervals using an SVM classification algorithm and classifies the words, obtaining a word category vector set in which the words of one category belong to the same vector;
The text category marking module is used to mark the word category vector set: it establishes a category label table for the words, uses special symbol values as the mark values of the word categories, and adds the mark values to the word category vector set to obtain a marked word category vector set;
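A minimal sketch of the marking step, under the assumption that the classification module has already assigned a category to each word; the dictionary shapes and the symbol table are illustrative, since the patent does not fix a data format.

```python
def mark_categories(word_labels, symbols):
    """Group words by predicted category and attach symbolic mark values.

    word_labels: {word: category} as produced by the classification module;
    symbols: {category: mark value}. Both shapes are assumptions.
    """
    vectors = {}                        # one "vector" (word list) per category
    for word, cat in word_labels.items():
        vectors.setdefault(cat, []).append(word)
    label_table = {cat: symbols[cat] for cat in vectors}
    marked = {symbols[cat]: sorted(words) for cat, words in vectors.items()}
    return label_table, marked

labels = {"ball": "sport", "goal": "sport", "law": "politics"}
table, marked = mark_categories(labels, {"sport": "#S", "politics": "#P"})
```

The returned label table plays the role of the category label table of the words, and the marked mapping is the marked word category vector set.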
The effect improvement module performs error statistics on the marked word category vector set. The statistics process is a random sampling process: the vectors are sorted according to the distribution law of the words, and the top 30% of the sorted region is the region from which samples are chiefly extracted. The classification effect of the extracted samples is tested and adjusted using a nuisance parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and control returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
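The effect-improvement loop can be sketched as follows. The 30% emphasis region comes from the text; the concrete adjustment-rate threshold and the correctness oracle are assumptions for illustration.

```python
def needs_refeature(ranked_samples, is_correct, max_adjust_rate=0.3):
    """Decide whether to return to the feature extraction module.

    ranked_samples: vectors sorted by the word distribution law; samples
    are drawn mainly from the top 30%. Each misclassified sample counts
    as one adjustment; the 0.3 rate threshold is a hypothetical "safe
    range" boundary, not a value from the patent.
    """
    focus = ranked_samples[: max(1, int(len(ranked_samples) * 0.3))]
    adjustments = sum(1 for s in focus if not is_correct(s))
    rate = adjustments / len(focus)
    return rate > max_adjust_rate       # too many adjustments: re-extract

ranked = ["law", "ball", "court", "vote", "goal", "team", "win", "tax", "mp", "net"]
flag = needs_refeature(ranked, lambda w: w != "mp")
```

A `True` result corresponds to the patent's "return to the text feature extraction module and modify the threshold" branch.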
Description of the drawings
Fig. 1 is a structure chart of the text classification system.
Specific embodiment
In order to make the technical problem to be solved, the technical scheme and the beneficial effects clearer, the present invention is described in detail below with reference to the drawings and embodiments. It should be noted that the specific embodiments described herein only serve to explain the present invention and are not intended to limit it; products that can realize the described functions belong to equivalents and improvements and are included within the protection scope of the present invention. The concrete methods are as follows:
Embodiment one: the conventional approach to text classification is feature selection. Useful features have strong class discrimination ability and occur concentrated in one class or in a few classes. Therefore, when certain words occur in a text, it can be determined with high confidence which class or classes the text belongs to. Different features express the categories of a text with different strength; feature selection is precisely the task of selecting features with high discriminating power for text categories. This ability is called text category discrimination, that is, the ability of a feature to distinguish classes.
The text category discrimination of a feature is the amount of category information the feature carries: the more text category information it contains, the greater its discrimination; conversely, the less it contains, the smaller its discrimination. Intuitively, the presence or absence of a feature in a document is used to judge the category attribute of the document. Specifically, if whether or not a feature occurs in a document has no influence at all on judging the category of the document, the feature is meaningless for text classification; its class discrimination ability is very small, and feature selection should assign it a very small evaluation value.
The greater the text category discrimination of a feature, the more category information the feature carries, so whether a feature has high class discrimination determines whether it is useful for classification. For example, a feature word that is present only in the medical category has maximal category discrimination. However, such a feature may be a rare word that seldom occurs in texts. If such a word is selected as a feature, the feature does not occur at all in other texts, i.e. the feature vectors of all other texts are zero in that dimension. This leads to the matrix sparsity problem frequently encountered in text classification and hence to overfitting, a phenomenon caused by over-dependence on the training set. Therefore, when performing feature selection for text, high-frequency words should be chosen as far as possible; this generality of a feature is called its generalization ability. Clearly, features with high text category discrimination and strong generalization ability are the most useful for classification, while features with low discrimination and weak generalization ability are the most useless. Features with high class discrimination but weak generalization ability are more useful in high dimensions, because having enough features avoids the sparsity problem and guarantees recall, and more discriminating features improve classification accuracy. Features with low class discrimination but strong generalization ability are more useful in low dimensions, because they remain usable in the sparse case even though their ability to discriminate classes is not strong.
Embodiment two: finding a good nuisance parameter value optimizes the text classification effect. For the whole text classification system, the algorithm considers different nuisance parameter values and estimates the classification effect index of each value on the training set. Cross-validation yields estimates of the mean and variance of the evaluation index, and the statistical significance between two systems. The optimal nuisance parameter value is the one that achieves the highest statistical significance relative to A=0 (i.e. information gain feature selection). For each nuisance parameter, cross-validation is carried out on the text classification system using the training set: the training set is randomly divided into several parts, one part is selected as the validation test set, and the rest serve as the validation training set. The system composed of the feature selection method and the classification algorithm is trained on the validation training set, and the classification effect and evaluation index are then computed on the validation test set. The split into training and test parts is rotated proportionally and the experiment is repeated. For example, in four-fold cross-validation the training set is divided equally into four parts: in the first round, subsets 1, 2 and 3 form the validation training set and subset 4 is the validation test set; in the next round, subsets 1, 2 and 4 form the training set and subset 3 is the test set.
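The four-fold rotation in the example can be sketched as follows; the round-robin fold assignment is an illustrative choice, since the text only requires a random division into k parts.

```python
def k_fold_splits(items, k=4):
    """Rotate k folds into (train, test) pairs, as in the four-fold example."""
    folds = [items[i::k] for i in range(k)]   # deal items into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

data = list(range(8))
splits = list(k_fold_splits(data, k=4))
```

Each of the four rounds trains on three folds and tests on the remaining one, so every sample is used as test data exactly once.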
For different nuisance parameter values, the mean and variance of the effect index are measured experimentally. From the experiments in this work, four-fold cross-validation proved sufficient: using more folds does not noticeably improve the effect, and although more folds give better estimates of the mean and variance, they also increase the training time. Considering two nuisance parameter values, cross-validation is carried out for each, yielding one group of effect index data per value; a significance test is defined to determine whether one nuisance parameter value performs better than the other. Starting from an initial value, the algorithm tests different nuisance parameter values until an optimum is found: when a certain nuisance parameter value has the highest statistical significance, it is taken as the optimum.
From the angle of information theory, feature selection is defined as the process of selecting, from the candidate feature set, the feature subset with the maximal mutual information with the category labels. On this basis, approximate algorithms for four mutual information feature selection methods are compared. Because all of the document classification algorithms except information gain require a nuisance parameter, a wrapper-type algorithm for searching for the best nuisance parameter value is needed. The comparison shows that, compared with the fixed value 0.4, using the searched nuisance parameter brings improvements of varying degrees in all cases, and clearly visible improvements in some, which further demonstrates that the chosen nuisance parameter and number of features are related to the classification algorithm used.
Embodiment three: according to earlier evaluation experiments on document classification algorithms, the Rocchio, KNN and SVM methods stand out the most; the classification accuracy of the KNN and SVM methods is higher, while the time complexity and space complexity of the Rocchio method are the lowest. These three methods are therefore mainly analysed and compared here.
The Rocchio classifier is a method based on the vector space model and minimum distance. Its greatest feature is a good feedback capability: the classification vector space can be corrected according to its formula. The method was introduced by Hull in 1994 and has been widely used since. The Rocchio formula, written with the symbols defined below, takes the standard form:
W'_jc = α·W_jc + (1/β)·Σ_{d ∈ positive examples} w_dj − (1/γ)·Σ_{d ∈ counter-examples} w_dj
where W'_jc is the weight of the center vector of class C, β is the number of positive examples in the training sample, and γ is the number of counter-examples in the training sample.
The vector distance is measured with the cosine similarity sim(d, c) = (d·c)/(|d|·|c|).
The principle of the Rocchio method is simple and its computation is fast. The calculation procedure is: represent each text as a high-dimensional vector in the vector space; give the vectors of positive examples in the training set positive weights and the vectors of counter-examples negative weights; then sum and average them to compute the center of each category. For a text belonging to the test set, its similarity to each class center is computed, and the text is assigned to the category with the maximal similarity. From this calculation process, for category distributions in which the between-class distances are large and the within-class distances are small, the Rocchio classifier can reach good classification precision; for category distributions that do not achieve this "good distribution", the Rocchio classifier performs poorly. But because its computation is simple and fast, the method is often used in applications with strict classification-time requirements, and it has become the baseline against which other classification methods are compared. The whole implementation and evaluation procedure of this classifier can be expressed as follows:
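The training and classification steps just described can be sketched as a nearest-centroid classifier; this is a minimal illustration that omits the negative-example correction term of the full Rocchio formula, with invented toy vectors.

```python
import math

def centroid(vectors):
    """Mean of a list of equal-length vectors: the class center."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rocchio_classify(train, doc):
    """Assign doc to the class whose center vector is most similar.

    train: {label: [vector, ...]}. A minimal nearest-centroid sketch of
    the Rocchio classifier.
    """
    centers = {lbl: centroid(vs) for lbl, vs in train.items()}
    return max(centers, key=lambda lbl: cosine(centers[lbl], doc))

train = {"sport": [[1.0, 0.1], [0.9, 0.0]],
         "politics": [[0.1, 1.0], [0.0, 0.8]]}
label = rocchio_classify(train, [0.8, 0.2])
```

The max-similarity assignment implements "assign the text to the category with the maximal similarity to its class center".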
The class-center calculation formula is:
C_i = (1/n_i) · Σ_j D_ij
where n_i is the number of texts in class L_i and D_ij is the j-th text vector of class L_i. After the category of a text has been determined, the system finds related texts within the limits of the text library and recommends them to the user. The system represents texts in the vector space using the features remaining after dimensionality reduction and weights the feature items using TF-IDF. In order to reduce the indexing and matching computation, a document D is represented by the 20 feature items with the highest TF-IDF values:
D = [z1, TFIDF(z1), z2, TFIDF(z2), ..., z20, TFIDF(z20)]
The text classification task can be regarded as filling in a table with values {0, 1}: the columns are the series of categories and the rows are the documents. Each document-category pair holds one number: 0 indicates that the document does not belong to the category, and 1 indicates that it does. In order to reduce uncertainty in the experiments, this classification process also needs to be specified more precisely.
Automatic text classification has two typical testing methods: the train-and-test method and k-fold cross-validation. The train-and-test method is the classical evaluation method: it divides the original training collection T into a training set and a test set, performs feature selection and classifier training on the training set, and tests the classifier on the test set. The k-fold cross-validation method divides the original training collection into k parts {T1, T2, ..., Tk}, carries out k tests, and finally takes their average as the final result:
T_train = T − T_i, T_test = T_i, i = 1, 2, ..., k
k-fold cross-validation is usually used when the original training collection is very small, the aim being to make full use of the initial samples for training. The strictest and most accurate cross-validation method is the leave-one-out (LOO) method: assuming there are m samples, one sample is used as the test sample each time and the remaining samples are used as training samples, and the mean of the m tests is taken as the final result.
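The leave-one-out procedure can be sketched as follows; the nearest-neighbour stand-in for the classifier and the toy data are illustrative assumptions.

```python
def leave_one_out(samples, train_and_test):
    """Leave-one-out evaluation: m rounds, one held-out sample per round.

    train_and_test(train, test_sample) -> 1.0 if correct else 0.0 is a
    stand-in for the full feature-selection + classifier pipeline.
    """
    results = []
    for i, test_sample in enumerate(samples):
        train = samples[:i] + samples[i + 1:]   # all other samples train
        results.append(train_and_test(train, test_sample))
    return sum(results) / len(results)          # mean of the m tests

# Hypothetical 1-nearest-neighbour check on 1-D points labeled by sign.
data = [(-2, "neg"), (-1, "neg"), (1, "pos"), (2, "pos")]
def nearest_label(train, sample):
    x, label = sample
    pred = min(train, key=lambda t: abs(t[0] - x))[1]
    return 1.0 if pred == label else 0.0
accuracy = leave_one_out(data, nearest_label)
```

With m samples this performs m train/test rounds, making it the most exhaustive (and most expensive) variant of the k-fold scheme above.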
The above embodiments describe the present invention in detail. It should be noted that the specific embodiments described herein only serve to explain the present invention and are not intended to limit it; products that can realize the described functions belong to equivalents and improvements and are included within the protection scope of the present invention.
The beneficial effects of the invention are: based on information theory, the classification process is further refined, so that the function of each module in the computer text classification system is clearly defined and both the classification efficiency and the classification throughput are ensured; an effect improvement module is added, which improves the accuracy of classification.
Claims (1)
1. A computer text classification system, characterized by comprising a text preprocessing module, a text feature extraction module, a text training module, a classification module, a text category marking module and an effect improvement module;
the computing section in the computer first preprocesses the input text through the text preprocessing module: the input text is first segmented using word segmentation software, punctuation marks and spaces are removed, and the text is divided into a word set; the word set is then processed further, meaningless words are removed, and a simplified word set is formed;
the text feature extraction module is based on a feature selection method: feature word subsets are first generated from the simplified word set; generating a feature word subset is an iterative search process, and the search uses a branch-and-bound search algorithm; each generated feature word subset is then evaluated with an evaluation function based on a genetic algorithm to obtain an evaluation value, and the evaluation value is compared with a stopping threshold; if the evaluation value is greater than the stopping threshold the search stops, otherwise it continues, with evaluation and filtering producing new feature word subsets composed of feature words; the frequency of occurrence of each feature word is computed with the mutual information method, the frequencies of the feature words are aggregated, and a mapping table between the feature words and the frequencies of the feature words is obtained;
the text training module processes the mapping table between the feature words and the frequencies of the feature words: it randomly selects other texts, computes the inverse document frequency, and, taking the computed inverse document frequency as input, computes the weight value of each feature word through the trained classifier, thereby obtaining a term weight matrix;
according to the term weight matrix, the classification module sets the classification intervals using an SVM classification algorithm and classifies the words in the simplified word set, obtaining a word category vector set in which the words of one category belong to the same vector in the word category vector set;
the text category marking module is used to mark the word category vector set: a category label table of the words is established, special symbol values in the category label table serve as the mark values of the word categories, and the mark values of the word categories are added to the word category vector set to obtain a marked word category vector set;
the effect improvement module performs error statistics on the marked word category vector set; the statistics process is a random sampling process: the marked word category vectors in the marked word category vector set are first extracted and sorted according to the distribution law of the words, the top 30% of the sorted region being the region from which samples are chiefly extracted; the classification effect of the extracted samples is tested and adjusted using a nuisance parameter; if the adjustment frequency is too high, the classification effect is judged inadequate, and the system returns to the text feature extraction module, where the threshold is modified and feature extraction is repeated until the adjustment frequency falls within a safe range.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610905152.5A CN106570076A (en) | 2016-10-11 | 2016-10-11 | Computer text classification system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106570076A true CN106570076A (en) | 2017-04-19 |
Family
ID=60414153
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610905152.5A Pending CN106570076A (en) | 2016-10-11 | 2016-10-11 | Computer text classification system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106570076A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102081667A (en) * | 2011-01-23 | 2011-06-01 | 浙江大学 | Chinese text classification method based on Base64 coding |
CN102332012A (en) * | 2011-09-13 | 2012-01-25 | 南方报业传媒集团 | Chinese text sorting method based on correlation study between sorts |
CN103116637A (en) * | 2013-02-08 | 2013-05-22 | 无锡南理工科技发展有限公司 | Text sentiment classification method facing Chinese Web comments |
CN103995876A (en) * | 2014-05-26 | 2014-08-20 | 上海大学 | Text classification method based on chi square statistics and SMO algorithm |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503153A (en) * | 2016-10-21 | 2017-03-15 | 江苏理工学院 | Computer text classification system, system and text classification method thereof |
CN106503153B (en) * | 2016-10-21 | 2019-05-10 | 江苏理工学院 | Computer text classification system |
CN107194617A (en) * | 2017-07-06 | 2017-09-22 | 北京航空航天大学 | A kind of app software engineers soft skill categorizing system and method |
CN107579816A (en) * | 2017-09-06 | 2018-01-12 | 中国科学院半导体研究所 | Password dictionary generation method based on recurrent neural network |
CN107579816B (en) * | 2017-09-06 | 2020-05-19 | 中国科学院半导体研究所 | Method for generating password dictionary based on recurrent neural network |
CN108388914A (en) * | 2018-02-26 | 2018-08-10 | 中译语通科技股份有限公司 | A kind of grader construction method, grader based on semantic computation |
CN108388914B (en) * | 2018-02-26 | 2022-04-01 | 中译语通科技股份有限公司 | Classifier construction method based on semantic calculation and classifier |
CN110969006A (en) * | 2019-12-02 | 2020-04-07 | 支付宝(杭州)信息技术有限公司 | Training method and system of text sequencing model |
CN110969006B (en) * | 2019-12-02 | 2023-03-21 | 支付宝(杭州)信息技术有限公司 | Training method and system of text sequencing model |
CN112364629A (en) * | 2020-11-27 | 2021-02-12 | 苏州大学 | Text classification system and method based on redundancy-removing mutual information feature selection |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170419 |