CN103927302A

CN103927302A - Text classification method and system

Info

Publication number: CN103927302A
Application number: CN201310009087.4A
Authority: CN
Inventors: 陈俊波; 李华康; 曾鹏程; 薛贵荣
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-01-10
Filing date: 2013-01-10
Publication date: 2014-07-16
Anticipated expiration: 2033-01-10
Also published as: HK1200040A1; CN103927302B

Abstract

The application provides a text classification system and a text classification method. The method comprises: extracting entries and the association rules among them from a resource that has category association rules, and generating a category association rule base; generating a basic category lexicon from the basic categories of various fields; preprocessing a test text and extracting its feature entries; comparing the entries in the basic category lexicon with the entries in the category association rule base, computing the weights of the entries in the basic category lexicon using the association rules of the entries, and computing the weights of the entries in the category association rule base; and classifying the test text with a classifier according to the extracted feature entries and the computed entry weights. The system and method avoid the cross-language platform barriers that traditional text classification faces in different language environments. For new words and new senses of old words, a new text classifier can be obtained by slightly modifying the tree-form association rules, without having to consider the uniformity of the text distribution as a traditional text classifier must.

Description

A text classification method and system

Technical field

The present application relates to the field of text processing, and in particular to a method and system for text classification.

Background art

Text classification is one of the most common tasks in text processing. It generally comprises text representation, classifier selection and training, and the evaluation and feedback of classification results. Text representation can be further divided into text preprocessing, indexing and statistics, and feature extraction. A flow chart of text classification is shown in Fig. 1. Preprocessing formats the raw corpus into a uniform format so that it can be processed uniformly afterwards. Indexing decomposes documents into basic processing units and reduces the cost of subsequent processing. Statistics summarizes word frequencies and the probabilities relating terms to categories, and generates an association rule base. Feature extraction extracts from a document the features that reflect its subject. The classifier then classifies the text based on the generated association rule base and the feature vector of the test text. After classification is complete, the classification results are analyzed to further optimize the classification rules, enrich the training base, and so on.

Current research on text classification focuses mainly on feature value extraction and classifier modeling. The following are existing domestic patented technologies related to text classification:

A short-text classification method and text classification system based on domain knowledge addresses the shortcoming that traditional text classification methods in the information technology field cannot classify short texts well. A training data acquisition module obtains training data to form a learning database; a data processing module performs information extraction on the learning database, turning unstructured data into structured data; a text representation module represents the data mathematically using a vector space model; a feature extraction module ranks the importance of the term set according to the TFIDF algorithm; and a model building module assigns different weights to the terms and classifies according to predefined classification rules. This method and system innovate on the traditional classifier by introducing the concept of domain terms into it, which effectively increases the amount of information in a short text. Semantic analysis based on different term sets is performed on short-text data, in particular web-page commodity data, and the results are injected into the classifier, adding new information to the commodity data and thereby improving classification accuracy.

A text classification method based on block division and position weights comprises: for an input training or test text, extracting the paragraph information after basic preprocessing; treating each paragraph as a basic text block, statistically analyzing the block information, and re-dividing the text content into blocks according to the block-size distribution or a predefined ratio, including operations such as merging text blocks; extracting feature words, quantifying their weights, and obtaining the posterior probability of each feature word with respect to the categories; then analyzing the distribution of the feature words whose maximum-posterior-probability category matches the category label of the text, and finally generating a text vector; and using a classifier to complete classification model training or text classification. The method can be used in the text representation stage of a text classification system. By enriching the traditional representation of text content when feature words are used to build the text vector, it improves the classification result.

A feature selection and weight calculation method for text classification based on domain knowledge combines sample statistics with domain terms to build a domain classification feature space, uses external domain knowledge relations to calculate the similarity between terms, and adjusts the weights of the corresponding dimensions of the classification feature vector accordingly. A support vector machine learning algorithm is then used to build a domain text classification model and perform domain text classification. Experiments on classifying Yunnan tourism and non-tourism texts show that the classification accuracy of this method is 4 percentage points higher than that of the improved TFIDF feature weighting method.

A two-stage combined text classification method based on probabilistic topics works as follows. First-stage classification: based on a naive Bayes classification method, test texts are classified using probabilistic topic features and a rejection condition. Second-stage classification: for the test texts rejected by the first stage, feature words are extracted with a traditional feature extraction method and the texts are classified again. This hierarchical combination merges the characteristics of different classifiers: the first stage quickly classifies many texts correctly, which greatly improves the efficiency of the text classification system and offers a good approach to making such systems practical. Probabilistic topics are proposed in view of the characteristics of text, and under a suitable rejection condition they complete a large number of classification tasks with very high accuracy. Experiments show that, compared with traditional single-stage classification, this two-stage combination greatly reduces time consumption and improves the classification accuracy of the system.

As shown in Fig. 1, traditional text classification first requires a classification system with well-defined boundaries, and a text collection that is sufficiently representative of each category must be gathered as training samples according to that system. This step is often the most time-consuming part of traditional text classification. Once a sufficient and good enough set of training texts 101 has been collected, each text is preprocessed to obtain processed training texts 102. Preprocessing includes, for example, Chinese word segmentation, generating a stop-word list, Chinese feature selection, and text vector representation. Many mature Chinese word segmentation methods exist, such as CDWS, n-gram, and hidden Markov models (HMM). Function words, which serve as grammatical components of a text, occur very frequently in an article yet carry almost no meaning after word segmentation: they interfere with classification, make the text dimensionality too high, and reduce classification efficiency. If the raw data are web pages, structural noise such as plug-ins, headers, and footers must also be removed. Automatic generation of stop words is not yet mature; at present it is mainly done by importing an existing general stop-word list and manually labeling project-specific stop words, which takes time and introduces some human-caused instability into the system. A large number of text features increases the space and time complexity of the classification algorithm on the one hand, and may contain a large amount of noise on the other, which ultimately affects classification accuracy. The mainstream text feature selection methods at present include TFIDF, information gain, mutual information, the χ² statistic, and cross entropy. Feature selection is performed on the processed training texts 102 to obtain a feature dictionary 103. As text length and the number of texts increase, the computational cost of feature selection also grows linearly. After the training text vectors 104 have been obtained through feature selection and related steps, the traditional method generates an association rule base 105 by mining frequent itemsets, and then generates a classifier 106 through rule pruning and similar means. A test text 107 undergoes similar preprocessing to obtain a processed test text 108; after the test text vector 109 has been obtained through feature selection and related steps, classifier 106 classifies it to obtain category 200.

Moreover, in existing classification techniques the key rule base has limited ability to adjust dynamically to new words and stop words. With the development of computer technology and the rapid spread of the Internet, more and more people use the Internet to obtain information. The massive resources of the network and the continually emerging text resources constantly challenge the extensibility and adaptability of existing association rule bases. The Master's thesis of Su Xiaokang of Central China Normal University, "Building a semantic knowledge base from Wikipedia and its application in the field of text classification", proposes a method for building a classification knowledge base from the massive real text available on the Internet (such as Wikipedia). The method adopts a formal knowledge representation in which semantic labels denote referents and semantic fingerprints describe semantics, extracts a corpus of a certain scale, mines the link relations between Wikipedia pages, and automatically builds a semantic knowledge base. However, that prior art focuses on providing a semantic knowledge base and does not provide a corresponding text classification technique based on the association rules of such a knowledge base.

Summary of the invention

In view of the defects of existing text classification techniques, the technical problem to be solved by the present application is to provide a method and system that automatically generate an association rule base from a resource and combine it with a basic category lexicon to perform text classification. For example, based on entry association rules and a basic category lexicon, the organizational structure of the resource is analyzed to generate a text keyword category association rule system, a naive Bayes classifier is constructed, and the test text is classified.

A text classification system of the present application comprises: an association rule base generation module, which extracts entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base; a basic category lexicon generation module, which generates a basic category lexicon based on the existing basic categories of each field; a text preprocessing module, which preprocesses a test text to extract text feature entries; a rule pruning module, which compares the entries in the basic category lexicon with the entries in the category association rule base, calculates weights for the entries in the basic category lexicon using the association rules of the entries in the category association rule base, and calculates the weights of the entries in the category association rule base; and a classifier module, which classifies the test text based on the entry weights and the extracted text feature entries.

A text classification method corresponding to the system of the present application comprises: extracting entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base; generating a basic category lexicon based on the existing basic categories of each field; preprocessing a test text and extracting the feature entries of the test text; comparing the entries in the basic category lexicon with the entries in the category association rule base, calculating weights for the entries in the basic category lexicon using the association rules of the entries in the category association rule base, and calculating the weights of the entries in the association rule base; and using a classifier to classify the test text according to the extracted feature entries and the calculated entry weights.

The technical solution of the present application performs text classification based on category association rules and a basic category lexicon, and thus avoids the cross-language platform barriers that traditional text classification faces in different language environments. In addition, for new words and new senses of old words, a new text classifier can be obtained by slightly modifying association rules of any type (tree, network, chain, and so on), without worrying about the uneven text distribution that affects traditional text classifiers.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below obviously show only some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of traditional text classification.

Fig. 2 is a diagram of the text classification system of an embodiment of the present application.

Fig. 3 shows examples of tree-form association rules in an embodiment of the present application.

Fig. 4 shows an example of a complex association rule in an embodiment of the present application.

Fig. 5 shows a single link whose root node is unlabeled in an embodiment of the present application.

Fig. 6 shows multiple links in which some root nodes are unlabeled in an embodiment of the present application.

Fig. 7 shows the loop chain pruning strategy of an embodiment of the present application.

Detailed description of the embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are obviously only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the application.

The present application performs text classification based on category association rules and a basic category lexicon. In the embodiments below, Wikipedia is used as an example of the basis for generating the category association rule base, but the application is not limited to it. Wikipedia is a multilingual encyclopedia collaboration project based on wiki technology and an online encyclopedia written in different languages; its goal is to provide a free encyclopedia for all humanity, written in the languages of their choice. By November 2011, more than 31.72 million registered users and numerous unregistered users had contributed more than 20.24 million articles in 282 languages, with more than 1,231.92 million edits in total. Because Wikipedia has a multilingual reference classification system, a text category association rule system built from its tree-structured association rules can be applied directly to different language systems. Other similar encyclopedia databases, such as Baidu's encyclopedia database and Chinese encyclopedia websites, as well as the category index entries of digital libraries, can also serve as the basis for generating the association rule base. The method and system of the application apply equally to association rules with other topologies, such as network structures and chain structures (any one or at least one of them) or combinations thereof. The technical solution of the application is introduced below with reference to the drawings.

Fig. 2 is a diagram of the text classification system of the present application. The system comprises an association rule base generation module 201, a basic category lexicon generation module 202, a text preprocessing module 203, a rule pruning module 204, and a classifier 205. This embodiment uses tree-form text category association rules based on Wikipedia, but is not limited to this.

The association rule base generation module 201 extracts entries and the rules between these entries from a resource having category association rules, to generate the category association rule base, and stores the entries and the association rules between them in that base.

For example, data mining or crawler technology is used to extract entries and the rules between them from an Internet resource that has certain category association rules, such as Wikipedia, and to generate the category association rule base; a crawler tool may, for instance, crawl the category index entries of Wikipedia and save them to a database. The crawler uses a web page analysis algorithm to filter out links irrelevant to the topic, keeps the useful links, and puts them into a queue of URLs waiting to be fetched. It then selects the next page URL to fetch from the queue according to a search strategy, and repeats the fetch and select steps until a stop condition is reached. Taking page fetching by a crawler program as an example, the pages fetched by the crawler are stored by the system, then analyzed and filtered, and an index is built for later retrieval. The common search strategies are depth-first and breadth-first. The crawler program obtains the category index entries of Wikipedia, for example: communication -> mobile phone -> frequency, which contains three entries and the association relations between them. In this example the relations are one-to-one parent-child association rules: communication (parent), mobile phone (child); and mobile phone (parent), frequency (child). One-to-many association rules, with one parent and multiple children, are also possible. The entries and the association rules between them (the category association rules) are saved to a database in a form such as: Page categories -> Society -> Military -> Military science -> Linear warfare. The resulting category association rule base is the association rule base referred to here. Reference link:

http://zh.wikipedia.org/wiki/Wikipedia:%E5%88%86%E9%A1%9E%E7%B4%A2%E5%BC%95
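As a minimal sketch of how a crawled category path of this form could be split into parent-child association rules: the function names `path_to_rules` and `build_rule_base` and the in-memory dictionary are illustrative assumptions, since the source only states that the rules are saved to a database.

```python
from collections import defaultdict

def path_to_rules(path, sep="->"):
    """Split a category path into (parent, child) pairs.

    The "->" separator follows the notation used in the text;
    the function itself is a hypothetical helper.
    """
    terms = [t.strip() for t in path.split(sep) if t.strip()]
    return list(zip(terms, terms[1:]))

def build_rule_base(paths):
    """Map each child entry to the set of its direct parent entries."""
    parents = defaultdict(set)
    for path in paths:
        for parent, child in path_to_rules(path):
            parents[child].add(parent)
    return dict(parents)

rules = path_to_rules("communication -> mobile phone -> frequency")
rule_base = build_rule_base(["a -> b", "c -> b"])
```

Storing parents per child keeps the bottom-up lookups used later in the pruning step cheap; a child with several parents appears once with a set of parents.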

The basic category lexicon generation module 202 generates the basic category lexicon, which contains the basic categories of each field and is used for category lookup to determine the field of a text. Module 202 generates the lexicon from the existing basic categories of each field, for example by downloading an existing basic category lexicon, such as the categorized lexicons of the QQ input method, and saving it to a local file for later use. Reference link: http://dict.py.qq.com/

Based on the above category association rules and basic category lexicon, text association rules can be built using, for example, tree-structured association rules, which helps in applying them to different language systems and across language platforms. The category association rule base and the basic category lexicon provide the data preparation for the subsequent pruning and text classification.

The text preprocessing module 203 processes the test text and extracts text feature entries. It provides text feature vector extraction: a text to be tested undergoes simple Chinese word segmentation, function words such as particles and adverbs are removed to obtain the keyword list of the text, and a score is calculated for each keyword from, for example, TFIDF and keyword length, as the input to classifier 205.
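The preprocessing step can be sketched roughly as follows, assuming the text has already been segmented into tokens. The stop-word set, the smoothed IDF formula, and the names `STOP_WORDS` and `extract_keywords` are assumptions for illustration; the source does not say how TFIDF and keyword length are combined, so this sketch scores by TF-IDF only.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the source removes function words
# such as particles and adverbs but does not list them.
STOP_WORDS = {"the", "of", "a", "is", "and"}

def extract_keywords(tokens, doc_freq, n_docs):
    """Return {keyword: TF-IDF score} for one segmented text.

    doc_freq maps a term to the number of reference documents that
    contain it; n_docs is the size of that reference collection.
    """
    kept = [t for t in tokens if t not in STOP_WORDS]
    counts = Counter(kept)
    total = len(kept)
    scores = {}
    for term, count in counts.items():
        # Smoothed IDF (an assumption) so unseen terms do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0))) + 1.0
        scores[term] = (count / total) * idf
    return scores

scores = extract_keywords(
    ["the", "chip", "frequency", "chip"],
    doc_freq={"chip": 1, "frequency": 9},
    n_docs=10,
)
```

In this toy input "chip" occurs twice and is rare in the reference collection, so it scores higher than "frequency".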

The rule pruning module 204 compares the entries in the basic category lexicon with the entries in the category association rule base, calculates weights for the entries in the basic category lexicon using the entry association rules in the category association rule base, and calculates the weights of the entries in the association rule base. Before classification, the rule pruning module 204 processes the above category association rules and basic category lexicon. Because entry association rules such as tree-structured ones and the basic category lexicon are simple to build, compare, analyze, calculate, modify, and transfer, text classification across language platforms is supported, with no concern for the uniform text distribution that a traditional text classifier requires.

According to the rule pruning algorithm, and based on the data prepared by both the association rule base generation module 201 and the basic category lexicon generation module 202, the module analyzes the association rules of the entries in the association rule base, calculates weights for the entries in the basic category lexicon, and passes the weight information to classifier 205 for use in classification. The working principle of the rule pruning module 204 is described later.

The classifier module 205 uses the calculated entry weights as the prior conditional probabilities of a naive Bayes classifier, that is, the conditional probability that an entry belongs to a given category, and classifies the test text. Text classification is then complete and the required category is obtained. This embodiment uses a naive Bayes classifier; other types of text classifier can also be used in the present application after suitable modification.
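The source does not spell out the scoring formula of the classifier. The sketch below is therefore not the naive Bayes classifier of the embodiment but a simplified weighted vote that shows how keyword scores and per-entry category weights could be combined; `classify` and both of its argument layouts are assumptions.

```python
from collections import defaultdict

def classify(keyword_scores, entry_weights):
    """Pick the category with the highest accumulated score.

    keyword_scores: {entry: score} from preprocessing.
    entry_weights:  {entry: {category: weight}}, where each weight is
                    read as the probability that the entry belongs
                    to that category.
    Simplification (assumption): a weighted sum, not a full
    naive Bayes posterior.
    """
    totals = defaultdict(float)
    for entry, score in keyword_scores.items():
        for category, weight in entry_weights.get(entry, {}).items():
            totals[category] += score * weight
    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)

best, totals = classify(
    {"chip frequency": 1.0, "screen": 0.5},
    {
        "chip frequency": {"computer": 0.5, "mobile phone": 0.5},
        "screen": {"mobile phone": 1.0},
    },
)
```

Entries with no known weights contribute nothing, which mirrors the treatment of unlabeled nodes with weight 0 described later.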

The working principle of the rule pruning module 204 is described below. The module receives the entry association rules from the association rule base generation module 201 and the entries of the basic category lexicon generated by the basic category lexicon generation module 202, and calculates the weights of the entries in the lexicon. Its pruning process comprises: 1) estimating the weights of the basic category lexicon; 2) initializing the category association rule base of the wiki tree (initializing the association relations); 3) calculating the weights of all nodes in the association rule base with an iterative algorithm; 4) handling special nodes; and 5) cutting cyclic relations in the association rules.

1) Estimating the weights of the basic category lexicon

Suppose there are N lexicons of different categories, CD = {Dict_1, Dict_2, ..., Dict_N}, and each Dict file contains M words/phrases, Dict_i = {Word_i1, Word_i2, ..., Word_iM}. The weight of each word/phrase in each lexicon is calculated as w_weight = 1/DF, where DF (Dictionary Frequency) is the number of different lexicons in which the current word occurs. This yields the basic category lexicon weight table. For example, the phrase "chip frequency" appears in both the {D_computer} class and the {D_mobile phone} class, so its DF value is 2, and the conditional probability that it belongs to the {D_computer} class is P({D_computer} | "chip frequency") = 1/2. A table of the entries and their corresponding weights is generated accordingly.
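The w_weight = 1/DF rule can be sketched in a few lines. The function name `dictionary_weights` and the input layout (category name mapped to a list of words) are assumptions; the weight rule itself is the one stated above.

```python
from collections import defaultdict

def dictionary_weights(lexicons):
    """Build the weight table {word: {category: 1/DF}}.

    lexicons: {category: iterable of words/phrases}.
    DF is the number of distinct lexicons that contain the word.
    """
    categories_of = defaultdict(set)
    for category, words in lexicons.items():
        for word in words:
            categories_of[word].add(category)
    return {
        word: {c: 1.0 / len(cats) for c in cats}
        for word, cats in categories_of.items()
    }

weight_table = dictionary_weights({
    "D_computer": ["chip frequency", "cpu"],
    "D_mobile phone": ["chip frequency"],
})
```

With this input, "chip frequency" has DF = 2 and receives 0.5 for each of its two categories, matching the example in the text.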

2) Initializing the association rule base of the wiki tree structure

The entries in the category association rule base are queried, for example the entries in the tree-form text classification dictionary of Wikipedia. If the current entry appears in the weight table of the basic category lexicon, the weight in the weight table is assigned to the current entry and the current node is marked as a "labeled node"; otherwise the category information of the entry is empty (an unlabeled node). The information of each node is stored in the category association rule base. This node information comprises all categories associated with the node and their corresponding weights, in a form such as {category 1: weight 1, category 2: weight 2, ...}. For example:

When the Wikipedia entry "weaving city street" does not appear in the lexicon, the "weaving city street" node is left unchanged;

When the entry "turbocharging" of a Wikipedia node appears only in the {D_machinery} class of the lexicon, the "turbocharging" node is assigned W_{turbocharging} = {D_machinery: 1};

When the entry "chip frequency" of a Wikipedia node appears in both the {D_computer} class and the {D_mobile phone} class, the "chip frequency" node is assigned W_{chip frequency} = {D_computer: 0.5, D_mobile phone: 0.5}.
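The three cases above amount to a lookup of each rule-base entry in the lexicon weight table. A minimal sketch follows; the node layout with `labeled` and `weights` fields and the name `init_nodes` are assumptions.

```python
def init_nodes(rule_base_entries, weight_table):
    """Initialize every entry of the rule base as a node.

    An entry found in the lexicon weight table becomes a labeled node
    and copies its {category: weight} map; any other entry stays
    unlabeled with no category information.
    """
    nodes = {}
    for entry in rule_base_entries:
        if entry in weight_table:
            nodes[entry] = {"labeled": True,
                            "weights": dict(weight_table[entry])}
        else:
            nodes[entry] = {"labeled": False, "weights": {}}
    return nodes

nodes = init_nodes(
    ["weaving city street", "turbocharging", "chip frequency"],
    {
        "turbocharging": {"D_machinery": 1.0},
        "chip frequency": {"D_computer": 0.5, "D_mobile phone": 0.5},
    },
)
```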

3) Calculating the weights of all nodes in the category association rule base with an iterative algorithm

After the labeled nodes have been initialized, the unlabeled nodes are handled according to the following kinds of association rules, such as one-to-one and one-to-many association rules between entries.

The entries in the category association rule base are queried. If the current entry does not appear in the weight table of the basic category lexicon, its category information is set to 0 and it is stored in the category association rule base as an unlabeled node. Its weight is then calculated from the association rules, such as one-to-one and one-to-many (see a), b), and c) below), between the current entry and the entries that do appear in the basic category lexicon.

a) When an unlabeled node X, searching bottom-up, reaches a labeled node A and has a single-chain "1-1" relation with A (as in Fig. 3(a)), the depth weight between X and A is 1, and the association rule between X and A is X = A. The node weight is w_weight_X = w_weight_A.

b) When node X has a multi-chain "1-n" relation with the nodes {A, B, ...} in the layer above (as in Fig. 3(b)), and these n parent nodes are at the same depth and are all labeled nodes, the depth weight of X with respect to each parent node is 1/n, and the association rule between node X and node A is:

w_weight_X = \frac{1}{n} w_weight_A

The category association rule of node X is the set of all its parent nodes (assuming here that A and B are mutually orthogonal):

X = \frac{1}{n} \{A, B, \ldots\}

(For example, the weight of X in Fig. 3(b) is \frac{1}{3} \{A, B, C\}.)

c) When node X has a "1-n" association rule with upper-layer nodes (as in Fig. 3(c)) but the n parent nodes are at unequal depths, written depth(X, Y) for each parent node Y, the depth weight between node X and node A is:

w_weight_X = \frac{1}{depth(X, A)} / \sum_{Y = 1}^{R} \frac{1}{depth(X, Y)}

R is the number of parent nodes related to X; in Fig. 3(c), R = 2, because two labeled nodes, A and B, are related to X. The category association rule of node X is:

X = \{\frac{d_{X, A}}{d_{X}} * A, \frac{d_{X, B}}{d_{X}} * B, \ldots\}

where

d_{X, A} = \frac{1}{depth(X, A)}

d_{X} = \sum_{Y = 1}^{R} \frac{1}{depth(X, Y)} = d_{X, A} + d_{X, B} + \ldots

(Note: in the formulas above, "..." stands for any additional related parent nodes.)

As shown in Fig. 3(c), the depth from node A to node X is 3 and the depth from node B to node X is 4, so d_{X, A} = 1/3 and d_{X, B} = 1/4. From these, the relation between node X and nodes A and B can be calculated.
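Cases a), b), and c) can be viewed as one rule: each labeled ancestor contributes with coefficient d_{X,Y} / d_X, which reduces to 1 for a single parent and to 1/n for n parents at equal depth. The sketch below assumes that reading; the name `combine_parents`, the input layout, and the choice to add up contributions when two ancestors share a category (the source assumes orthogonal parents) are illustrative assumptions.

```python
from collections import defaultdict

def combine_parents(parents):
    """Derive the category weights of an unlabeled node X.

    parents: list of (depth, {category: weight}) pairs, one per
             labeled ancestor Y, where depth = depth(X, Y) >= 1.
    Each ancestor is scaled by (1/depth) / sum of all (1/depth).
    """
    d_total = sum(1.0 / depth for depth, _ in parents)
    result = defaultdict(float)
    for depth, weights in parents:
        coeff = (1.0 / depth) / d_total
        for category, w in weights.items():
            # Assumption: contributions to a shared category are summed.
            result[category] += coeff * w
    return dict(result)

# Fig. 3(c) style input: A at depth 3, B at depth 4.
x_weights = combine_parents([(3, {"A": 1.0}), (4, {"B": 1.0})])
# Fig. 3(b) style input: three parents at the same depth.
equal = combine_parents([(1, {"A": 1.0}), (1, {"B": 1.0}), (1, {"C": 1.0})])
```

For depths 3 and 4 the coefficients are (1/3)/(7/12) = 4/7 and (1/4)/(7/12) = 3/7, so the nearer ancestor dominates.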

d) Calculating, through repeated iteration, the category weights of entries that have multiple parent nodes in the association rules.

This is a complex kind of association rule. For a node that needs to be calculated (such as the node with the dashed border in Fig. 4), an iterative algorithm can be used: to calculate node X, the information of its parent nodes and of the nodes associated with those parents is calculated step by step, and the association rules with the labeled nodes A and B are then determined.
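One possible reading of this step is a parents-first evaluation with memoization: a node's weights are derived only after the weights of its parents are known. The sketch below takes that reading and, as a simplifying assumption, splits weight equally among the direct parents at every step; the names `resolve`, `parents_of`, and `labeled` are hypothetical, and the graph is assumed to be free of loops.

```python
def resolve(entry, parents_of, labeled, memo=None):
    """Return {category: weight} for an entry of the rule base.

    parents_of: {entry: list of direct parent entries}.
    labeled:    {entry: {category: weight}} for labeled nodes.
    Assumes the graph has no loops (loops are cut beforehand).
    """
    if memo is None:
        memo = {}
    if entry in labeled:
        return labeled[entry]
    if entry in memo:
        return memo[entry]
    parents = parents_of.get(entry, [])
    result = {}
    for parent in parents:
        parent_weights = resolve(parent, parents_of, labeled, memo)
        for category, w in parent_weights.items():
            # Equal 1/n split across direct parents (assumption).
            result[category] = result.get(category, 0.0) + w / len(parents)
    memo[entry] = result
    return result

x = resolve(
    "X",
    parents_of={"X": ["P", "Q"], "P": ["A"], "Q": ["A", "B"]},
    labeled={"A": {"c1": 1.0}, "B": {"c2": 1.0}},
)
```

An unlabeled node with no parents resolves to an empty map, that is, no category and no weight.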

4) Handling special nodes, for example links that have an unlabeled "root" node after the association rules have been initialized:

Because the association rules (relations) between such a root node and the entries present in the basic category lexicon cannot be known, its weights, category, and association rules are determined as follows.

a) In the case of a single link, as shown in Fig. 5, the category of the root node is Null: the category cannot be determined and the weight is 0, so the root node in Fig. 5 is labeled {category: null, weight: 0}. The child nodes on the corresponding link have the same attributes and weight and are likewise labeled {category: null, weight: 0}.

b) When there are multiple links and the root nodes of some of them are unlabeled, as shown in Fig. 6, the weights of the current node are calculated by the same method as in 3) c).

In Fig. 6, root node A is labeled {category: null, weight: 0}, root node B is labeled {category: category 1, weight: 0.7}, and child node X is labeled {null category: (1/3)/(1/3+1/2) * 0, category 1: (1/2)/(1/3+1/2) * 0.7}.
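The arithmetic of the Fig. 6 example can be checked with exact fractions. The depths 3 and 2 are read off the terms 1/3 and 1/2 in the expression above; the helper name `depth_coefficients` is an assumption.

```python
from fractions import Fraction

def depth_coefficients(depths):
    """Return {node: (1/depth) / sum of all (1/depth)}."""
    inverse = {node: Fraction(1, d) for node, d in depths.items()}
    total = sum(inverse.values())
    return {node: value / total for node, value in inverse.items()}

# Depths implied by the terms 1/3 and 1/2 in the Fig. 6 expression.
coeff = depth_coefficients({"A": 3, "B": 2})
root_weight = {"A": Fraction(0), "B": Fraction(7, 10)}

x_null = coeff["A"] * root_weight["A"]        # unlabeled root contributes 0
x_category1 = coeff["B"] * root_weight["B"]   # (3/5) * (7/10)
```

The unlabeled root A drops out entirely, and X receives category 1 with weight 21/50 = 0.42.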

5) Cutting cyclic relations in the association rules

Complex association rules can contain loops; Fig. 7 shows an association rule that forms a loop chain. When the category and weight of node X are calculated and a loop such as the one on the left, X <- C <- ... <- A <- X, occurs, the calculation proceeds by cutting the last relation, A <- X, as shown by the dashed line in Fig. 7. Alternatively, the problem can be solved by manual labeling, or by calculating the category and weight of A first.
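Cutting the relation that closes a loop can be sketched as a depth-first walk that drops any edge leading back to a node on the current path. The function name `cut_cycles`, the `parents_of` layout, and the reading of the arrows as "depends on" relations are assumptions; the recursion is also only suitable for graphs of modest depth.

```python
def cut_cycles(parents_of, start):
    """Return a copy of parents_of with loop-closing edges removed.

    Walks upward from `start`; an edge whose target is already on the
    current path would close a loop and is therefore cut.
    """
    pruned = {node: list(ps) for node, ps in parents_of.items()}
    on_path, done = set(), set()

    def visit(node):
        on_path.add(node)
        kept = []
        for parent in pruned.get(node, []):
            if parent in on_path:
                continue  # this edge closes a loop: cut it
            kept.append(parent)
            if parent not in done:
                visit(parent)
        if node in pruned:
            pruned[node] = kept
        on_path.discard(node)
        done.add(node)

    visit(start)
    return pruned

# X depends on C, C on A, and A back on X: the last edge is cut.
acyclic = cut_cycles({"X": ["C"], "C": ["A"], "A": ["X"]}, "X")
```

Only the edge that returns to the starting chain is removed, so the remaining relations are still available for the weight calculation.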

The method corresponding to the text classification system described above comprises: using a crawler program to extract entries and the correlation rules between the entries (whose structure may be, for example, at least one of a tree structure, a chain structure, and a network structure) from a resource having classification correlation rules (for example, at least one of an encyclopedic knowledge base and a digital library system), and generating a classification correlation rule storehouse;

Generating a basic classification dictionary based on the existing basic classifications of each field; preprocessing a test text and extracting the feature entries of the test text;

Comparing the entries in the basic classification dictionary with the entries in the classification correlation rule storehouse, using the correlation rules of the entries in the classification correlation rule storehouse, performing weight calculation on the entries in the basic classification dictionary (for example, based on the frequency with which each entry occurs in each classification of the basic classification dictionary), and calculating the weights of the entries in the classification correlation rule storehouse, thereby implementing rule pruning. During the comparison, if an entry in the classification correlation rule storehouse is present in the basic classification dictionary, the entry in the correlation rule storehouse is assigned a weight according to the weight of that entry in the basic classification dictionary; if an entry in the classification correlation rule storehouse is not present in the basic classification dictionary, its weight is calculated according to the correlation rules between this entry and the other entries of the classification correlation rule storehouse that are present in the basic classification dictionary. The entry correlation rules include one-to-one or many-to-one relationships between entries. The weight calculation in the correlation rule storehouse takes into account the relative depth between nodes in the classification correlation rule storehouse. The weight calculation of the entries in the classification correlation rule storehouse may be carried out by an iterative algorithm.

Using a classifier to classify the test text according to the weights of the entries and the extracted text feature entries. The classifier may be a Naive Bayes classifier, which uses the entry weights as its prior conditional probabilities to classify the test text. With the system and method of the present application, a correlation rule storehouse is generated by analyzing the organizational relationships of the resource, a correlation-rule-based classification system for text keywords is generated, and, in combination with the basic classification dictionary, a Naive Bayes classifier is constructed to classify the test text. As can be seen, this approach requires none of the traditional manual effort and therefore has no human-induced instability; it does not involve a large number of text features, which reduces the overhead and the time and space complexity of the classification algorithm; and it has little noisy data, which improves classification accuracy. Moreover, because the classifier is generated by rule pruning through comparing the above system with the dictionary, for new words and new concepts of old words the text classifier can be updated with only slight modifications to the correlation rules, without concern for the uneven text distribution problem of traditional text classifiers.

Corresponding to the text classification system described above, the present application also provides a text classification method. The specific implementation steps of the method correspond to the specific implementation of the modules of the text classification system, namely the processing of the aforementioned correlation rule storehouse generation module 201, basic classification dictionary generation module 202, text preprocessing module 203, rule pruning module 204, classifier 205, and so on. The steps of the method are as follows:

Entries and the correlation rules between the entries are extracted from a resource having classification correlation rules, to generate a classification correlation rule storehouse. The resource may be at least one of an encyclopedic knowledge base and a digital library system; the structure of the correlation rules between entries includes at least one of a tree structure, a chain structure, and a network structure; and the classification correlation rule storehouse may be generated through extraction by a crawler program. For the specific processing, see the processing described for the correlation rule storehouse generation module 201 in the system shown in Fig. 2.

A basic classification dictionary is generated based on the existing basic classifications of each field. For the specific processing, see the processing described for the basic classification dictionary generation module 202 in the system shown in Fig. 2.
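As a rough illustration of a frequency-based dictionary weight, the following Python sketch counts how often an entry occurs under each classification and normalizes the counts per entry. The normalization, the function name `build_dictionary_weights`, and the sample data are assumptions made for the example; the application only states that the weight is based on the frequency with which the entry occurs in each classification.

```python
from collections import Counter, defaultdict

def build_dictionary_weights(classification_entries):
    """classification_entries: dict mapping a classification to the list of
    entries recorded under it (repeats allowed).

    Returns entry -> {classification: weight}, where the weight is the share
    of the entry's occurrences that fall in that classification.
    """
    counts = defaultdict(Counter)
    for classification, entries in classification_entries.items():
        for entry in entries:
            counts[entry][classification] += 1
    weights = {}
    for entry, per_classification in counts.items():
        total = sum(per_classification.values())
        weights[entry] = {c: n / total for c, n in per_classification.items()}
    return weights

# Hypothetical sample dictionary with two classifications.
sample = {
    "sports": ["football", "match", "football"],
    "finance": ["stock", "match"],
}
weights = build_dictionary_weights(sample)
print(weights["match"])
```

An entry that occurs in only one classification receives the full weight for that classification, while an entry shared by several classifications has its weight split among them.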

The test text is preprocessed and its feature entries are extracted. For the specific processing, see the processing described for the text preprocessing module 203 in the system shown in Fig. 2.

The entries in the basic classification dictionary are compared with the entries in the classification correlation rule storehouse. Using the correlation rules of the entries in the classification correlation rule storehouse (for example, an entry correlation rule may be a one-to-one or many-to-one relationship between entries), weight calculation is performed on the entries in the basic classification dictionary (for example, based on the frequency with which an entry occurs in each classification of the basic classification dictionary), and the weights of the entries in the correlation rule storehouse are calculated (for example, taking into account the relative depth between nodes in the classification correlation rule storehouse; the calculation may use an iterative algorithm). An example of the comparison and weight assignment: when the entries in the classification correlation rule storehouse are compared with the entries in the basic classification dictionary, if an entry in the classification correlation rule storehouse is present in the basic classification dictionary, that entry is assigned a weight according to the weight of the entry in the basic classification dictionary; if the entry is not present in the basic classification dictionary, its weight is calculated according to the correlation rules between this entry and the other entries of the classification correlation rule storehouse that are present in the basic classification dictionary. For the specific processing, see the processing described for the rule pruning module 204 in the system shown in Fig. 2.
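The comparison and weight assignment described above might be sketched as follows. This is a simplified illustration, not the implementation of module 204: it assumes the rules are acyclic, that each relation carries a relative depth, and that parent contributions are discounted by inverse relative depth as in the Fig. 6 example; the names `entry_weights`, `parents`, and `dictionary` are hypothetical.

```python
def entry_weights(entry, parents, dictionary, cache=None):
    """Return {classification: weight} for an entry of the rule storehouse.

    parents: entry -> list of (parent_entry, relative_depth) relations.
    dictionary: entry -> {classification: weight} from the basic dictionary.
    """
    cache = {} if cache is None else cache
    if entry in cache:
        return cache[entry]
    if entry in dictionary:
        # Entry is present in the basic dictionary: assign its weight directly.
        result = dict(dictionary[entry])
    else:
        # Entry is absent: derive the weight from related entries.
        links = parents.get(entry, [])
        total = sum(1.0 / depth for _, depth in links)
        result = {}
        for parent, depth in links:
            share = (1.0 / depth) / total
            parent_label = entry_weights(parent, parents, dictionary, cache)
            for classification, weight in parent_label.items():
                result[classification] = result.get(classification, 0.0) + share * weight
    cache[entry] = result
    return result

# Hypothetical data: "sport" and "bank" are dictionary entries, "sponsor" is not.
dictionary = {"sport": {"sports": 0.9}, "bank": {"finance": 0.8}}
parents = {"sponsor": [("sport", 2), ("bank", 1)]}
sponsor = entry_weights("sponsor", parents, dictionary)
print(sponsor)
```

An entry with no labelled ancestor yields an empty label in this sketch, which plays the role of the {classification: null, weight: 0} marking.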

A classifier is used to classify the test text according to the extracted feature entries and the calculated entry weights. The classifier may be a Naive Bayes classifier, which uses the entry weights as its prior conditional probabilities to classify the test text. For the specific processing, see the processing described for the classifier 205 in the system shown in Fig. 2.
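A minimal sketch of a Naive Bayes style decision that treats the entry weights as conditional probabilities is given below. The uniform class prior, the `floor` value substituted for missing weights, and the names `classify` and `entry_weight_table` are assumptions introduced for the example only.

```python
import math

def classify(features, entry_weight_table, classifications, floor=1e-6):
    """Pick the classification with the highest summed log-weight.

    features: feature entries extracted from the test text (repeats allowed).
    entry_weight_table: entry -> {classification: weight}, used here as the
    conditional probability of the entry given the classification.
    floor: assumed small probability for entries without a weight.
    """
    scores = {}
    for classification in classifications:
        scores[classification] = sum(
            math.log(max(entry_weight_table.get(f, {}).get(classification, 0.0), floor))
            for f in features
        )
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical weights and test-text features.
entry_weight_table = {
    "football": {"sports": 0.9, "finance": 0.05},
    "stock": {"sports": 0.1, "finance": 0.8},
}
best, scores = classify(["football", "football", "stock"],
                        entry_weight_table, ["sports", "finance"])
print(best)
```

Summing logarithms rather than multiplying the weights avoids numerical underflow when a test text contains many feature entries.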

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another.

The present application may be described in the general context of computer-executable instructions, such as program modules or units. Generally, program modules or units may include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. In general, program modules or units may be implemented in software, hardware, or a combination of both. The present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units may be located in both local and remote computer storage media, including storage devices.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

Finally, it should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, commodity, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of additional identical elements in the process, method, commodity, or device that comprises the element.

Specific examples are used herein to set forth the principles and embodiments of the present application; the description of the above embodiments is intended only to help in understanding the method of the present application and its main idea. Meanwhile, for a person of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application in accordance with the idea of the present application. In summary, the content of this description should not be construed as limiting the present application.

Claims

1. A text classification system, characterized by comprising:

a correlation rule storehouse generation module (201), configured to extract entries and the correlation rules between the entries from a resource having classification correlation rules, to generate a classification correlation rule storehouse;

a basic classification dictionary generation module (202), configured to generate a basic classification dictionary based on the existing basic classifications of each field;

a text preprocessing module (203), configured to preprocess a test text to extract text feature entries;

a rule pruning module (204), configured to compare the entries in the basic classification dictionary with the entries in the classification correlation rule storehouse, and, using the correlation rules of the entries in the classification correlation rule storehouse, to perform weight calculation on the entries in the basic classification dictionary and to calculate the weights of the entries in the classification correlation rule storehouse;

a classifier module (205), configured to classify the test text based on the weights of the entries and the extracted text feature entries.

2. The system as claimed in claim 1, characterized in that:

the resource comprises at least one of an encyclopedic knowledge base and a digital library system;

the weight of an entry in the basic classification dictionary is calculated based on the frequency with which the entry occurs in each classification of the basic classification dictionary;

the structure of the correlation rules between the entries comprises at least one of a tree structure, a chain structure, and a network structure.

3. The system as claimed in claim 1, characterized in that the correlation rule storehouse generation module (201) is further configured to use a crawler program to generate the classification correlation rule storehouse.

4. The system as claimed in claim 1, characterized in that the rule pruning module (204) is further configured to: compare the entries in the classification correlation rule storehouse with the entries in the basic classification dictionary, and, if an entry in the classification correlation rule storehouse is present in the basic classification dictionary, assign a weight to the entry in the correlation rule storehouse according to the weight of that entry in the basic classification dictionary.

5. The system as claimed in claim 4, characterized in that the rule pruning module (204) is further configured to: if an entry in the classification correlation rule storehouse is not present in the basic classification dictionary, calculate its weight according to the entry correlation rules between this entry and the other entries of the classification correlation rule storehouse that are present in the basic classification dictionary.

6. The system as claimed in claim 5, characterized in that:

the entry correlation rules comprise one-to-one or many-to-one relationships between entries;

the weight calculation in the correlation rule storehouse takes into account the relative depth between nodes in the classification correlation rule storehouse;

the weight calculation of the entries in the classification correlation rule storehouse is carried out by an iterative algorithm.

7. The system as claimed in claim 1, characterized in that the classifier module (205) is a Naive Bayes classifier, which uses the weights of the entries as its prior conditional probabilities to classify the test text.

8. A text classification method, characterized by comprising:

extracting entries and the correlation rules between the entries from a resource having classification correlation rules, to generate a classification correlation rule storehouse;

generating a basic classification dictionary based on the existing basic classifications of each field;

preprocessing a test text and extracting the feature entries of the test text;

comparing the entries in the basic classification dictionary with the entries in the classification correlation rule storehouse, and, using the correlation rules of the entries in the classification correlation rule storehouse, performing weight calculation on the entries in the basic classification dictionary and calculating the weights of the entries in the correlation rule storehouse;

using a classifier to classify the test text according to the extracted feature entries and the calculated entry weights.

9. The method as claimed in claim 8, characterized in that:

the resource comprises at least one of an encyclopedic knowledge base and a digital library system;

10. The method as claimed in claim 8, characterized in that the classification correlation rule storehouse is generated through extraction by a crawler program.

11. The method as claimed in claim 8, characterized in that the entries in the classification correlation rule storehouse are compared with the entries in the basic classification dictionary, and, if an entry in the classification correlation rule storehouse is present in the basic classification dictionary, the entry in the correlation rule storehouse is assigned a weight according to the weight of that entry in the basic classification dictionary.

12. The method as claimed in claim 11, characterized in that, if an entry in the classification correlation rule storehouse is not present in the basic classification dictionary, its weight is calculated according to the entry correlation rules between this entry and the other entries of the classification correlation rule storehouse that are present in the basic classification dictionary.

13. The method as claimed in claim 12, characterized in that:

the entry correlation rules comprise one-to-one or many-to-one relationships between entries;

14. The method as claimed in claim 8, characterized in that the classifier is a Naive Bayes classifier, which uses the entry weights as its prior conditional probabilities to classify the test text.