CN103927302B - Text classification method and system - Google Patents

Text classification method and system

Info

Publication number
CN103927302B
Authority
CN
China
Prior art keywords
entry
category
association rule
text
rule base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310009087.4A
Other languages
Chinese (zh)
Other versions
CN103927302A (en
Inventor
陈俊波
李华康
曾鹏程
薛贵荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201310009087.4A
Publication of CN103927302A
Priority to HK15100449.0A (HK1200040A1)
Application granted
Publication of CN103927302B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text classification system and method. The method includes: extracting entries and the association rules between them from a resource that has category association rules, to generate a category association rule base; generating a basic category dictionary from the basic categories of each field; preprocessing the test text and extracting its text feature entries; comparing the entries in the basic category dictionary with the entries in the category association rule base, computing weights for the entries in the basic category dictionary using the association rules of the entries, and computing the weights of the entries in the association rule base; and classifying the test text with a classifier according to the extracted text feature entries and the computed entry weights. The technical solution avoids the cross-language, cross-platform technical barriers that traditional text classification faces in different language environments. Moreover, for new words and new meanings of existing words, a new text classifier can be obtained by only slightly modifying the tree-shaped association rules, without having to worry about the uniformity of the text distribution as with traditional text classifiers.

Description

Text classification method and system
Technical field
This patent application relates to the field of text processing, and in particular to a method and system for text classification.
Background
Text classification is one of the most common tasks in text processing. It generally comprises text representation, classifier selection and training, and the evaluation of and feedback on classification results; text representation can in turn be subdivided into text preprocessing, indexing and statistics, and feature extraction. A text classification flow chart is shown in Fig. 1. Preprocessing converts the original corpus into a uniform format to ease subsequent unified processing. Indexing mainly decomposes documents into basic processing units while reducing the cost of subsequent processing. Statistics mainly tallies word frequencies and the probabilities relating terms to categories, generating an association rule base. Feature extraction extracts from a document the features that reflect its subject. The main job of the classifier is to classify a text from the feature vector of the test text, based on the generated association rule base. After classification is complete, the classification results are analyzed in order to further optimize the classification rules, enrich the training corpus, and so on.
Current research on text classification focuses mainly on feature extraction and classifier modeling. The following are some existing domestic patented techniques concerning text classification:
A short-text classification method and text classification system based on domain knowledge addresses the shortcoming that traditional text classification methods in information technology cannot classify short texts well. A training data acquisition module obtains the data for training, yielding a learning corpus; a data preprocessing module performs information extraction on the learning corpus, turning unstructured data into structured data; a text representation module expresses the data mathematically using the vector space model; a feature extraction module ranks the importance of the term set according to the TFIDF algorithm; a model building module assigns different weights to the terms and classifies according to preset classification rules. The method and system innovate on the traditional classifier by introducing the concept of domain terms into the classifier, which effectively increases the information content of short texts. For short text data, especially web page commodity data, semantic analysis based on different term sets is performed and its result is injected into the classifier, adding new information to the commodity data and thereby improving the accuracy of text classification.
A text classification method based on block division and position weights comprises: after basic preprocessing of the input training or test text, extracting the paragraph information of the text; treating each paragraph as a basic text block and statistically analyzing the block information; re-dividing the text content into blocks according to the block size distribution or a predefined block ratio, including operations such as merging text blocks; extracting feature words, quantifying their weights, and obtaining the posterior probability of each feature word for each category; then analyzing the distribution of the feature words whose maximum-posterior category agrees with the category label of the text, and finally generating the text vector; and completing classification model training or text classification with a classifier. The method can be used in the text representation stage of a text classification system; by enriching the way content information is expressed when text vectors are built from feature words in the traditional way, it improves classification performance.
A text classification feature selection and weight calculation method based on domain knowledge combines sample statistics with domain terms to construct a domain classification feature space, uses relations from external domain knowledge to compute the similarity between terms, and adjusts the weights of the corresponding dimensions of the classification feature vector accordingly. A support vector machine learning algorithm is then used to build a domain text classification model and perform domain text classification. Classification tests on Yunnan tourism-domain and non-tourism-domain texts show that the classification accuracy of the method is 4 percentage points higher than that of the improved TFIDF feature weighting method.
A two-stage combined text classification method based on probabilistic topics uses: first-level classification, which classifies the test text with a naive Bayes method using probabilistic topic features and a rejection condition; and second-level classification, which extracts feature words by a traditional feature extraction method for the test texts rejected by the first level and classifies them. Classifying text with this layered combination merges the strengths of different classifiers: a large number of texts can be classified quickly and correctly at the first level, which greatly improves the efficiency of the text classification system and offers a good processing approach for making it practical. Probabilistic topics are proposed with the characteristics of text in mind; under an appropriate rejection condition, they complete a large share of text classification tasks with very high accuracy. Experiments show that, compared with traditional single-stage classification, the two-stage combination greatly reduces time consumption and improves classification accuracy.
As shown in Fig. 1, traditional text classification first requires formulating a category system with well-separated class boundaries and, according to that system, collecting enough category-representative texts as training samples; this step is often the most time-consuming part of traditional text classification work. After a sufficiently large and sufficiently good set of training texts 101 has been collected, each text is preprocessed to obtain the processed training text 102; preprocessing includes, for example, Chinese word segmentation, generation of a stop-word list, Chinese feature selection, and text vector representation. Many mature Chinese word segmentation methods exist, such as CDWS, n-gram, and HMM. Function words that serve the grammar of a text not only occur frequently but also carry almost no meaning; they interfere with classification and make the text dimensionality too high, reducing classification efficiency. If the original data are web pages, structural noise such as plug-ins, headers, and footers must also be removed. Automatic stop-word generation is still immature; at present it is mainly achieved by importing an existing general stop-word list and manually marking project-specific stop words, which costs time and introduces some human-induced instability into the system. A large number of text features increases the space and time complexity of the classification algorithm on the one hand, and may contain a large amount of noise on the other, ultimately harming classification accuracy. Mainstream text feature selection methods include TFIDF, information gain, mutual information, the χ² statistic, and cross entropy. Feature selection on the processed training text 102 yields the feature dictionary 103. As text length and the number of texts grow, the computational cost of feature selection also grows linearly. After the training text vectors 104 have been formed from the selected features, traditional text classification generates the association rule base 105 by mining frequent itemsets and then generates the classifier 106 by means such as rule pruning. The test text 107 undergoes similar preprocessing to give the processed test text 108, from which the test text vector 109 is formed on the selected features; the classifier 106 then classifies it to obtain the category 200.
Moreover, in existing classification methods the key rule base has only a limited ability to adapt dynamically to new words and stop words. With the continuous development of computer technology and the rapid spread of the Internet, more and more people use the Internet to obtain information. The massive resources of the network and the text resources that keep emerging constantly challenge the scalability and adaptability of existing association rule bases. The master's thesis of Su Xiaokang of Central China Normal University, "Building a semantic knowledge base from Wikipedia and its application in the field of text classification", proposes a method for constructing a classification knowledge base from the massive real text available on the Internet (such as Wikipedia). The method uses formal knowledge representations that describe semantics, such as semantic tags and semantic fingerprints, extracts a corpus of a certain size, mines the link relations between Wikipedia pages, and automatically builds a semantic knowledge base. However, the emphasis of that prior art is on providing a semantic knowledge base; it does not give a corresponding text classification technique based on the association rules of such a knowledge base.
Summary of the invention
To address the shortcomings of existing text classification techniques, the technical problem to be solved by the technical solution of the application is to provide a method and system that automatically generate an association rule base from a resource and combine it with a basic category dictionary to perform text classification. For example, based on entry association rules and a basic category dictionary, the organizational link structure of the resource is analyzed to generate a text-keyword association rule classification system, a naive Bayes classifier is constructed, and the test text is classified.
A text classification system of the application comprises: an association rule base generation module, which extracts entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base; a basic category dictionary generation module, which generates a basic category dictionary based on the basic categories of each field; a text preprocessing module, which preprocesses the test text to extract text feature entries; a rule pruning module, which compares the entries in the basic category dictionary with the entries in the category association rule base, uses the association rules of the entries in the category association rule base to compute weights for the entries in the basic category dictionary, and computes the weights of the entries in the category association rule base; and a classifier module, which classifies the test text based on the entry weights and the extracted text feature entries.
A text classification method corresponding to the system of the application comprises: extracting entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base; generating a basic category dictionary based on the basic categories of each field; preprocessing the test text and extracting its text feature entries; comparing the entries in the basic category dictionary with the entries in the category association rule base, using the association rules of the entries in the category association rule base to compute weights for the entries in the basic category dictionary, and computing the weights of the entries in the association rule base; and classifying the test text with a classifier according to the extracted text feature entries and the computed entry weights.
The technical solution of the application performs text classification based on category association rules and a basic category dictionary, avoiding the cross-language, cross-platform technical barriers that traditional text classification faces in different language environments. At the same time, for new words and new meanings of existing words, a new text classifier can be realized by slightly modifying association rules of any type (tree-shaped, mesh, chain, etc.), without having to worry about the uniformity of the text distribution as with traditional text classifiers.
Brief description of the drawings
To explain the technical solutions of the embodiments of the application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of traditional text classification.
Fig. 2 is a diagram of the text classification system of a specific embodiment of the application.
Fig. 3 shows examples of tree-shaped association rules in a specific embodiment of the application.
Fig. 4 shows an example of a complex association rule in a specific embodiment of the application.
Fig. 5 shows a single chain whose root node is unlabeled in a specific embodiment of the application.
Fig. 6 shows multiple chains in which some root nodes are unlabeled in a specific embodiment of the application.
Fig. 7 shows the loop pruning strategy of a specific embodiment of the application.
Detailed description
The technical solutions in the embodiments of the application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the application without creative effort fall within the scope of protection of the application.
The application is a technique for text classification based on category association rules and a basic category dictionary. In the following specific embodiments, Wikipedia is used as the basis for building the category association rule base by way of example, but the application is not limited to this. Wikipedia is a multilingual collaborative encyclopedia project based on wiki technology, and also a web encyclopedia written in different languages; its goal and purpose is to provide a free encyclopedia for all humankind, written in the language of each reader's choice. As of November 2011, more than 31,720,000 registered users and a large number of unregistered users had contributed more than 20,240,000 entries in 282 languages, with more than 1,231,920,000 edits in total. Because Wikipedia has a multilingual reference category system, a text association rule classification system built from its tree-structured association rules can be fully applied to different language systems. Other similar encyclopedia databases, such as Baidu's encyclopedia database and Chinese encyclopedia sites, as well as the category index entries of digital libraries, can also serve as the basis for building the association rule base. The method and system of the application are equally applicable to association rules of other topologies, such as a network structure or a chain structure (any one or at least one of them), or combinations thereof. The technical solution of the application is introduced below with reference to the drawings.
Fig. 2 is a diagram of the text classification system of the application. The system consists of an association rule base generation module 201, a basic category dictionary generation module 202, a text preprocessing module 203, a rule pruning module 204, and a classifier 205. This embodiment uses tree-shaped text category association rules based on Wikipedia, but is not limited to this.
The association rule base generation module 201 extracts entries and the rules between these entries from a resource that has category association rules in order to generate the category association rule base, which stores these entries and the association rules between them.
For example, data mining or crawler techniques are used to extract entries and the rules between them from network resources that have certain category association rules, such as Wikipedia, and the category association rule base is generated; for instance, a crawler tool crawls the category index entries of Wikipedia and saves them in a database. The crawler filters out links irrelevant to the topic according to a web page analysis algorithm, keeps the useful links, and puts them into a queue of URLs waiting to be fetched; it then selects the next URL to fetch from the queue according to a search strategy and repeats the fetch-and-select cycle until a stop condition is reached. Taking web page fetching by a crawler program as an example, the fetched pages are stored by the system, then analyzed, filtered, and indexed for later query and retrieval. The common search strategies are depth-first and breadth-first. The crawler obtains the category index entries of Wikipedia, for example Communication->Mobile phone->Frequency: three entries and the association relations among them. In this example the relations are one-to-one parent-child association rules: Communication (parent), Mobile phone (child); Mobile phone (parent), Frequency (child). One-to-many association rules are also possible, that is, one parent with several children. The entries and the association rules between them (category association rules) are saved in a database in a form such as Page categories->Society->Military->Military science->Line warfare, generating the category association rule base, that is, the association rule base. A reference link is, for example:
http://zh.wikipedia.org/wiki/Wikipedia:%E5%88%86%E9%A1%9E%E7%B4%A2%E5%BC%95
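The storage format described above, chains of entries joined by "->", can be turned into parent and child maps in a few lines. The following is a minimal sketch, not the crawler itself: the function name `build_rule_store` and the third chain ("Computer->Chip frequency") are hypothetical and only serve to show a one-to-many relation.

```python
from collections import defaultdict

def build_rule_store(chains, sep="->"):
    """Turn category chains such as 'Communication->Mobile phone->Frequency'
    into parent -> children and child -> parents maps."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for chain in chains:
        entries = [e.strip() for e in chain.split(sep) if e.strip()]
        # Each adjacent pair in a chain is one parent-child association rule.
        for parent, child in zip(entries, entries[1:]):
            children[parent].add(child)
            parents[child].add(parent)
    return children, parents

chains = [
    "Communication->Mobile phone->Frequency",
    "Communication->Mobile phone->Chip frequency",
    "Computer->Chip frequency",  # hypothetical chain, for illustration only
]
children, parents = build_rule_store(chains)
```

Keeping both directions makes the later bottom-up search from an unlabeled node toward labeled ancestors straightforward.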
The basic category dictionary generation module 202 generates the basic category dictionary; the basic categories of each field that it generates are used to look up categories in order to determine the field of a text. Module 202 generates the basic category dictionary from the existing basic categories of each field; this can be done by downloading an existing basic category dictionary, such as the QQ input method classified lexicon, and saving it to a local file for later use. A reference link is, for example: http://dict.py.qq.com/
Based on the above category association rules and basic category dictionary, association rules with, for example, a tree structure can be used to build the text association rules, which helps the approach apply to different language systems and work across language platforms. The category association rule base and the basic category dictionary provide the data for the subsequent pruning and text classification.
The text preprocessing module 203 processes the test text and extracts text feature entries. It performs text feature vector extraction: it preprocesses the test text, applies simple Chinese word segmentation to a text to be tested, removes function words such as auxiliary words and adverbs, and obtains the keyword list of the text; it then calculates a score for each keyword, for example from TFIDF and keyword length, as the input of the classifier 205.
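As an illustration of the preprocessing step, the sketch below removes stop words and scores the remaining keywords with a smoothed TF-IDF. It makes several assumptions that are not in the source: whitespace splitting stands in for Chinese word segmentation, the stop list and background documents are invented, and the keyword-length factor mentioned in the text is omitted because the source does not say how it is combined with TFIDF.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and"}  # hypothetical stop list

def tokenize(text):
    # Assumption: whitespace splitting stands in for Chinese word segmentation.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def tfidf_keywords(test_text, background_docs):
    """Return {keyword: score}, highest score first."""
    tokens = tokenize(test_text)
    if not tokens:
        return {}
    tf = Counter(tokens)
    docs = [set(tokenize(d)) for d in background_docs]
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[term] = (count / len(tokens)) * idf
    return dict(sorted(scores.items(), key=lambda kv: -kv[1]))

background = ["the chip is fast", "the phone is new", "a chip and a phone"]
keywords = tfidf_keywords("the turbocharging of the chip", background)
```

In this toy run the rare word "turbocharging" ranks above "chip", which also occurs in the background documents.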
The rule pruning module 204 compares the entries in the basic category dictionary with the entries in the category association rule base, uses the entry association rules in the category association rule base to compute weights for the entries in the basic category dictionary, and computes the weights of the entries in the association rule base. Before classification, the rule pruning module 204 processes the above category association rules and basic category dictionary: the entry association rules built from, for example, tree-structured association rules are compared with the basic category dictionary, analyzed, computed, and modified. This simple transfer approach ensures text classification across language platforms, without the need to worry about the uniform text distribution that traditional text classifiers require.
According to the rule pruning algorithm, and based on the data prepared by the association rule base generation module 201 and the basic category dictionary generation module 202, the association rules of the entries in the association rule base are analyzed, the weights of the entries in the basic category dictionary are analyzed and computed, and the weight information is passed to the classifier 205 for use in classification. The working principle of the rule pruning module 204 is described below.
The classifier module 205 uses the computed entry weights as the prior conditional probabilities of a naive Bayes classifier, that is, the conditional probability that an entry belongs to a given category, and classifies the test text, finally producing the categories required by text classification. This embodiment classifies text with a naive Bayes classifier; other types of text classifier can also be applied in the application after suitable modification.
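One way to read this step is as a naive-Bayes-style log score in which the per-entry category weights stand in for the conditional probabilities. The sketch below is an assumption-laden illustration, not the classifier of the application: the uniform prior, the smoothing constant `eps`, the use of keyword scores as multipliers, and the function name `classify` are all choices made here.

```python
import math

def classify(keyword_scores, entry_weights, eps=1e-6):
    """Pick the category with the highest naive-Bayes-style log score.

    keyword_scores: {entry: score} from preprocessing.
    entry_weights:  {entry: {category: weight}} from rule pruning.
    """
    categories = {c for w in entry_weights.values() for c in w}
    best, best_score = None, -math.inf
    for cat in sorted(categories):
        score = 0.0  # uniform prior assumed: a constant term, so it is dropped
        for entry, kw_score in keyword_scores.items():
            if entry not in entry_weights:
                continue  # entries unknown to the rule base carry no evidence
            p = entry_weights[entry].get(cat, 0.0)
            score += kw_score * math.log(p + eps)  # eps avoids log(0)
        if score > best_score:
            best, best_score = cat, score
    return best

entry_weights = {
    "turbocharging": {"D_machinery": 1.0},
    "chip frequency": {"D_computer": 0.5, "D_mobile phone": 0.5},
}
label = classify({"turbocharging": 0.6, "chip frequency": 0.4}, entry_weights)
```

With these inputs the heavier keyword "turbocharging" dominates, so the text is assigned to D_machinery; ties between categories are broken by sorted order, which is an arbitrary choice.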
The working principle of the rule pruning module 204 is described below. The rule pruning module 204 receives the entry association rules from the association rule base generation module 201 and the entries of the basic category dictionary generated by the basic category dictionary generation module 202, and computes the weights of the entries in the dictionary. The pruning process comprises: 1) estimating the weights of the basic category dictionary; 2) initializing the category association rule base of the wiki tree structure (initializing the association relations); 3) computing the weights of all nodes in the association rule base with an iterative algorithm; 4) handling special nodes; and 5) cutting ring relations in the association rules.
1) Estimating the weights of the basic category dictionary
Suppose there is a set of N dictionaries of different categories, CD = {Dict_1, Dict_2, ..., Dict_N}, and each Dict file contains M words/phrases, Dict_i = {Word_i1, Word_i2, ..., Word_iM}. The weight w_weight of each word/phrase in each dictionary is computed as w_weight = 1/DF, where DF (Dictionary Frequency) is the number of different dictionaries in which the current word occurs. This generates the weight table of the basic category dictionary. For example, the phrase "chip frequency" appears in both the {D_computer} class and the {D_mobile phone} class, so its DF value is 2 and the conditional probability that it belongs to the {D_computer} class is P({D_computer} | "chip frequency") = 1/2. The entries and their corresponding weights are stored in the weight table.
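The 1/DF rule can be sketched as follows. The dictionary contents other than "chip frequency" and "turbocharging" (for example "motherboard" and "touch screen") are invented for illustration, and the function name `dictionary_weights` is hypothetical.

```python
from collections import defaultdict

def dictionary_weights(category_dicts):
    """Return {word: {category: 1/DF}}, where DF is the number of
    category dictionaries that contain the word."""
    containing = defaultdict(set)
    for category, words in category_dicts.items():
        for word in set(words):  # a word counts once per dictionary
            containing[word].add(category)
    return {
        word: {cat: 1.0 / len(cats) for cat in cats}
        for word, cats in containing.items()
    }

category_dicts = {
    "D_computer": ["chip frequency", "motherboard"],
    "D_mobile phone": ["chip frequency", "touch screen"],
    "D_machinery": ["turbocharging"],
}
weight_table = dictionary_weights(category_dicts)
```

Here "chip frequency" has DF = 2 and receives weight 0.5 in each of its two categories, matching the example in the text.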
2) Initializing the association rule base of the wiki tree structure
The entries in the category association rule base are queried, for example the entries in the tree-shaped text category dictionary of Wikipedia. If the current entry exists in the weight table of the basic category dictionary, the weight in the weight table is assigned to the current entry and the current node is marked as a "labeled node"; otherwise the category information of the entry is empty (an unlabeled node). The information of each node is stored in the category association rule base; the node information includes all the categories associated with the node and their corresponding weights, in a form such as {category 1: weight 1, category 2: weight 2, ...}. For example:
The Wikipedia entry "Textile City Subdistrict" does not appear in the dictionary, so no processing is applied to the "Textile City Subdistrict" node;
When the Wikipedia node entry "turbocharging" appears only in the {D_machinery} class of the dictionary, the "turbocharging" node is assigned W_{turbocharging} = {D_machinery: 1};
When the Wikipedia node entry "chip frequency" appears in both the {D_computer} class and the {D_mobile phone} class, the "chip frequency" node is assigned W_{chip frequency} = {D_computer: 0.5, D_mobile phone: 0.5}.
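The initialization can be sketched as a lookup of each rule-base entry in the weight table. The node record layout (`labeled`, `weights`) and the function name `initialize_nodes` are assumptions made for this illustration.

```python
def initialize_nodes(rule_entries, weight_table):
    """Label each rule-base entry found in the weight table; leave the
    rest unlabeled with an empty category map."""
    nodes = {}
    for entry in rule_entries:
        if entry in weight_table:
            nodes[entry] = {"labeled": True, "weights": dict(weight_table[entry])}
        else:
            nodes[entry] = {"labeled": False, "weights": {}}
    return nodes

weight_table = {
    "turbocharging": {"D_machinery": 1.0},
    "chip frequency": {"D_computer": 0.5, "D_mobile phone": 0.5},
}
nodes = initialize_nodes(
    ["Textile City Subdistrict", "turbocharging", "chip frequency"], weight_table
)
```

Unlabeled nodes keep an empty map at this stage; their weights are filled in by the later propagation step.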
3) Computing the weights of all nodes in the category association rule base with an iterative algorithm
After the labeled nodes have been initialized, the unlabeled nodes must be handled according to the following kinds of association rule, such as one-to-one and one-to-many association rules between entries.
The entries in the category association rule base are queried. If the current entry does not exist in the weight table of the basic category dictionary, its category information is set to empty (zero) and the entry is stored in the category association rule base as an unlabeled node. The weight of the current entry is then computed from the rule relations, such as one-to-one or one-to-many (see a), b), c)), between the current entry and the entries that do exist in the basic category dictionary.
a) When the unlabeled node X, searching from the bottom up, reaches a labeled node A and has a "1-1" single-chain relation with node A (Fig. 3(a)), the depth ratio of node X to node A is 1, and the association rule between node X and A is X = A, so w_weight_X = w_weight_A.
b) When node X has a "1-n" multi-chain relation with the nodes {A, B, ...} of the layer above (Fig. 3(b)), and these n parent nodes have the same depth and are all labeled nodes, the depth weight of the node with respect to each parent node is 1/n. The association rule between node X and node A is then:
The category association rule of node X is the set over all parent nodes (assuming here that A and B are mutually orthogonal):
(For example, in Fig. 3(b) the weight of X is 1/3 {A, B, C}.)
c) When node X has a "1-n" association rule with upper-layer nodes (Fig. 3(c)) but the n parent nodes are at unequal depths, denoted depth(X, Y), the depth weight between node X and node A is
R denotes the number of parent nodes related to X; in Fig. 3(c), R = 2 because the two labeled nodes A and B are related to X. The category association rule of node X is then:
where
(Note: in the formulas above, "..." means that all related parent nodes must be included.)
As shown in Fig. 3(c), the depth from node A to node X is 3 and the depth from node B to node X is 4, so d_{X,A} = 1/3 and d_{X,B} = 1/4. From these, the relation of node X to nodes A and B can be computed.
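The formulas for cases b) and c) do not appear in this text. One reconstruction that agrees with the worked values given for Fig. 3(b), Fig. 3(c), and Fig. 6 is sketched below. The symbols $P(X)$ (the set of labeled parent nodes related to X, so that $R = |P(X)|$) and $w_Y$ (the category weight vector of node Y) are assumptions introduced here, not notation confirmed by the source.

```latex
% Depth weight between X and a related labeled parent Y (case c):
\[
  d_{X,Y} = \frac{1}{\operatorname{depth}(X, Y)}
\]
% Category weights of X as a normalized, depth-weighted combination:
\[
  w_X = \sum_{Y \in P(X)} \frac{d_{X,Y}}{\sum_{Y' \in P(X)} d_{X,Y'}} \; w_Y
\]
% Case b): all n parents at the same depth, so every coefficient is 1/n:
\[
  w_X = \frac{1}{n} \sum_{Y \in P(X)} w_Y
\]
```

Under this reading, the Fig. 3(c) example with $d_{X,A} = 1/3$ and $d_{X,B} = 1/4$ gives coefficients $\tfrac{1/3}{7/12} = \tfrac{4}{7}$ for A and $\tfrac{1/4}{7/12} = \tfrac{3}{7}$ for B.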
d) The category weight of an entry whose association rules involve multiple levels of parent nodes is calculated by repeated iteration.
This is a complex kind of association rule. For a node that has to be calculated (such as the node with the dashed border in Fig. 4), an iterative algorithm can be used: to calculate node X, its parent nodes and the information of the nodes associated with those parents must first be calculated step by step, which then determines the association rules between X and the marked nodes A and B.
4) Handling special nodes, for example links that have an unmarked "root" node when the association rules are initialized:
Because the association rules (relationships) between such a root node and the other entries present in the basic category dictionary cannot be known, its weight, category, and association rules are determined in the following way.
a) In the single-link case, as shown in Fig. 5, the category value of the root node is Null (the category cannot be determined) and its weight is 0; the root node in Fig. 5 is labeled {category: Null, weight: 0}. The child nodes on the corresponding link have the same attributes and weight, i.e., each child node is labeled {category: Null, weight: 0}.
b) When there are multiple links and the root nodes of some links are unmarked, as shown in Fig. 6, the attributes of the current node are calculated by the same method as in 3) c).
In Fig. 6, root node A is labeled {category: Null, weight: 0} and root node B is labeled {category: category 1, weight: 0.7}, so child node X is labeled {Null category: (1/3)/(1/3+1/2) * 0, category 1: (1/2)/(1/3+1/2) * 0.7}.
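Evaluating the Fig. 6 expression makes the result concrete. The arithmetic below only simplifies the two terms given in the text; the symbols s_A and s_B for the normalized shares are introduced here for illustration.

```latex
s_A = \frac{1/3}{1/3 + 1/2} = \frac{2}{5}, \qquad
s_B = \frac{1/2}{1/3 + 1/2} = \frac{3}{5}

\text{Null category: } s_A \cdot 0 = 0, \qquad
\text{category 1: } s_B \cdot 0.7 = 0.42
```

The unmarked root A therefore contributes nothing to the label of X, but it still takes a 2/5 share of the normalization, so X inherits category 1 with weight 0.42 rather than 0.7.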
5) Cutting ring relationships in the association rules
Complex association rules may contain rings; Fig. 7 shows an association rule that forms a loop chain. When the category and weight of node X are calculated and a loop chain such as X <- C <- ... <- A <- X appears (left side of Fig. 7), the calculation proceeds by cutting the last relation, A <- X, as shown by the dashed line in Fig. 7. Alternatively, the problem can be solved by manual labeling or by calculating the category and weight of A first.
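One way to cut the last relation of a loop chain is a depth-first walk from X along the child-to-parent links that drops any link leading back to a node already on the current path. This is a sketch under assumptions, not the patent's stated procedure; `cut_rings` and the `parents` mapping are hypothetical names.

```python
def cut_rings(parents, start):
    """Return a copy of parents with ring-closing links removed.

    parents: {node: [parent, ...]}, i.e. links point from a node to the
    nodes it is derived from. The input mapping is left unchanged.
    """
    pruned = {node: list(links) for node, links in parents.items()}
    on_path, done = set(), set()

    def visit(node):
        on_path.add(node)
        for parent in list(pruned.get(node, [])):
            if parent in on_path:
                # This link closes a ring: cut it.
                pruned[node].remove(parent)
            elif parent not in done:
                visit(parent)
        on_path.discard(node)
        done.add(node)

    visit(start)
    return pruned


# Loop chain X <- C <- A <- X: X derives from C, C from A, A from X.
loop = {"X": ["C"], "C": ["A"], "A": ["X"]}
acyclic = cut_rings(loop, "X")
```

Starting from X, the walk reaches A and finds that A's only link leads back to X, so that link (the A <- X relation) is the one removed, as in Fig. 7.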
The method corresponding to the text classification system described above includes: using a crawler program to extract, from a resource having category association rules (e.g., at least one of an encyclopedic knowledge base and a digital library system), entries and the association rules between the entries (whose structure is, e.g., at least one of a tree structure, a chain structure, and a network structure), thereby generating the category association rule base;
generating the basic category dictionary based on the basic categories of each field; preprocessing the test text and extracting the text feature entries of the test text;
comparing the entries in the basic category dictionary with the entries in the category association rule base and, using the association rules of the entries in the category association rule base, performing weight calculation on the entries in the basic category dictionary (e.g., based on the frequency with which each entry occurs in each category of the basic category dictionary) and calculating the weights of the entries in the category association rule base, thereby implementing rule pruning. During the comparison, if an entry in the category association rule base is present in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule base. If an entry in the category association rule base is not present in the basic category dictionary, its weight is calculated according to the entry association rules between that entry and the other entries of the category association rule base that are present in the basic category dictionary. The entry association rules include one-to-one or many-to-one relationships between entries, among others. The weight calculation in the association rule base takes into account the relative depth between nodes in the category association rule base. The weights of the entries in the category association rule base can be calculated by an iterative algorithm.
using a classifier to classify the test text according to the weights of the entries and the extracted text feature entries. The classifier can be a Naive Bayes classifier that uses the weights of the entries as its prior conditional probabilities to classify the test text. The system and method of the application generate an association rule base by analyzing how a resource is organized and associated, generate a category association rule system for text keywords, and, in combination with the basic category dictionary, construct a Naive Bayes classifier that classifies the test text. It can be seen that the approach requires none of the traditional manual effort and so has no human-induced instability; it does not involve a large number of text features, which reduces the time and space complexity of the classification algorithm; and it has little noise data, which improves classification accuracy. Moreover, because the classifier is generated by comparing the above system with the dictionary and pruning the rules, handling new words, or new meanings of old words, only requires slightly modifying the association rules to update the text classifier, without the concern about text distribution consistency that affects traditional text classifiers.
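A minimal sketch of such a classifier follows, assuming that each entry weight is read as P(entry | category), that the class priors are supplied by the caller, and that entries with no weight for a category receive a small floor value. The floor, the log-space scoring, and the name `classify` are assumptions of this sketch, not details given in the text.

```python
import math


def classify(features, entry_weights, priors, floor=1e-6):
    """Score each category in log space and return the best one.

    features:      text feature entries extracted from the test text
    entry_weights: {entry: {category: weight}}, read as P(entry | category)
    priors:        {category: P(category)}
    floor:         assumed stand-in probability for missing weights
    """
    scores = {}
    for category, prior in priors.items():
        score = math.log(prior)
        for entry in features:
            probability = entry_weights.get(entry, {}).get(category, 0.0)
            score += math.log(max(probability, floor))
        scores[category] = score
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical weights for two categories.
weights = {
    "football": {"sports": 0.6},
    "goal": {"sports": 0.3, "finance": 0.1},
    "stock": {"finance": 0.5},
}
priors = {"sports": 0.5, "finance": 0.5}
label, scores = classify(["football", "goal"], weights, priors)
```

Working in log space avoids underflow when a text has many feature entries; the floor keeps a single unseen entry from driving a category's score to negative infinity.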
Corresponding to the text classification system described above, the application also provides a text classification method. The specific steps of the method correspond to the implementation of the individual modules of the text classification system, i.e., the processing of the association rule base generation module 201, the basic category dictionary generation module 202, the text preprocessing module 203, the rule pruning module 204, the classifier 205, and so on, described earlier. The method is carried out as follows:
Entries and the association rules between the entries are extracted from a resource having category association rules, to generate a category association rule base. The resource can be at least one of an encyclopedic knowledge base and a digital library system; the structure of the association rules between entries includes at least one of a tree structure, a chain structure, and a network structure; and the category association rule base can be extracted and generated by a crawler program. For the specific processing, see the processing described for the association rule base generation module 201 in the system shown in Fig. 2.
A basic category dictionary is generated based on the basic categories of each field. For the specific processing, see the processing described for the basic category dictionary generation module 202 in the system shown in Fig. 2.
The test text is preprocessed and its text feature entries are extracted. For the specific processing, see the processing described for the text preprocessing module 203 in the system shown in Fig. 2.
The entries in the basic category dictionary are compared with the entries in the category association rule base. Using the association rules of the entries in the category association rule base (for example, one-to-one or one-to-many relationships between entries), weight calculation is performed on the entries in the basic category dictionary (for example, based on the frequency with which each entry occurs in each category of the basic category dictionary), and the weights of the entries in the association rule base are calculated (for example, taking into account the relative depth between nodes in the category association rule base, and using an iterative algorithm). An example of comparison and weight assignment: when the entries in the category association rule base are compared with the entries in the basic category dictionary, if an entry in the category association rule base is present in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule base; if an entry in the category association rule base is not present in the basic category dictionary, its weight is calculated according to the entry association rules between that entry and the other entries of the category association rule base that are present in the basic category dictionary. For the specific processing, see the processing described for the rule pruning module 204 in the system shown in Fig. 2.
A classifier is used to classify the test text according to the extracted text feature entries and the calculated entry weights. The classifier can be a Naive Bayes classifier that uses the entry weights as its prior conditional probabilities to classify the test text. For the specific processing, see the processing described for the classifier 205 in the system shown in Fig. 2.
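The comparison and weight-assignment step of the method can be sketched as two small functions. The text only says that dictionary weights are based on how often an entry occurs in each category; taking the relative frequency within the category is an assumption of this sketch, as are the names `dictionary_weights` and `assign_weights` and the sample data.

```python
from collections import Counter


def dictionary_weights(category_terms):
    """Build {entry: {category: weight}} from per-category entry occurrences.

    Assumption: weight = occurrences of the entry in the category divided by
    the total number of occurrences in that category.
    """
    weights = {}
    for category, terms in category_terms.items():
        counts = Counter(terms)
        total = sum(counts.values())
        for term, count in counts.items():
            weights.setdefault(term, {})[category] = count / total
    return weights


def assign_weights(rule_base_entries, dict_weights):
    """Split rule-base entries into marked nodes and unmarked nodes.

    Entries present in the dictionary take the dictionary weight directly;
    the rest are returned for calculation through the association rules.
    """
    marked, unmarked = {}, []
    for entry in rule_base_entries:
        if entry in dict_weights:
            marked[entry] = dict_weights[entry]
        else:
            unmarked.append(entry)
    return marked, unmarked


sample = {"sports": ["football", "football", "goal"], "finance": ["stock", "goal"]}
dict_w = dictionary_weights(sample)
marked, unmarked = assign_weights(["football", "offside", "stock"], dict_w)
```

In this example "offside" is absent from the dictionary, so it stays unmarked and would receive its weight from its marked parent entries.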
The embodiments in this specification are generally described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another.
The application can be described in the general context of computer-executable instructions executed by a computer, such as program modules or units. Generally, program modules or units can include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. In general, program modules or units can be implemented in software, hardware, or a combination of the two. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units can be located in both local and remote computer storage media, including storage devices.
Those skilled in the art should understand that embodiments of the application can be provided as a method, a system, or a computer program product. Accordingly, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
Finally, it should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
Specific examples are used herein to explain the principles and implementations of the application; the description of the above embodiments is only intended to help in understanding the method of the application and its main idea. Meanwhile, a person of ordinary skill in the art may, following the idea of the application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims (14)

1. A text classification system, characterized by comprising:
an association rule base generation module (201), which extracts entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base;
a basic category dictionary generation module (202), which generates a basic category dictionary based on the basic categories of each field;
a text preprocessing module (203), for preprocessing a test text to extract text feature entries;
a rule pruning module (204), which compares the entries in the basic category dictionary with the entries in the category association rule base and, using the association rules of the entries in the category association rule base, performs weight calculation on the entries in the basic category dictionary and calculates the weights of the entries in the category association rule base;
a classifier module (205), which classifies the test text based on the weights of the entries in the basic category dictionary, the weights of the entries in the category association rule base, and the extracted text feature entries.
2. The system as claimed in claim 1, characterized in that:
the resource includes at least one of an encyclopedic knowledge base and a digital library system;
the weight calculation for the entries in the basic category dictionary is based on the frequency with which each entry occurs in each category of the basic category dictionary;
the structure of the association rules between the entries includes at least one of a tree structure, a chain structure, and a network structure.
3. The system as claimed in claim 1, characterized in that the association rule base generation module (201) is further configured to generate the category association rule base using a crawler program.
4. The system as claimed in claim 1, characterized in that the rule pruning module (204) is further configured to: compare the entries in the category association rule base with the entries in the basic category dictionary and, if an entry in the category association rule base is present in the basic category dictionary, assign the weight of that entry in the basic category dictionary to the entry in the association rule base.
5. The system as claimed in claim 4, characterized in that the rule pruning module (204) is further configured to: if an entry in the category association rule base is not present in the basic category dictionary, calculate its weight according to the entry association rules between that entry in the category association rule base and the other entries of the category association rule base that are present in the basic category dictionary.
6. The system as claimed in claim 5, characterized in that:
the entry association rules include a one-to-one relationship or a many-to-one relationship between entries;
the weight calculation in the association rule base takes into account the depth proportion or depth weight between nodes in the category association rule base;
the weights of the entries in the category association rule base are calculated by an iterative algorithm.
7. The system as claimed in claim 1, characterized in that the classifier module (205) is a Naive Bayes classifier that uses the weights of the entries in the basic category dictionary and the weights of the entries in the category association rule base as the prior conditional probabilities of the classifier to classify the test text.
8. A text classification method, characterized by comprising:
extracting entries and the association rules between the entries from a resource having category association rules, to generate a category association rule base;
generating a basic category dictionary based on the basic categories of each field;
preprocessing a test text and extracting the text feature entries of the test text;
comparing the entries in the basic category dictionary with the entries in the category association rule base and, using the association rules of the entries in the category association rule base, performing weight calculation on the entries in the basic category dictionary and calculating the weights of the entries in the association rule base;
using a classifier to classify the test text based on the weights of the entries in the basic category dictionary, the weights of the entries in the category association rule base, and the extracted text feature entries.
9. The method as claimed in claim 8, characterized in that:
the resource includes at least one of an encyclopedic knowledge base and a digital library system;
the weight calculation for the entries in the basic category dictionary is based on the frequency with which each entry occurs in each category of the basic category dictionary;
the structure of the association rules between the entries includes at least one of a tree structure, a chain structure, and a network structure.
10. The method as claimed in claim 8, characterized in that the category association rule base is extracted and generated by a crawler program.
11. The method as claimed in claim 8, characterized in that the entries in the category association rule base are compared with the entries in the basic category dictionary and, if an entry in the category association rule base is present in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule base.
12. The method as claimed in claim 11, characterized in that, if an entry in the category association rule base is not present in the basic category dictionary, its weight is calculated according to the entry association rules between that entry in the category association rule base and the other entries of the category association rule base that are present in the basic category dictionary.
13. The method as claimed in claim 12, characterized in that:
the entry association rules include a one-to-one relationship or a many-to-one relationship between entries;
the weight calculation in the association rule base takes into account the depth proportion or depth weight between nodes in the category association rule base;
the weights of the entries in the category association rule base are calculated by an iterative algorithm.
14. The method as claimed in claim 8, characterized in that the classifier is a Naive Bayes classifier that uses the weights of the entries in the basic category dictionary and the weights of the entries in the category association rule base as the prior conditional probabilities of the classifier to classify the test text.
CN201310009087.4A 2013-01-10 2013-01-10 A kind of file classification method and system Active CN103927302B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310009087.4A CN103927302B (en) 2013-01-10 2013-01-10 A kind of file classification method and system
HK15100449.0A HK1200040A1 (en) 2013-01-10 2015-01-15 Method and system for text categorization

Publications (2)

Publication Number Publication Date
CN103927302A CN103927302A (en) 2014-07-16
CN103927302B true CN103927302B (en) 2017-05-31

Family

ID=51145525

Country Status (2)

Country Link
CN (1) CN103927302B (en)
HK (1) HK1200040A1 (en)

Also Published As

Publication number Publication date
HK1200040A1 (en) 2015-07-31
CN103927302A (en) 2014-07-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 1200040; Country of ref document: HK)
GR01 Patent grant
REG Reference to a national code (Ref country code: HK; Ref legal event code: GR; Ref document number: 1200040; Country of ref document: HK)
Country of ref document: HK