CN103927302B - Text classification method and system
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The present application provides a text classification system and method. The method includes: extracting entries, and the association rules between them, from a resource that carries classification association rules, to generate a classification association rule base; generating a basic classification dictionary from the basic categories of each field; preprocessing a test text and extracting its feature word set; comparing the entries in the basic classification dictionary with those in the classification association rule base, computing weights for the entries in the basic classification dictionary from the entries' association rules, and computing the entry weights in the rule base; and classifying the test text with a classifier according to the extracted feature word set and the computed entry weights. This technical scheme avoids the cross-language platform barrier of traditional text classification under different language environments; moreover, for new words and new senses of old words, only slight modifications to the tree-shaped association rules are needed to obtain a new text classifier, without the traditional classifier's worry that training texts be evenly distributed across classes.
Description
Technical field
The present application relates to the field of text processing, and in particular to a text classification method and system.
Background technology
Text classification is among the most common text-processing tasks. It generally comprises processes such as text representation, classifier selection and training, and evaluation of and feedback on the classification results; text representation can be further subdivided into steps such as text preprocessing, indexing and statistics, and feature extraction. The text classification flow is shown in Figure 1. Preprocessing converts the raw corpus into a uniform format to ease subsequent uniform processing; indexing mainly decomposes documents into basic processing units while reducing the cost of later processing; the main work of statistics is to tally word frequencies and the dependent probabilities between terms and categories, generating an association rule base; feature extraction pulls from each document the features that reflect its topic; and the main work of the classifier is to classify the text by matching the test text's feature vector against the generated association rule base. After classification is complete, the classification results are analyzed to further optimize the classification rules, enrich the training corpus, and so on.
Current research on text classification techniques focuses mainly on feature value extraction and classifier modeling. Some existing domestic patented techniques in text classification are summarized below:
A short-text classification method and text classification system based on domain knowledge addresses the shortcoming that traditional text classification techniques in information technology handle short texts poorly. A training-data acquisition module obtains training data to build a learning database; a data-processing module performs information extraction on the learning database, turning unstructured data into structured data; a text representation module expresses the data mathematically using a vector space model; a feature extraction module ranks the importance of the term set by the TFIDF algorithm; and a model-building module assigns a different weight to each term and classifies according to preset classification rules. The method and system innovate on the traditional classifier by introducing the concept of domain terms, effectively increasing the information content of short texts; in particular, for short-text data such as web commodity data, semantic analysis based on different term sets is performed and the analysis results are injected into the classifier, supplementing the commodity data with new information and thereby improving classification accuracy.
A text classification method based on block division and positional weighting works as follows: after basic preprocessing of an input training or test text, the paragraph information in the text is extracted; each paragraph is treated as a basic text block, the block statistics are analyzed, and the text content is re-divided into blocks (including operations such as merging blocks) according to the block-size distribution or predefined block ratios. Feature words are extracted and their weights quantified, the posterior probability of each feature word with respect to each category is obtained, the distribution of feature words whose maximum-a-posteriori category agrees with the text's category label is analyzed, and finally the text vector is produced; a classifier then completes model training or text classification. The method targets the text representation stage of a text classification system: by enriching the information about the text content carried by the traditional feature-word text vector, it improves classification performance.
A text classification feature selection and weight computation method based on domain knowledge combines sample statistics with domain terms to construct a domain classification feature space, uses external domain knowledge relations to compute similarities between terms, and adjusts the corresponding dimension weights of the classification feature vectors accordingly. A support vector machine learning algorithm then builds the domain text classification model and performs domain text classification. Tests classifying Yunnan-tourism versus non-tourism texts show that the method improves classification accuracy by 4 percentage points over the improved-TFIDF feature weighting method.
A two-stage combined text classification method based on probabilistic topics works as follows. First-level classification: based on naive Bayes, test texts are classified using probabilistic topic features and a rejection criterion. Second-level classification: texts rejected at the first level are classified using traditional feature-word extraction. This hierarchical combination exploits the characteristics of different classifiers: most texts can be correctly classified quickly at the first level, greatly improving the efficiency of the text classification system and providing a good processing mode for making such systems practical; the probabilistic topics are designed around the characteristics of texts, and under suitable rejection conditions they complete the bulk of the classification task with very high accuracy. Experiments show that, compared with a single traditional classifier, the two-stage combination greatly reduces time cost while improving classification accuracy.
As shown in Figure 1, a traditional text classification system first requires formulating a reference category system with well-bounded categories, and then collecting, according to that system, a sufficient number of texts representative of each category as training samples; this step is often the most time-consuming part of traditional text classification work. After a sufficiently large and representative training text collection 101 has been gathered, each individual text is preprocessed to obtain the processed training texts 102; preprocessing includes Chinese word segmentation, stop-word list generation, Chinese feature selection, text vector representation, and similar work. Many mature Chinese segmentation methods exist, such as CDWS, n-gram, and HMM. Function words carrying only grammatical information not only appear frequently in articles but are almost meaningless to segmentation; that is, they disturb classification, inflate the text's dimensionality, and hurt classification efficiency. If the raw data are web pages, structural noise such as plug-ins, headers, and footers must also be stripped. Automatic construction of stop-word lists is still immature; at present it is mainly done by importing existing general stop-word lists and manually marking project-specific stop words, which takes time and introduces a certain human instability into the system. A large number of text features, on the one hand, increases the space and time complexity of the classification algorithm, and on the other hand may include a large amount of noise data, ultimately hurting classification precision. Mainstream feature value weighting schemes include TFIDF, information gain, mutual information, chi-square statistics, and cross entropy. Feature selection on the processed training texts 102 yields the feature dictionary 103; as text length and text volume grow, the cost of computing feature values also grows roughly linearly. After the feature-based training text vectors 104 are selected, the traditional method generates the association rule base 105 by mining frequent itemsets, and then produces the classifier 106 by rule pruning and similar steps. A test text 107 undergoes similar preprocessing to give the processed test text 108; after the feature-based test text vector 109 is selected, classification by classifier 106 yields the category 200.
Moreover, in existing classification methods, the key rule base has limited ability to adapt dynamically to new words and stop words. With the continuing development of computer technology and the rapid spread of the Internet, more and more people use the Internet for information retrieval. The massive resources of the network and its continually emerging textual resources constantly challenge the extensibility and adaptability of existing association rule bases. The master's thesis of Su Xiaokang at Central China Normal University, "Building a semantic knowledge base from Wikipedia and its application in text classification," starts from the massive real texts on the Internet (such as Wikipedia) and proposes a method for constructing a taxonomy database. The method uses semantic labels for reference and semantic fingerprints as a formal representation of semantics, extracts a corpus of a certain scale, mines the link relations between Wikipedia pages, and automatically builds a semantic knowledge base. However, the emphasis of this prior art is on providing a semantic knowledge base; it does not give a corresponding text classification technique based on the association rules in such a base.
Content of the invention
To address the defects of existing text classification techniques, the technical scheme of the present application provides a method and system that automatically generates an association rule base from a resource and, combined with a basic classification dictionary, implements text classification: for example, based on entry association rules and the basic classification dictionary, by analyzing the organized link structure of the resource, a taxonomy of text keyword association rules is generated, a naive Bayes classifier is constructed, and the test text is classified.
A text classification system of the present application includes: an association rule base generation module, which extracts entries, and the association rules between them, from a resource carrying classification association rules, to generate a classification association rule base; a basic classification dictionary generation module, which generates a basic classification dictionary based on the basic categories of each field; a text preprocessing module, which preprocesses the test text to extract its feature entries; a rule pruning module, which compares the entries in the basic classification dictionary with those in the classification association rule base, uses the association rules of the entries in the rule base to compute weights for the entries in the basic classification dictionary, and computes the entry weights in the rule base; and a classifier module, which classifies the test text based on the entry weights and the extracted feature entries.
A text classification method corresponding to the system of the present application includes: extracting entries, and the association rules between them, from a resource carrying classification association rules, to generate a classification association rule base; generating a basic classification dictionary based on the basic categories of each field; preprocessing the test text and extracting its feature word set; comparing the entries in the basic classification dictionary with those in the classification association rule base, using the association rules of the entries in the rule base to compute weights for the entries in the basic classification dictionary, and computing the entry weights in the rule base; and classifying the test text with a classifier according to the extracted feature word set and the computed entry weights.
The technical scheme of the present application performs text classification based on classification association rules and a basic classification dictionary, avoiding the cross-language platform barrier of traditional text classification under different language environments. At the same time, for new words and new senses of old words, slight modifications to association rules of any kind (tree, mesh, chain, etc.) suffice to obtain a new text classifier, without the traditional classifier's worry that training texts be evenly distributed across classes.
Brief description of the drawings
To explain the technical schemes of the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of traditional text classification.
Fig. 2 is the text classification system diagram of a specific embodiment of the present application.
Fig. 3 shows tree-shaped association rule examples of a specific embodiment of the present application.
Fig. 4 shows a complex association rule example of a specific embodiment of the present application.
Fig. 5 shows a single link whose root node is unmarked, in a specific embodiment of the present application.
Fig. 6 shows multiple links, some of whose root nodes are unmarked, in a specific embodiment of the present application.
Fig. 7 shows the loop-chain pruning strategy of a specific embodiment of the present application.
Specific embodiments
The technical schemes in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments that those of ordinary skill in the art obtain from the embodiments in the present application without creative effort fall within the scope of protection of the present application.
The present application is a technique that performs text classification based on classification association rules and a basic classification dictionary. The following specific embodiments take Wikipedia as the construction basis of the classification association rule base by way of example, but are not limited to it. Wikipedia is a multilingual encyclopedic collaboration project based on wiki technology, an online encyclopedia written in many languages; its goal and purpose is to provide every person a free encyclopedia, written in the language of their choice. By November 2011, more than 31.72 million registered users and many unregistered users had contributed more than 20.24 million entries in 282 languages, edited more than 1.23 billion times in total. Because Wikipedia has a complete multilingual reference category system, a textual association rule taxonomy built from its tree-structured association rules can be applied across different language systems. Other similar encyclopedic databases, such as Baidu Baike or the China Encyclopedia Online, as well as the classified index entries of digital libraries, can also serve as the construction basis of the association rule base. The method and system of the present application apply equally to association rules of other topologies, such as mesh or chain structures (any one, or at least one, of them, or combinations thereof). The technical scheme of the present application is introduced below with reference to the drawings.
Fig. 2 is the text classification system diagram of the present application. The system comprises an association rule base generation module 201, a basic classification dictionary generation module 202, a text preprocessing module 203, a rule pruning module 204, and a classifier 205. This embodiment uses tree-shaped text classification association rules based on Wikipedia, but is not limited to them.
The association rule base generation module 201 extracts entries, and the rules between them, from a resource with classification association rules, to generate the classification association rule base, in which these entries and the association rules between them are stored.
For example, data mining or crawler technology can be used to extract entries and the rules between them from an Internet resource with a certain classification association structure, such as Wikipedia, to generate the classification association rule base; e.g., a crawler tool crawls Wikipedia's classified index entries and saves them into a database. The crawler must, according to some web page analysis algorithm, filter out links unrelated to the topic, keep the relevant links and put them into a URL queue awaiting crawling, then select the next URL to fetch from the queue according to some search strategy, and repeat the fetch-and-select actions until a stopping condition is reached. Taking a crawler program performing web page capture as an example, the fetched pages are stored by the system, then analyzed, filtered, and indexed for later query and retrieval; the usual search strategies are depth-first and breadth-first. The crawler program obtains Wikipedia's classified index entries, for example: communication -> mobile phone -> frequency, i.e. three entries plus the association relations among the three entries — in this example one-to-one parent-child rules: communication (parent), mobile phone (child); mobile phone (parent), frequency (child) — though a one-to-many rule, i.e. one parent with many children, is also possible. The entries and the association rules between them (the classification association rules) are saved into a database, forming for example: page classification -> society -> military -> military science -> wire-type combat; this generates the classification association rule base, i.e. the association rule base. A reference link: http://zh.wikipedia.org/wiki/Wikipedia:%E5%88%86%E9%A1%9E%E7%B4%A2%E5%BC%95
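The queue-based crawl described above can be sketched as follows. This is a minimal illustration of the breadth-first strategy over an in-memory link graph standing in for fetched category index pages; the graph contents and function names are illustrative, not part of the patent:

```python
from collections import deque

def crawl_category_rules(links, seed):
    """Breadth-first walk over a category link graph, emitting
    parent -> child association rules.

    links: dict mapping a category page to the child categories it links to
    (an in-memory stand-in for fetched Wikipedia index pages).
    Returns (rules, visit_order): rules is a list of (parent, child) pairs.
    """
    queue = deque([seed])
    seen = {seed}
    rules, order = [], []
    while queue:
        page = queue.popleft()           # next URL to fetch
        order.append(page)
        for child in links.get(page, []):
            rules.append((page, child))  # record the association rule
            if child not in seen:        # skip already-queued links
                seen.add(child)
                queue.append(child)
    return rules, order
```

A real crawler would replace the dict lookup with page fetching and link filtering, but the queue discipline is the same.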
The basic classification dictionary generation module 202 generates the basic classification dictionary; the basic categories of each field that it generates are used for category lookup, to determine the text's classification field. The module 202 generates the basic classification dictionary based on the existing basic categories of each field; it can also download an existing basic classification dictionary, such as the QQ input method's classified lexicons, and save it to a local file for later use. A reference link: http://dict.py.qq.com/
Based on the above classification association rules and basic classification dictionary, text association rules can be built using, e.g., tree-structured association rules; this helps them apply to different language systems, across language platforms. The classification association rule base and the basic classification dictionary thus prepare the data for the subsequent pruning and for text classification.
The text preprocessing module 203 processes the test text and extracts its feature entries. It implements text feature vector extraction, preprocessing the test text: a text under test is processed by simple Chinese word segmentation, function words such as auxiliary words and adverbs are rejected, the text's keyword list is obtained, and keyword scores are computed, for example by TFIDF and keyword length, as the input of the classifier 205.
The rule pruning module 204 compares the entries in the basic classification dictionary with those in the classification association rule base, uses the entry association rules in the rule base to compute weights for the entries in the basic classification dictionary, and computes the weights of the entries in the rule base. Before classification, module 204 processes the above classification association rules and basic classification dictionary: by comparing, analyzing, computing, modifying, and simply shifting the entry association rules (built with, e.g., a tree structure) against the basic classification dictionary, it ensures cross-language-platform text classification without the traditional classifier's need for evenly distributed training texts. According to the rule pruning algorithm, based on the data prepared by the association rule base generation module 201 and the basic classification dictionary generation module 202, the association rules of the entries in the rule base are analyzed, the weights of the entries in the basic classification dictionary are computed, and the weight information is handed to the classifier 205 for use in classification. The working principle of the rule pruning module 204 is described below.
The classifier module 205 classifies the test text using the computed entry weights as the prior conditional probabilities of a naive Bayes classifier, i.e. the conditional probability that an entry belongs to a certain category, finally producing the required classification of the text. This embodiment uses a naive Bayes classifier for text classification; other types of text classifiers, suitably modified, can also be applied in the present application.
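A minimal sketch of how such a classifier might consume the entry weights follows. This is an illustrative naive-Bayes-style scorer, not the patent's implementation; the smoothing constant and the sum-of-logs scoring are assumptions:

```python
import math

def classify(feature_terms, term_weights, classes):
    """Score each class by summing log-weights of the test text's feature
    terms, naive-Bayes style, and return the best class.

    term_weights: {term: {class: weight}} as produced by the rule-pruning
    stage; the weights play the role of conditional probabilities
    P(class | term). Unseen (term, class) pairs are smoothed with eps.
    """
    eps = 1e-6  # smoothing so log() never sees zero
    scores = {}
    for c in classes:
        s = 0.0
        for t in feature_terms:
            s += math.log(term_weights.get(t, {}).get(c, 0.0) + eps)
        scores[c] = s
    return max(scores, key=scores.get), scores
```

In a full system the scores would also be combined with a class prior; here only the term-conditional weights are used.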
The working principle of the rule pruning module 204 is now described. The module 204 receives the entry association rules from the association rule base module 201 and the basic classification dictionary entries generated by the basic classification dictionary generation module 202, and computes the weights of the entries in the dictionary. Its pruning process includes: 1) estimating the basic classification dictionary weights; 2) initializing the Wiki-tree classification association rule base (initializing the association relations); 3) computing the weights of all nodes in the association rule base with an iterative algorithm; 4) handling special nodes; 5) cutting ring-shaped relations in the association rules.
1) Estimating the basic classification dictionary weights
Suppose there are N dictionaries of different categories, CD = {Dict_1, Dict_2, ..., Dict_N}, each Dict file containing M words/phrases: Dict_i = {Word_i1, Word_i2, ..., Word_iM}. For each word/phrase in each dictionary, compute the weight w_weight = 1/DF, where DF (Dictionary Frequency) is the number of different dictionaries in which the current word appears, and generate the basic classification dictionary weight table. For example, the phrase "chip frequency" appears in both the {D_computer} class and the {D_phone} class; its DF is therefore 2, and the conditional probability that it belongs to the {D_computer} class is computed as P({D_computer} | "chip frequency") = 1/2. The entries and their corresponding weights are collected into the weight table.
2) Initializing the Wiki tree-structured association rule base
Query the entries in the classification association rule base, for example the entries in Wikipedia's tree-shaped text classification taxonomy. If the current entry exists in the weight table of the basic classification dictionary, assign it the weight from the weight table and denote the current node a "marked node"; otherwise its category information is null (an unmarked node). The information of each node is stored in the classification association rule base, including all the categories associated with the node and their corresponding weights, in a form such as: {category 1: weight 1, category 2: weight 2, ...}. For example:
The Wikipedia entry "Weaving City Street" does not appear in any dictionary, so the "Weaving City Street" node is left untouched;
The Wikipedia entry "turbocharging" exists only in the dictionary class {D_machinery}, so the "turbocharging" node is assigned W_{turbocharging} = {D_machinery: 1};
The Wikipedia entry "chip frequency" appears in both the {D_computer} class and the {D_phone} class, so the "chip frequency" node is assigned W_{chip frequency} = {D_computer: 0.5, D_phone: 0.5}.
3) Computing weights for all nodes in the classification association rule base with an iterative algorithm
After the marked nodes are initialized, the following association rules must be considered for unmarked nodes, e.g. one-to-one and one-to-many association rules between entries.
Query the entries in the classification association rule base; if an entry is absent from the weight table of the basic classification dictionary, set its category information to 0 and store the current entry in the rule base as an unmarked node. Then compute the current entry's weight from the rule relations between it and the entries that do exist in the basic classification dictionary, such as one-to-one and one-to-many (see a), b), c)).
a) When an unmarked node X, searching from bottom to top, reaches a marked node A with a one-to-one ("1-1") single-chain relation (Fig. 3(a)), the depth ratio of X to A is 1 and the association rule of X to A is X = A, so w_weight_X = w_weight_A.
b) When node X has a one-to-many ("1-n") multi-chain relation with upper-layer nodes {A, B, ...} (Fig. 3(b)), and the n parent nodes have the same depth and are all marked, the node's depth weight toward each parent is 1/n, and the classification association rule of X is the combination over all parent nodes (assuming here that A, B, ... are mutually independent):
W_X = sum over i = 1..n of (1/n) * W_{A_i}
(e.g. in Fig. 3(b) the weight of X is 1/3 for each of {A, B, C}).
c) When node X has a "1-n" association rule with upper-layer nodes of unequal depths (Fig. 3(c)), let depth(X, A) be the number of links from X up to parent A; the depth weight between node X and node A is
d_{X,A} = 1 / depth(X, A)
With R denoting the number of marked parent nodes related to X (in Fig. 3(c), R = 2, because the two marked nodes A and B are related to X), the classification association rule of X is:
W_X = sum over related parents A of ( d_{X,A} / sum over r = 1..R of d_{X,r} ) * W_A
where the sum runs over all related marked parent nodes. As shown in Fig. 3(c), the depth from node A to node X is 3 and that from B to X is 4, so d_{X,A} = 1/3 and d_{X,B} = 1/4; from these the relation of node X to nodes A and B can be computed.
d) The classification weights of entries with multiple parent nodes in the association rules are computed by repeated iteration. This is a kind of complex association rule: for a node to be computed (such as the dashed-border node in Fig. 4), an iterative algorithm is used — computing node X requires computing, level by level, its parent nodes and the node information associated with those parents, and thereby determining the association rules with the marked nodes A and B.
4) Handling of special nodes, for example links containing an unlabeled "root" node during association rule initialization:
Because a root node's association rules (relations) with the other entries in the basic category dictionary cannot be known, its weight, category, and association rules are determined as follows.
A) In the single-link case, shown in Fig. 5, the category value of the root node is Null, i.e. no category can be determined, and the weight is 0; the root node is labeled {category: Null, weight: 0} as in Fig. 5. The child nodes on the corresponding link inherit the same attributes and weight, i.e. each child node is also labeled {category: Null, weight: 0}.
B) When there are multiple links and some of the link root nodes are unlabeled, as shown in Fig. 6, the attributes of the current node are computed with the same weighting method as in 3(c).
In Fig. 6, root node A is labeled {category: Null, weight: 0} and root node B is labeled {category: category 1, weight: 0.7}; child node X is therefore labeled {Null category: (1/3)/(1/3+1/2) * 0, category 1: (1/2)/(1/3+1/2) * 0.7}.
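As a sketch, the Fig. 6 combination (an unlabeled root contributing category Null with weight 0, plus a labeled root) can be written as follows; the data structures are illustrative assumptions:

```python
def combine_labels(parents):
    """Each parent is a tuple (category, weight, d), where d is the depth
    weight d(X, parent). A child's score for each category is
    (d / sum of all d) * parent_weight; an unlabeled root contributes
    category None (Null) with weight 0."""
    total = sum(d for _, _, d in parents)
    scores = {}
    for category, weight, d in parents:
        scores[category] = scores.get(category, 0.0) + (d / total) * weight
    return scores

# Fig. 6: root A is {category: Null, weight: 0} with d = 1/3,
#         root B is {category: category 1, weight: 0.7} with d = 1/2.
labels_X = combine_labels([(None, 0.0, 1/3), ("category 1", 0.7, 1/2)])
```

This reproduces the labeling of child node X in the text: category 1 gets (1/2)/(1/3+1/2) * 0.7 = 0.42, and the Null category gets 0.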
5) Cutting cyclic relations in the association rules
Complex association rules may contain rings; Fig. 7 shows such a cyclic chain of association rules. When computing the category and weight of node X, a ring such as X <- C <- ... <- A <- X appears on the left side; the ring is resolved by cutting the final A <- X relation, as shown by the dashed line in Fig. 7. Alternatively, the problem can be solved by manual labeling or by computing the category and weight of A first.
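One way to realize the cutting (a sketch, not the patent's exact procedure) is a depth-first search over the child-to-parent edges that removes any edge closing a ring:

```python
def cut_ring_edges(parents):
    """parents maps node -> list of parent nodes (child <- parent edges).
    DFS along parent edges; an edge pointing back to a node on the current
    DFS path closes a ring and is cut, mirroring the cutting of the final
    A <- X relation in Fig. 7. For this sketch, nodes already fully visited
    in an earlier DFS are not re-entered."""
    cut = []
    visited = set()

    def dfs(node, path):
        visited.add(node)
        path.add(node)
        for parent in list(parents.get(node, [])):
            if parent in path:                 # back edge: closes a ring
                parents[node].remove(parent)
                cut.append((node, parent))
            elif parent not in visited:
                dfs(parent, path)
        path.remove(node)

    for node in list(parents):
        if node not in visited:
            dfs(node, set())
    return cut

# Fig. 7-style ring X <- C <- A <- X: exactly one edge is cut.
graph = {"X": ["C"], "C": ["A"], "A": ["X"]}
removed = cut_ring_edges(graph)
```

Starting the search from X, the edge removed is the A <- X relation, so the remaining graph is acyclic and X's category and weight can be computed as before.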
The method corresponding to the text classification system described above includes: using, for example, a crawler program, extracting entries and the association rule structures between the entries (e.g., at least one of a tree structure, a chain structure, and a network structure) from a resource carrying category association rules (e.g., at least one of an encyclopedic knowledge base and a digital library system), to generate a category association rule library;
generating a basic category dictionary based on the basic categories of each field; preprocessing the test text and extracting its text feature set;
comparing the entries in the basic category dictionary with the entries in the category association rule library; using the association rules of the entries in the category association rule library, performing weight calculation on the entries in the basic category dictionary (e.g., based on the frequency with which an entry occurs in each category of the basic category dictionary), calculating the weights of the entries in the category association rule library, and implementing rule pruning: during comparison, if an entry in the category association rule library exists in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule library; if an entry in the category association rule library does not exist in the basic category dictionary, its weight is calculated from the entry association rules between that entry and other entries of the category association rule library that do exist in the basic category dictionary. The entry association rules include one-to-one or many-to-one relations between entries. The weight calculation in the association rule library takes into account the relative depth between nodes in the category association rule library. The weights of the entries in the category association rule library can be calculated by an iterative algorithm.
Using a classifier, the test text is classified according to the entry weights and the extracted text feature entries; the classifier can be a Naive Bayes classifier that uses the entry weights as its prior conditional probabilities for classifying the test text. With the system and method of this application, the association rule library is generated by analyzing the organizational associations of the resource, a text-keyword association rule classification system is generated, and a Naive Bayes classifier is constructed in combination with the basic category dictionary to classify the test text. As a result, the traditional manual overhead and the instability introduced by human labeling are avoided; because no large text feature sets are involved, the time and space overhead of the classification algorithm is reduced; and with less noisy data, classification precision is improved. Moreover, because the classifier is generated by rule pruning through the comparison of the above system with the dictionary, new words and new senses of old words require only slight modifications to the association rules to update the text classifier, without the text-branch consistency problems of traditional text classifiers.
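A toy Naive Bayes step in this spirit, using per-category entry weights as the prior conditional probabilities (all names and numbers are illustrative, not from the original):

```python
import math

def naive_bayes_classify(features, class_term_weights, class_priors):
    """Toy Naive Bayes over extracted feature entries: the per-category entry
    weights act as the prior conditional probabilities P(term | category).
    Terms unseen for a category get a small smoothing value."""
    scores = {}
    for cls, term_weights in class_term_weights.items():
        log_p = math.log(class_priors.get(cls, 1.0))
        for term in features:
            log_p += math.log(term_weights.get(term, 1e-6))
        scores[cls] = log_p
    return max(scores, key=scores.get)

label = naive_bayes_classify(
    features=["classifier", "entry", "weight"],
    class_term_weights={
        "IT":     {"classifier": 0.7, "entry": 0.5, "weight": 0.4},
        "sports": {"classifier": 0.01, "entry": 0.05, "weight": 0.2},
    },
    class_priors={"IT": 0.5, "sports": 0.5},
)
```

For these illustrative weights the "IT" category wins; updating the classifier for a new word amounts to adjusting the corresponding entry weights, consistent with the rule-modification claim above.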
In accordance with the text classification system above, the present application also provides a corresponding text classification method. The implementation steps of the method correspond to the implementations of the modules of the text classification system, i.e., the processing of the aforementioned association rule library generation module 201, basic category dictionary generation module 202, text preprocessing module 203, rule pruning module 204, classifier 205, etc. The method is implemented as follows:
Extract entries and the association rules between the entries from a resource carrying category association rules, to generate a category association rule library. The resource can be at least one of an encyclopedic knowledge base and a digital library system; the structure of the association rules between entries includes at least one of a tree structure, a chain structure, and a network structure; and the category association rule library can be generated by extraction with a crawler program. The specific processing follows the processing described for the association rule library generation module 201 in the system of Fig. 2.
Generate the basic category dictionary based on the basic categories of each field. The specific processing follows that described for the basic category dictionary generation module 202 in the system of Fig. 2.
Preprocess the test text and extract its text feature set. The specific processing follows that described for the text preprocessing module 203 in the system of Fig. 2.
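Test-text preprocessing of this kind might look as follows (a sketch; the stop-word list and the top-N cut are illustrative choices, not specified by the original):

```python
import re
from collections import Counter

def extract_features(text, stop_words=frozenset({"the", "a", "of", "and"}),
                     top_n=5):
    """Sketch of test-text preprocessing: tokenize, drop stop words,
    and keep the most frequent terms as the text feature set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return [term for term, _ in counts.most_common(top_n)]

features = extract_features(
    "The classifier weights the entry and the entry weights")
```

The resulting feature entries are then compared against the basic category dictionary and the association rule library in the next step.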
Compare the entries in the basic category dictionary with the entries in the category association rule library, and use the association rules of the entries in the category association rule library (for example, the entry association rules may be one-to-one or many-to-one relations between entries) to perform weight calculation on the entries in the basic category dictionary (for example, an entry's weight may be calculated from the frequency with which the entry occurs in each category of the basic category dictionary) and to calculate the weights of the entries in the association rule library (for example, the weight calculation may take into account the relative depth between nodes in the category association rule library, and the calculation may use an iterative algorithm). Comparison and weight assignment proceed, for example, as follows: when the entries in the category association rule library are compared with the entries in the basic category dictionary, if an entry in the category association rule library exists in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule library; and if an entry in the category association rule library does not exist in the basic category dictionary, its weight is calculated from the entry association rules between that entry and other entries of the category association rule library that exist in the basic category dictionary. The specific processing follows that described for the rule pruning module 204 in the system of Fig. 2.
Using a classifier, classify the test text according to the extracted text feature set and the calculated entry weights. The classifier can be a Naive Bayes classifier that uses the entry weights as its prior conditional probabilities for classifying the test text. The specific processing follows that described for the classifier 205 in the system of Fig. 2.
Each embodiment in this specification is described progressively; each embodiment emphasizes its differences from the other embodiments, and identical or similar parts of the embodiments can be understood with reference to one another.
The application can be described in the general context of computer-executable instructions, such as program modules or units. Generally, program modules or units include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. In general, program modules or units can be realized by software, hardware, or a combination of both. The application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules or units can be located in both local and remote computer storage media, including storage devices.
Those skilled in the art should understand that the embodiments of the application can be provided as a method, a system, or a computer program product. Accordingly, the application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
Finally, it should also be noted that the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes that element.
This specification uses specific examples to explain the principles and implementations of the application; the above embodiments are described only to help in understanding the methods and main ideas of the application. Meanwhile, for those of ordinary skill in the art, the specific implementations and the scope of application will vary in accordance with the ideas of the application. In summary, the content of this specification should not be construed as limiting the application.
Claims (14)
1. A text classification system, characterized by comprising:
an association rule library generation module (201) that extracts entries and the association rules between the entries from a resource carrying category association rules, to generate a category association rule library;
a basic category dictionary generation module (202) that generates a basic category dictionary based on the basic categories of each field;
a text preprocessing module (203) for preprocessing a test text to extract text feature entries;
a rule pruning module (204) that compares the entries in the basic category dictionary with the entries in the category association rule library, performs weight calculation on the entries in the basic category dictionary using the association rules of the entries in the category association rule library, and calculates the weights of the entries in the category association rule library; and
a classifier module (205) that classifies the test text based on the weights of the entries in the basic category dictionary, the weights of the entries in the category association rule library, and the extracted text feature entries.
2. The system of claim 1, characterized in that:
the resource comprises at least one of an encyclopedic knowledge base and a digital library system;
the weight calculation of an entry in the basic category dictionary is based on the frequency with which the entry occurs in each category of the basic category dictionary; and
the association rule structures between the entries comprise at least one of a tree structure, a chain structure, and a network structure.
3. The system of claim 1, characterized in that the association rule library generation module (201) is further configured to generate the category association rule library using a crawler program.
4. The system of claim 1, characterized in that the rule pruning module (204) is further configured to: compare the entries in the category association rule library with the entries in the basic category dictionary, and, if an entry in the category association rule library exists in the basic category dictionary, assign the weight of that entry in the basic category dictionary to the entry in the association rule library.
5. The system of claim 4, characterized in that the rule pruning module (204) is further configured to: if an entry in the category association rule library does not exist in the basic category dictionary, calculate its weight from the entry association rules between that entry and other entries of the category association rule library that exist in the basic category dictionary.
6. The system of claim 5, characterized in that:
the entry association rules comprise one-to-one or many-to-one relations between entries;
the weight calculation in the association rule library takes into account depth proportions or depth weights between nodes in the category association rule library; and
the weights of the entries in the category association rule library are calculated by an iterative algorithm.
7. The system of claim 1, characterized in that the classifier module (205) is a Naive Bayes classifier that classifies the test text using the weights of the entries in the basic category dictionary and the weights of the entries in the category association rule library as its prior conditional probabilities.
8. A text classification method, characterized by comprising:
extracting entries and the association rules between the entries from a resource carrying category association rules, to generate a category association rule library;
generating a basic category dictionary based on the basic categories of each field;
preprocessing a test text and extracting its text feature set;
comparing the entries in the basic category dictionary with the entries in the category association rule library, performing weight calculation on the entries in the basic category dictionary using the association rules of the entries in the category association rule library, and calculating the weights of the entries in the association rule library; and
using a classifier to classify the test text based on the weights of the entries in the basic category dictionary, the weights of the entries in the category association rule library, and the extracted text feature entries.
9. The method of claim 8, characterized in that:
the resource comprises at least one of an encyclopedic knowledge base and a digital library system;
the weight calculation of an entry in the basic category dictionary is based on the frequency with which the entry occurs in each category of the basic category dictionary; and
the association rule structures between the entries comprise at least one of a tree structure, a chain structure, and a network structure.
10. The method of claim 8, characterized in that the category association rule library is generated by extraction with a crawler program.
11. The method of claim 8, characterized in that the entries in the category association rule library are compared with the entries in the basic category dictionary, and if an entry in the category association rule library exists in the basic category dictionary, the weight of that entry in the basic category dictionary is assigned to the entry in the association rule library.
12. The method of claim 11, characterized in that if an entry in the category association rule library does not exist in the basic category dictionary, its weight is calculated from the entry association rules between that entry and other entries of the category association rule library that exist in the basic category dictionary.
13. The method of claim 12, characterized in that:
the entry association rules comprise one-to-one or many-to-one relations between entries;
the weight calculation in the association rule library takes into account depth proportions or depth weights between nodes in the category association rule library; and
the weights of the entries in the category association rule library are calculated by an iterative algorithm.
14. The method of claim 8, characterized in that the classifier is a Naive Bayes classifier that classifies the test text using the weights of the entries in the basic category dictionary and the weights of the entries in the category association rule library as its prior conditional probabilities.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310009087.4A CN103927302B (en) | 2013-01-10 | 2013-01-10 | A kind of file classification method and system |
HK15100449.0A HK1200040A1 (en) | 2013-01-10 | 2015-01-15 | Method and system for text categorization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310009087.4A CN103927302B (en) | 2013-01-10 | 2013-01-10 | A kind of file classification method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103927302A CN103927302A (en) | 2014-07-16 |
CN103927302B true CN103927302B (en) | 2017-05-31 |
Family
ID=51145525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310009087.4A Active CN103927302B (en) | 2013-01-10 | 2013-01-10 | A kind of file classification method and system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN103927302B (en) |
HK (1) | HK1200040A1 (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104199959A (en) * | 2014-09-18 | 2014-12-10 | 浪潮软件集团有限公司 | Text classification method for Internet tax-related data |
CN105528356B (en) * | 2014-09-29 | 2019-01-18 | 阿里巴巴集团控股有限公司 | Structured tag generation method, application method and device |
CN104462347B (en) * | 2014-12-04 | 2018-05-18 | 北京国双科技有限公司 | The sorting technique and device of keyword |
CN104679728B (en) * | 2015-02-06 | 2018-08-31 | 中国农业大学 | A kind of text similarity detection method |
CN105512270B (en) * | 2015-12-04 | 2020-02-21 | 上海优扬新媒信息技术有限公司 | Method and device for determining related objects |
CN106570109B (en) * | 2016-11-01 | 2020-07-24 | 深圳市点通数据有限公司 | Method for automatically generating question bank knowledge points through text analysis |
CN108090040B (en) * | 2016-11-23 | 2021-08-17 | 北京国双科技有限公司 | Text information classification method and system |
CN107357895B (en) * | 2017-01-05 | 2020-05-19 | 大连理工大学 | Text representation processing method based on bag-of-words model |
CN108804408A (en) * | 2017-04-27 | 2018-11-13 | 安徽富驰信息技术有限公司 | Information extraction system based on domain-specialist knowledge system and information extraction method |
CN107506472B (en) * | 2017-09-05 | 2020-09-08 | 淮阴工学院 | Method for classifying browsed webpages of students |
CN108280164B (en) * | 2018-01-18 | 2021-10-01 | 武汉大学 | Short text filtering and classifying method based on category related words |
CN108520030B (en) * | 2018-03-27 | 2022-02-11 | 深圳中兴网信科技有限公司 | Text classification method, text classification system and computer device |
CN108509638B (en) * | 2018-04-11 | 2023-06-27 | 联想(北京)有限公司 | Question extraction method and electronic equipment |
CN108549723B (en) * | 2018-04-28 | 2022-04-05 | 北京神州泰岳软件股份有限公司 | Text concept classification method and device and server |
CN109145529B (en) * | 2018-09-12 | 2021-12-03 | 重庆工业职业技术学院 | Text similarity analysis method and system for copyright authentication |
CN109460730B (en) * | 2018-11-03 | 2022-06-17 | 上海犀语科技有限公司 | Analysis method and device for line and page changing of table |
CN111694948B (en) * | 2019-03-12 | 2024-05-17 | 北京京东尚科信息技术有限公司 | Text classification method and system, electronic equipment and storage medium |
CN110427626B (en) * | 2019-07-31 | 2022-12-09 | 北京明略软件***有限公司 | Keyword extraction method and device |
CN110674635B (en) * | 2019-09-27 | 2023-04-25 | 北京妙笔智能科技有限公司 | Method and device for dividing text paragraphs |
CN113673210B (en) * | 2020-05-13 | 2023-12-01 | 复旦大学 | document generation system |
CN111737719B (en) * | 2020-07-17 | 2020-11-24 | 支付宝(杭州)信息技术有限公司 | Privacy-protecting text classification method and device |
CN112256986A (en) * | 2020-10-19 | 2021-01-22 | 中国互联网金融协会 | Method and device for monitoring virtual currency website, electronic equipment and storage medium |
CN112527953B (en) * | 2020-11-20 | 2023-06-20 | 出门问问创新科技有限公司 | Rule matching method and device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102194013A (en) * | 2011-06-23 | 2011-09-21 | 上海毕佳数据有限公司 | Domain-knowledge-based short text classification method and text classification system |
EP2369505A1 (en) * | 2010-03-26 | 2011-09-28 | British Telecommunications public limited company | Text classifier system |
US8145636B1 (en) * | 2009-03-13 | 2012-03-27 | Google Inc. | Classifying text into hierarchical categories |
CN102622373A (en) * | 2011-01-31 | 2012-08-01 | 中国科学院声学研究所 | Statistic text classification system and statistic text classification method based on term frequency-inverse document frequency (TF*IDF) algorithm |
Non-Patent Citations (4)

Title |
---|
Classifying text documents by associating terms with text categories; Osmar R. Zaiane et al.; Australian Computer Science Communications; 2002-02-28; Vol. 24, No. 2; pp. 215-222 * |
Text Documents Classification by Associating Terms with Text Categories; V. Srividhya et al.; Applications of Soft Computing, AISC 58; 2009-12-31; Vol. 58; pp. 223-231 * |
Research on an automatic classification algorithm for Chinese text based on association rules; Yang Ke; China Masters' Theses Full-text Database, Information Science and Technology; 2007-12-15; Vol. 2007, No. 6; I138-43 * |
Research on text classification based on association rules; Zhao Yao; China Masters' Theses Full-text Database, Information Science and Technology; 2010-12-15; Vol. 2010, No. 12; I138-397 * |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 1200040; Country of ref document: HK |
| GR01 | Patent grant | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: GR; Ref document number: 1200040; Country of ref document: HK |