CN103049568A - Method for classifying documents in mass document library

Info

Publication number: CN103049568A
Application number: CN2012105930968A
Authority: CN
Inventors: 江潮
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: Language network (Wuhan) Information Technology Co., Ltd.
Priority date: 2012-12-31
Filing date: 2012-12-31
Publication date: 2013-04-17
Anticipated expiration: 2032-12-31
Also published as: CN103049568B

Abstract

The invention provides a method for classifying documents in a mass document library. The method includes: determining the keywords of each document in the document library and the correspondence between each keyword and the documents it belongs to; matching the keywords one by one against a term base, and taking the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in its corresponding documents; determining, according to the correspondence, the industry category attribute that occurs most often in each document; and using that most frequent industry category attribute as the category of the document. Term retrieval over the documents of a reference library thus follows a reverse-matching approach. Because the term base is a set with a character-order index structure, binary-search string matching in the term base needs at most 1 + log2(n) matching calculations, which greatly reduces the number of matches, simplifies the matching process, and improves the efficiency of document classification.

Description

Method for classifying documents in a mass document library

Technical field

The present invention relates to the computer field, and in particular to a method for classifying documents in a mass document library.

Background technology

A translation reference library (hereinafter the reference library) is a document library holding a massive number of documents that serve as auxiliary translation resources. Classifying it by industry, subject, or field with general similarity-retrieval methods requires an enormous amount of text-similarity matching calculation, and the time and space consumed are more than a system can bear.

Counting the terms in the reference library's documents with the help of a large-scale term corpus allows a preliminary division of the documents by attributes such as industry, subject, and field. The string pattern-matching calculation this requires is far smaller than the calculation needed for text-similarity matching.

A large-scale term corpus is a large collection of terms that carry term annotation information and support multiple index structures. Its scale is generally in the millions to tens of millions of terms, and large corpora can reach hundreds of millions. The annotation information this method uses is the industry, subject, and field information of each term; the index structure it uses is the character-order index.

The usual way to classify the documents of a reference library by the number of terms per industry, subject, and field is to take each term in the term base as a keyword and perform string matching in the documents, obtaining the number of terms of each industry, subject, and field in each document.

Because the documents in the reference library form an unsorted, unordered text space, classifying them this way means taking millions, tens of millions, or even hundreds of millions of terms as keywords and matching them sequentially against the massive set of reference library documents. The time consumed is enormous: if the term corpus contains n terms, the reference library contains m documents, and a document contains k words on average, the time complexity is O(m × n × k). Moreover, the whole matching process repeatedly performs string matching on identical words that occur in different documents, so much of the work is redundant.

Summary of the invention

The present invention aims to provide a method for classifying documents in a mass document library, in order to solve the problem that classifying the documents of a reference library by term matching is complicated and time-consuming.

An embodiment of the present invention provides a method for classifying documents in a mass document library, comprising:

Determining each keyword of all documents in the document library and the correspondence between each keyword and each document it belongs to;

Matching the keywords one by one against a term base, and taking the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents;

According to the correspondence, determining the industry category attribute that occurs most often in each document;

Taking that most frequent industry category attribute as the category of the document.

The invention performs term retrieval over the documents of the reference library by reverse matching: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index structure, binary-search string matching in it needs at most 1 + log2(n) matching calculations, where n is the number of terms in the term corpus. Even in a term corpus of hundreds of millions of terms, a single word requires no more than 30 matching operations. This greatly reduces the number of matches, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document collections.
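As a rough check of the figure quoted above, the bound can be evaluated for a corpus of one hundred million terms. The value n = 10^8 is an illustrative assumption standing in for "hundreds of millions"; the patent gives no exact corpus size.

```latex
\[
1 + \log_2 n \;=\; 1 + \log_2 10^{8} \;=\; 1 + 8\log_2 10 \;\approx\; 1 + 8 \times 3.32 \;\approx\; 27.6 \;<\; 30 .
\]
```

Under the same bound, the limit of 30 matching operations per word is only reached when n is about 2^29, roughly 5.4 × 10^8 terms.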

Description of drawings

The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows the flow chart of an embodiment;

Fig. 2 shows the flow chart of another embodiment.

Embodiment

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments. Referring to Fig. 1, the steps of the embodiment comprise:

S11: Determine each keyword of all documents in the document library and the correspondence between each keyword and each document it belongs to;

S12: Match the keywords one by one against the term base, and take the industry category attribute of the term matched by each keyword as the industry category attribute of that keyword in each of its corresponding documents;

S13: According to the correspondence, determine the industry category attribute that occurs most often in each document;

S14: Take that most frequent industry category attribute as the category of the document.
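Steps S11 to S14 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the toy `TERM_BASE`, the industry labels, and the function name `classify` are hypothetical, only single-word terms are handled, and a dictionary lookup stands in for the binary search over the character-order index.

```python
from collections import Counter, defaultdict

# Hypothetical toy term base: single-word term -> industry category attribute.
TERM_BASE = {"database": "IT", "server": "IT", "enzyme": "Biology"}

def classify(docs):
    """docs maps a document ID to its list of keywords (already segmented)."""
    # S11: record, for each keyword, the documents it belongs to
    # (one entry per occurrence).
    keyword_docs = defaultdict(list)
    for doc_id, words in docs.items():
        for word in words:
            keyword_docs[word].append(doc_id)

    # S12: match each distinct keyword against the term base exactly once,
    # then hand the matched term's industry attribute to every owning document.
    counts = defaultdict(Counter)
    for keyword, doc_ids in keyword_docs.items():
        industry = TERM_BASE.get(keyword)
        if industry is None:
            continue  # the keyword is not a known term
        for doc_id in doc_ids:
            counts[doc_id][industry] += 1  # S13: tally attributes per document

    # S14: the most frequent industry attribute becomes the document category.
    return {doc_id: c.most_common(1)[0][0] for doc_id, c in counts.items()}

docs = {
    "doc0010": ["database", "server", "enzyme"],
    "doc0020": ["enzyme", "enzyme", "database"],
}
result = classify(docs)
```

Each distinct keyword is looked up once regardless of how many documents contain it, which is the saving that reverse matching is meant to provide.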

The invention performs term retrieval over the documents of the reference library by reverse matching: every word in the reference library (that is, the document library) is taken as a keyword and matched against the term corpus. Because the term corpus is a set with a character-order index structure, binary-search string matching in it needs at most 1 + log2(n) matching calculations, where n is the number of terms in the term corpus. Even in a term corpus of hundreds of millions of terms, a single word requires no more than 30 matching operations. This greatly reduces the number of matches, simplifies the matching process, improves the efficiency of document classification, and achieves fast automatic classification of massive document collections.

Preferably, in an embodiment, word segmentation is performed on each document, and stop words and words without concrete meaning are removed to obtain the keywords.
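The keyword extraction just described might look like the sketch below. It assumes English text split on word characters; Chinese documents would need a real word segmenter, which the patent does not name. The stop-word list and the function name `extract_keywords` are hypothetical.

```python
import re

# Hypothetical stop-word list; a production list would be far longer.
STOP_WORDS = {"the", "a", "of", "is", "in", "and"}

def extract_keywords(text):
    """Split text into lowercase word tokens and drop stop words, keeping order."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

keywords = extract_keywords("The database software is installed in the server")
```

Document order is preserved because later steps rely on the position of each keyword.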

Preferably, the method also comprises: determining the positions at which each keyword occurs in each of its corresponding documents, where the number of recorded positions equals the word frequency of the keyword in that document.

This positional information records where a keyword occurs in each document. When the word length L of a term exceeds that of the keyword, the keywords following the recorded position can be matched against the rest of the term, to determine the industry category attribute that the keyword takes in the current document.

Preferably, the steps of the above embodiment are described in detail through the following embodiment, which comprises:

S21: Number all documents in the reference library; the document number is denoted docID.

S22: Perform word segmentation on all documents in the reference library and remove the stop words, obtaining the set of all words in the reference library. Number each word; the word number is denoted wordID. Each word is a keyword.

S23: Calculate the number of times each word occurs in each document, that is, the word frequency tf.

S24: Calculate the positions at which each word occurs in each document, that is, which word of the document it is.

This yields, for each word, a word list structure as shown in Table 1 below:

Table 1

Table 1 records the correspondence between a word and the multiple documents in which it occurs, together with its positions and word frequency in each document.

For example, Table 2 below indicates that the word "database" occurs twice in document doc0010, at positions 10 and 100, and three times in document doc0020, at positions 20, 200, and 300.

Table 2

In this way, an information table covering all words of the reference library is built.

S25: Following the order of the reference library word information table, take each word as a pattern string and perform pattern matching in the term corpus.

Because the term corpus has a character-order index, matching can use a simple binary search, and the number of matching operations required is no more than 1 + log2(n), where n is the number of terms in the term corpus. The matching process is as follows:
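The binary search bound can be checked empirically with a short sketch. The term list here is synthetic (100,000 generated strings), and `binary_search` is a hypothetical helper that counts probes; it looks up whole terms only and does not model first-word matching.

```python
import math

def binary_search(sorted_terms, keyword):
    """Return (index, probes); index is -1 when the keyword is absent."""
    lo, hi = 0, len(sorted_terms) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1  # one string comparison round per probe
        if sorted_terms[mid] == keyword:
            return mid, probes
        if sorted_terms[mid] < keyword:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes

# Synthetic term base, sorted in character order.
terms = sorted(f"term{i:06d}" for i in range(100_000))
bound = 1 + math.log2(len(terms))

# Probe counts for the first term, the last term, and two absent keywords.
samples = ("term000000", "term099999", "zzz", "aaa")
worst = max(binary_search(terms, s)[1] for s in samples)
```

In practice Python's standard `bisect` module provides the same search over a sorted list; the hand-written loop is used here only so that the probes can be counted.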

If the keyword successfully matches the first word of a term, the word length of that term, denoted L, is calculated. If L = 1, the keyword itself is the term and the match succeeds; the industry, subject, and field attribute information of the term is returned to the document to which the keyword belongs. If the keyword corresponds to multiple documents, the attribute information is returned to all of those documents.

If the word length L of the matched term is greater than 1, the positional information in each document corresponding to the current keyword is traversed one by one.

For example, the current keyword is "database" and the matched term is "database software"; the keyword matches the first word, "database", of the term. Since the word length of "database software" is L = 2 > 1, the positions 10 and 100 in document doc0010, where the keyword occurs, are traversed.

For each position traversed in the current document, the L-1 keywords that follow that position are extracted from the document in order.

The L-1 keywords extracted each time are matched against the matched term whose word length L is greater than 1.

After position 10, the next keyword, "software", is found. It is matched against the second word, "software", of the term "database software".

If the L-1 extracted keywords successfully match the term whose word length L is greater than 1, the industry category attribute of that term is taken as the industry category attribute of the current keyword in the current document.

After the match succeeds, the industry category information of the term "database software" is taken as the industry category information of the keyword "database" in document doc0010.
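The position-based check for multi-word terms can be sketched as below. It assumes 1-based word positions, as in S24, and exact word-for-word comparison; the function name `match_multiword` and the sample document are hypothetical.

```python
def match_multiword(term_words, doc_words, positions):
    """Return the positions at which the whole term occurs in the document.

    term_words: the term split into words, e.g. ["database", "software"].
    doc_words:  the document's keywords in order.
    positions:  1-based positions where term_words[0] occurs in doc_words.
    """
    length = len(term_words)  # word length L of the term
    hits = []
    for pos in positions:
        # A 1-based position pos is index pos-1, so the L-1 keywords that
        # follow it are the slice [pos : pos + L - 1].
        following = doc_words[pos : pos + length - 1]
        if following == term_words[1:]:
            hits.append(pos)
    return hits

doc = ["install", "database", "software", "then", "database", "backup"]
hits = match_multiword(["database", "software"], doc, positions=[2, 5])
```

When L = 1 the slice is empty and every position matches, which agrees with the single-word case described above.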

S26: Match all keywords in the reference library word information table in order.

S27: Count the number of terms of each industry, subject, and field in each document. From these counts, determine the industry category attribute that occurs most often in the document and, according to this attribute, assign the document to the corresponding industry, subject, or field.

Preferably, the recorded word frequency is used when determining the most frequent industry category attribute of each document: the contribution of each matched keyword is multiplied by its word frequency in the document. For example, if the term matched by keyword B of document A belongs to industry C, and the word frequency of keyword B in document A is 5, then the count of industry category attribute C in document A is 5.
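The word-frequency weighting in S27 can be illustrated with the example just given. The sketch assumes each matched keyword contributes its word frequency once to the count of its industry; the function name `classify_document` and the extra keywords D and E are hypothetical.

```python
from collections import Counter

def classify_document(keyword_tf, keyword_industry):
    """keyword_tf: keyword -> word frequency in this document.
    keyword_industry: keyword -> industry attribute of its matched term.
    Returns (category or None, per-industry counts)."""
    counts = Counter()
    for keyword, tf in keyword_tf.items():
        industry = keyword_industry.get(keyword)
        if industry is not None:
            counts[industry] += tf  # weight the match by word frequency
    if not counts:
        return None, counts  # no keyword matched any term
    return counts.most_common(1)[0][0], counts

# Document A: keyword B (tf 5) matches a term of industry C,
# keyword D (tf 2) matches a term of industry F, keyword E matches nothing.
category, counts = classify_document(
    {"B": 5, "D": 2, "E": 1},
    {"B": "C", "D": "F"},
)
```

The patent does not say how ties between industries are resolved; `most_common` here simply returns the first of the tied entries.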

Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not restricted to any specific combination of hardware and software.

The above are only preferred embodiments of the present invention and do not limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for classifying documents in a mass document library, characterized by comprising:

2. The method according to claim 1, characterized in that word segmentation is performed on each document, and stop words and words without concrete meaning are removed to obtain the keywords.

3. The method according to claim 1, characterized by further comprising:

Determining the positions at which each keyword occurs in each of its corresponding documents, where the number of recorded positions equals the word frequency of the keyword in that document.

4. The method according to claim 3, characterized in that the matching process comprises:

If the word length L of the matched term equals 1, determining that the match succeeds.

5. The method according to claim 4, characterized in that the matching process further comprises:

If the word length L of the matched term is greater than 1, traversing one by one the positional information in each document corresponding to the current keyword;

Matching the L-1 keywords extracted each time against the last L-1 words of the matched term whose word length L is greater than 1.

6. The method according to claim 5, characterized in that the industry category attribute of each keyword in each of its corresponding documents is determined;

If the L-1 extracted keywords successfully match the term whose word length L is greater than 1, the industry category attribute of that term is taken as the industry category attribute of the current keyword in the current document.

7. The method according to claim 4, characterized in that the current keyword is looked up in the term base by binary search.