CN102654865A

CN102654865A - Method and system for digital object classification

Info

Publication number: CN102654865A
Application number: CN2011100498066A
Authority: CN
Inventors: 朱鹏翔
Original assignee: Individual
Current assignee: Individual
Priority date: 2011-03-02
Filing date: 2011-03-02
Publication date: 2012-09-05

Abstract

The invention provides a method and a system for digital object classification. The method comprises the following steps: obtaining a clustering method for digital objects; generating coarse classification methods for the clustered sets, including a method for estimating classification parameters, so as to form a primary classifier; and using the clustering result to adjust the parameters of the primary classifier and, in combination with a logical reasoning method, determining a final classifier. In one embodiment, the parameters are determined from the primary classification result by a likelihood estimation method and are then corrected by a posterior estimation method of probabilistic reasoning to determine the final classifier. This effectively avoids the influence of interfering information and remedies the ambiguity caused by uncertain semantic information in digital knowledge objects. The classification method and system provided by the invention can improve the accuracy and the extensibility of the classification of digital knowledge objects.

Description

A digital object classification method and system

Technical field

The invention belongs to the field of knowledge management. It relates generally to the classification, organization, retrieval and mining of knowledge. In particular, it uses computer technology to automatically classify and organize computer-readable knowledge represented as digital objects, and to provide the numerical features needed to retrieve and mine the organized results.

Background art

At present, the amount of knowledge available as computer-processable digital objects is growing rapidly, which makes it hard for people to understand and use this large volume of information effectively. Helping users organize this knowledge and find the key knowledge they need in an efficient way is a challenging task, and it is also the core goal of the knowledge management field.

Statistical relational learning has become a research focus in the knowledge management field. It has received considerable attention in bioinformatics, systems biology, Internet search, social networks, the acquisition and use of likelihood models, geographic information systems, and natural language understanding. It integrates relational/logical representation, probabilistic inference (uncertainty handling), machine learning and data mining, and is a knowledge management approach whose purpose is to obtain likelihood models from data. In statistical relational learning, "statistical" refers to representation and inference based on probability theory; "relational" refers to first-order logic representation and relational representation; and "learning" is equivalent to data mining, meaning that a statistical relational model is learned from data. Current statistical relational learning methods are mainly based on Bayesian networks, (hidden) Markov models, stochastic grammars, and Markov networks.

The present invention studies and applies statistical relational learning to knowledge acquisition, classification and organization, mining, and feature labeling in information management. The prior art contains many research results in these areas, which fall roughly into three types: supervised, semi-supervised and unsupervised. Each has certain defects. Supervised methods need a large training data set to estimate the parameters of the statistical relations; in practice, and particularly in some fixed industry applications, such data are hard to obtain, so applicability is poor. Semi-supervised methods are affected by the feature distribution of local data, which biases the global parameter estimates; although likelihood estimation has been studied as an improvement, the effect is still not significant when a computer handles the process automatically. Unsupervised methods need a strictly predefined prior list, such as a keyword list, and have poor extensibility. A new method for classifying, organizing and managing digital object knowledge is therefore needed, one that improves how the statistical classification relations of digital objects are learned and generated in information management, so as to achieve a computer-processable, efficient and extensible information management process.

Summary of the invention

The present invention is made in view of the above problems.

The present invention proposes a digital object classification method and system that automatically classifies and organizes computer-processable digital object knowledge. By applying statistical relational learning to the classification features of the digital objects, it improves the applicability and extensibility of the classification process.

The present invention is broadly divided into the following steps: 1) preprocessing; 2) obtaining the feature vector space of the digital objects; 3) obtaining the initial training set; 4) iterative classifier learning; 5) establishing the final classifier.

First, according to the needs of information management, preprocessing cleans the original knowledge set by removing non-knowledge objects and objects outside the specific industry under study, so that non-knowledge information that would cause interference later is eliminated.

Second, the knowledge is converted into digital objects according to the particular needs of the industry under study and the processing capability of the computer system.

Third, in initial training set generation, the initial training set is formed with the support of explicit prior knowledge, based on semantic analysis of the class names. In practical applications, a description-based method is designed to build the classifier, in which each class has a semantically related feature set whose degree of relevance embodies the statistical correlation parameters. Based on this initial classifier, an initial training set containing positive and negative samples is created for the subsequent iterative classifier learning.
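The seeding idea in this step can be illustrated with a minimal sketch. It assumes that each class name has already been expanded into a weighted keyword set (the expansion itself, and the names `seed_training_set`, `class_keywords` and `top_k`, are hypothetical and not taken from the source), and it treats objects that match no keyword of a class as negative samples for that class.

```python
def seed_training_set(objects, class_keywords, top_k=2):
    """Label the top_k best-matching objects per class as positive samples.

    objects: dict mapping object id -> text.
    class_keywords: dict mapping class name -> {keyword: relevance weight}.
    Objects with a zero score for a class are used as its negative samples.
    """
    training = {}
    for cls, keywords in class_keywords.items():
        scores = {}
        for oid, text in objects.items():
            tokens = text.lower().split()
            scores[oid] = sum(w * tokens.count(k) for k, w in keywords.items())
        # Highest score first; ties are broken by object id for determinism.
        ranked = sorted(scores, key=lambda o: (-scores[o], o))
        positives = [o for o in ranked[:top_k] if scores[o] > 0]
        negatives = [o for o in ranked if scores[o] == 0]
        training[cls] = {"positive": positives, "negative": negatives}
    return training


objects = {
    "d1": "gene protein sequence",
    "d2": "stock market price",
    "d3": "protein folding",
    "d4": "weather report",
}
class_keywords = {
    "biology": {"gene": 1.0, "protein": 0.8},
    "finance": {"market": 1.0, "price": 0.5},
}
seeds = seed_training_set(objects, class_keywords)
```

The keyword weights stand in for the "degree of relevance" mentioned above; how they would be derived from a knowledge source is left open here.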

Fourth, in the iterative classifier learning stage, each iteration uses the classification results of the previous iteration's classifier to build the training set for the current iteration. A new classifier is then constructed from the updated training set. Finally, the new classifier replaces the previous iteration's classifier and is used to classify the remaining digital objects. After all digital objects have been classified, the iteration stops when the resulting classifiers converge or another termination condition is satisfied.

Fifth, in the final classifier establishment stage, the classifier that best matches the previously obtained clustering result is selected, from all the classifiers obtained when iterative learning stops, as the final classifier. Because the invention assumes that no initial training data exist, classifier selection mainly relies on maximum pseudo-likelihood estimation, corrected using first-order logic relations.
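As a rough illustration of "best matches the clustering result", the sketch below scores each candidate classifier by how often it and the clustering agree on whether two objects belong together (the Rand index). This is a deliberately simpler substitute for the pseudo-likelihood and first-order-logic selection described above, and the names `rand_agreement`, `select_final` and `candidates` are hypothetical.

```python
from itertools import combinations


def rand_agreement(labels, clusters):
    """Fraction of object pairs on which classification and clustering agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum(
        (labels[i] == labels[j]) == (clusters[i] == clusters[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def select_final(candidates, clusters):
    """Return the name of the candidate whose labels best match the clusters."""
    return max(candidates, key=lambda name: rand_agreement(candidates[name], clusters))


clusters = [0, 0, 1, 1, 2]
candidates = {
    "iter1": ["a", "b", "b", "a", "c"],
    "iter2": ["a", "a", "b", "b", "b"],
    "iter3": ["a", "a", "b", "b", "c"],
}
best = select_final(candidates, clusters)
```

Here `iter3` reproduces the cluster structure exactly, so it is the one selected.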

In the present invention, an alignment analysis between the clustering result and the classification result of the digital objects is performed and integrated into both training set construction and iterative classifier learning. The bias and ambiguity that may arise from the class names and their semantic analysis are thereby controlled, and the accuracy of the generated training data and of the final classification result is improved.

Furthermore, the method adopted by the invention does not need initial training data or a fixed, pre-agreed keyword list for classification. Instead, it builds the initial training set by semantically analyzing the class names with the support of an existing knowledge source. Because existing external knowledge sources can cover multiple domains, the method of the invention can still be applied easily to different domain sets when the domain set changes, which reduces additional manual intervention and raises the degree of automated computer processing.

In addition, the final classifier establishment mechanism provided by the invention can reduce the excessive classifier bias caused by noisy data in iterative classifier learning, thereby improving the accuracy of the final classification.

The specific features and advantages of the invention will be apparent from the following description of embodiments. The invention is not limited to this description or to the specific implementations in the following embodiments.

Description of drawings

Fig. 1 is an overall block diagram of the digital object classification system S100;

Fig. 2 is a flowchart of the operation of the digital object classification system S100 shown in Fig. 1;

Fig. 3 is a structural block diagram of an example of the adjustment generation device S103 in the classification system shown in Fig. 1;

Fig. 4 is a structural block diagram of the rough classification device S102 in the classification system shown in Fig. 1;

Fig. 5 is a flowchart of the iterative classifier learning performed by the adjustment generation device S103 in the classification system shown in Fig. 1, according to an embodiment of the invention;

Fig. 6 is a schematic block diagram of a computer system for implementing the invention.

Detailed description of the embodiments

The classifier generation method and system proposed by the invention can be applied to knowledge acquisition and filtering, knowledge classification and organization, knowledge search, data mining and the like in general knowledge management processes.

Fig. 1 shows the overall block diagram of the classification system S100. As shown, the set of digital objects from the knowledge base S105 is clustered in advance into multiple groups by the clustering device S107, and the clustering result is stored in the cluster result library S104. The clustering result for the document set stored in the cluster result library S104 is used in the actual knowledge management application. Clustering methods are common knowledge in the art and are not the focus of the invention, so they are not described in detail. The classifier system according to the embodiment of the invention shown in Fig. 1 comprises an acquisition device S101, a rough classification device S102 and an adjustment generation device S103.

Fig. 2 is a flowchart of the operation of the classification system S100 of Fig. 1.

First, at step 201, the data to be processed are preprocessed: original content irrelevant to the application is filtered out and cleaned.
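A minimal sketch of such a cleaning pass is given below. The two filtering criteria (a minimum token count and at least one term from a domain vocabulary) are assumptions chosen for illustration; the source does not specify how irrelevant content is detected, and `preprocess`, `domain_terms` and `min_tokens` are hypothetical names.

```python
def preprocess(raw_objects, domain_terms, min_tokens=3):
    """Drop objects that are too short or share no term with the target domain."""
    kept = []
    for text in raw_objects:
        tokens = text.lower().split()
        # Keep only objects long enough to carry knowledge and on-topic.
        if len(tokens) >= min_tokens and domain_terms & set(tokens):
            kept.append(" ".join(tokens))
    return kept


raw = [
    "Gene expression in yeast cells",
    "Click here",
    "Weather report for Monday morning",
    "Protein folding and gene regulation",
]
cleaned = preprocess(raw, domain_terms={"gene", "protein"})
```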

Second, at step 202, the cleaned original digital objects are standardized by vectorization, producing a computer-processable digital object representation suitable for the application.
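One common way to vectorize text-like objects is a normalized term-frequency representation, sketched below. The source does not say which vectorization is used, so the bag-of-words choice and the names `build_vocabulary` and `vectorize` are assumptions.

```python
from collections import Counter
import math


def build_vocabulary(docs):
    """Collect the sorted set of tokens seen in the cleaned objects."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})


def vectorize(doc, vocab):
    """Map one object to an L2-normalized term-frequency vector."""
    counts = Counter(doc.lower().split())
    vec = [float(counts[t]) for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


docs = ["gene protein gene", "market price", "protein folding"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
```

Normalizing each vector to unit length keeps long and short objects comparable when distances or similarities are computed later.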

Third, the standardized digital objects are first roughly classified by the rough classification device S102 to obtain a rough classification result, as shown in step 203. For example, the supervised, semi-supervised or unsupervised classification techniques that are common knowledge in the art, as described in the Background section of this specification, can all be used for rough classification. In certain embodiments, an externally input training set may be used, or a training set may be generated automatically by consulting semantic information about the class names from an external knowledge source, so as to achieve an adaptive effect.

Meanwhile, at step 204, the acquisition device S101 obtains the pre-stored clustering result for this set from the cluster result library S104. At this point, both the rough classification result from the rough classification device S102 and the clustering result from the acquisition device S101 are provided to the adjustment generation device S103.

At step 205, the clustering result is used to adjust the rough classification result from the rough classification device, thereby generating the final classifier S106.

At step 206, the set obtained at step 202 is provided to the generated final classifier S106, which assigns each object in the set to a class and stores the classification result in the document classification result library S108. The process then ends.

Fig. 3 is a block diagram of the adjustment generation device of the classification system. It comprises a probability calculation unit S301 and an alignment unit S302.

First, the probability calculation unit S301 calculates the prior probability corresponding to the rough classification result. As previously mentioned, the computation of the prior probability can be converted into estimating the weights w_i (i = 1, ..., m) of the classification formulas in the rough classifier. The parameter learning task is therefore to estimate the weights of all formulas in the knowledge base. A raw data object library is a vector x = (x_1, ..., x_l, ..., x_n). Given a data object library, the classifier weights can in principle be learned by maximum likelihood estimation. That is, the parameters w_i are treated as fixed values, all data are assumed to satisfy them, and the parameter values are taken to be the w_i (i = 1, ..., m) that maximize the likelihood P_w(X = x).

$$\frac{\partial}{\partial w_i} \log P_w(X = x) = n_i(x) - \sum_{x'} P_w(X = x') \, n_i(x') \tag{1}$$

In the conventional method, n_i(x) and n_i(x') can be computed from the data object library, but this is computationally inefficient. The maximum pseudo-likelihood estimate is therefore used instead, that is:

$$\frac{\partial}{\partial w_i} \log P_w(X = x) = \sum_{l=1}^{n} \left[ n_i(x) - P_w(X_l = 0 \mid MB_x(X_l)) \times n_i(X_{l=0}) - P_w(X_l = 1 \mid MB_x(X_l)) \times n_i(X_{l=1}) \right] \tag{2}$$

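Here P_w(X = x) is the pseudo-likelihood, and MB_x(X_l) denotes the Markov blanket of X_l; n_i(X_{l=0}) and n_i(X_{l=1}) are the counts n_i obtained when X_l is set to 0 or 1 with the rest of the data unchanged. The parameter learning problem is thus converted into a nonlinear optimization problem.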
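Equations (1) and (2) can be checked numerically on a toy model. The sketch below assumes a log-linear form in which P_w(X = x) is proportional to exp(sum_i w_i n_i(x)), which the text does not state explicitly, and uses two made-up counting features over binary variables. The names `features`, `score`, `log_likelihood_gradient` and `pseudo_likelihood_gradient` are hypothetical.

```python
import math
from itertools import product


def features(x):
    """Hypothetical counts n_i(x): number of ones, number of equal neighbours."""
    return [sum(x), sum(1 for a, b in zip(x, x[1:]) if a == b)]


def score(x, w):
    """Unnormalized probability exp(sum_i w_i * n_i(x)) (assumed model form)."""
    return math.exp(sum(wi * ni for wi, ni in zip(w, features(x))))


def log_likelihood_gradient(x, w):
    """Equation (1): observed counts minus their expectation over all states."""
    states = list(product([0, 1], repeat=len(x)))
    z = sum(score(s, w) for s in states)
    observed = features(x)
    return [
        observed[i] - sum(score(s, w) / z * features(s)[i] for s in states)
        for i in range(len(w))
    ]


def pseudo_likelihood_gradient(x, w):
    """Equation (2): for each X_l, only its two possible values are compared."""
    observed = features(x)
    grad = [0.0] * len(w)
    for l in range(len(x)):
        # x with X_l forced to 0 and to 1; x must be a tuple.
        variants = [x[:l] + (v,) + x[l + 1:] for v in (0, 1)]
        z = sum(score(s, w) for s in variants)
        for i in range(len(w)):
            expected = sum(score(s, w) / z * features(s)[i] for s in variants)
            grad[i] += observed[i] - expected
    return grad


x = (1, 1, 0)
w = [0.0, 0.0]
exact = log_likelihood_gradient(x, w)
pseudo = pseudo_likelihood_gradient(x, w)
```

The exact gradient enumerates all 2^n joint states, while the pseudo-likelihood gradient evaluates only 2n of them, which is the efficiency gain the text refers to. Either gradient could be handed to a generic optimizer to solve the resulting nonlinear problem.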

The alignment unit S302 computes the alignment model. In general, once a clustering result has been formed, the alignment result can be expressed as a posterior probability:

$$P'_w(X = x') = \frac{P_w(X = x) \, P(X = x')}{P(x')} \tag{3}$$

Here the prior probability P_w(X = x) comes from the rough classification result. The final alignment model can therefore be expressed as:

$$P'_w(X = x') = \frac{P_w(X = x) \sum_{x = x'} P(x \mid C)}{\sum_{C} P(X = x) \sum_{C} \left( \frac{\sum P(x_n \mid C)}{\sum P(x'_n \mid C)} \right)} \tag{4}$$

Here C is the set of clusters formed by clustering the digital object library.

According to the probability model of formula (4), the final classifier is obtained by adjusting with the clustering result. Compared with the rough classifier, this final classifier has higher classification accuracy because it has gone through the alignment process, and introducing the clustering result keeps the classification bias under control.
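The effect of aligning a rough classification with a clustering can be illustrated with a simplified Bayes-style update. The sketch below is not formula (4); it is a stand-in that multiplies each object's class prior from the rough classifier by the smoothed class mix of the object's cluster and renormalizes. The names `align_with_clusters`, `priors` and `smoothing` are hypothetical.

```python
from collections import Counter


def align_with_clusters(priors, clusters, smoothing=1.0):
    """Rescale each object's class prior by the class mix of its cluster.

    priors: list of dicts mapping class -> prior probability (rough classifier).
    clusters: list of cluster ids, one per object.
    """
    classes = sorted(priors[0])
    # Rough label of each object: its most probable class under the prior.
    rough = [max(classes, key=lambda c: p[c]) for p in priors]
    members = {}
    for cid, label in zip(clusters, rough):
        members.setdefault(cid, Counter())[label] += 1
    posteriors = []
    for p, cid in zip(priors, clusters):
        counts = members[cid]
        total = sum(counts.values()) + smoothing * len(classes)
        unnorm = {c: p[c] * (counts[c] + smoothing) / total for c in classes}
        z = sum(unnorm.values())
        posteriors.append({c: v / z for c, v in unnorm.items()})
    return posteriors


priors = [
    {"a": 0.9, "b": 0.1},
    {"a": 0.8, "b": 0.2},
    {"a": 0.45, "b": 0.55},
    {"a": 0.2, "b": 0.8},
]
clusters = [0, 0, 0, 1]
posteriors = align_with_clusters(priors, clusters)
```

In this example the third object is weakly labeled "b" by the rough classifier but sits in a cluster dominated by "a", so the adjusted posterior favours "a"; this is the kind of bias correction the alignment step is meant to provide.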

Fig. 4 is a schematic diagram of the rough classification device automatically generating a training set from an external knowledge source. It comprises a training set generation unit S401 and a learning unit S402. The training set generation unit S401 generates the training set by automatically and randomly sampling the input data and screening the samples against the available external knowledge source. The automatically generated training set is then provided to the learning unit S402, which learns the classifier and completes the classifier's parameter estimation.
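The sample-then-screen behaviour of unit S401 might look like the following sketch. It assumes the external knowledge source can be reduced to a term-to-class lookup and that an object is kept only when all of its known terms point to a single class; both assumptions, and the names `generate_training_set` and `knowledge`, are hypothetical.

```python
import random


def generate_training_set(objects, knowledge, sample_size, seed=0):
    """Randomly sample objects, then keep those the knowledge source can label.

    objects: dict mapping object id -> text.
    knowledge: dict mapping term -> class name (stand-in for a knowledge source).
    """
    rng = random.Random(seed)
    sampled = rng.sample(sorted(objects), min(sample_size, len(objects)))
    training = {}
    for oid in sampled:
        terms = objects[oid].lower().split()
        classes = {knowledge[t] for t in terms if t in knowledge}
        # Screening: discard unknown or ambiguous objects.
        if len(classes) == 1:
            training[oid] = classes.pop()
    return training


objects = {
    "d1": "gene protein",
    "d2": "market price",
    "d3": "gene market",
    "d4": "weather",
}
knowledge = {
    "gene": "biology",
    "protein": "biology",
    "market": "finance",
    "price": "finance",
}
training = generate_training_set(objects, knowledge, sample_size=4)
```

The resulting labeled set would then be passed to the learning unit S402 for parameter estimation.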

Fig. 5 is a flowchart of the iterative classifier learning performed by the adjustment generation device S103 of the classification system. The workflow is as follows:

First, at step 501, the training set produced during rough classification result generation is used as the initial training set. In each iteration, at step 502, a known classifier learning method is used to generate an intermediate classifier from the training set. At step 503, the new classifier is used to classify the documents in the document library S105 to obtain a new intermediate classification result. At step 504, it is determined whether the iteration termination condition, which is set by the user, is satisfied. If it is not satisfied, the process proceeds to step 505, where the intermediate classification result of the current iteration is used to generate a new training set for the next iteration. If it is satisfied, the process proceeds to step 506, where the series of intermediate classifiers produced during the iterative process is retained. Then, at step 507, the classifier with the lowest alignment cost is selected from that series as the final classifier, and the iterative process ends.
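Steps 501 to 506 can be sketched as a self-training loop. The example below uses a nearest-centroid classifier as the "known classifier learning method" and stops when the labels no longer change or an iteration cap is reached; both choices are assumptions, since the source leaves the learner and the termination condition open. The names `train_centroids`, `classify` and `iterative_learning` are hypothetical.

```python
def train_centroids(vectors, training):
    """Step 502: learn one centroid per class from the current training set."""
    centroids = {}
    for cls in sorted(set(training.values())):
        rows = [vectors[i] for i, c in training.items() if c == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def classify(vectors, centroids):
    """Step 503: assign every object to its nearest centroid."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return [
        min(sorted(centroids), key=lambda c: dist(v, centroids[c]))
        for v in vectors
    ]


def iterative_learning(vectors, initial_training, max_iters=10):
    """Steps 501-506: retrain until the labels stop changing."""
    training = dict(initial_training)                    # step 501
    history, previous = [], None
    for _ in range(max_iters):
        centroids = train_centroids(vectors, training)   # step 502
        labels = classify(vectors, centroids)            # step 503
        history.append((centroids, labels))              # retained for step 506
        if labels == previous:                           # step 504
            break
        training = dict(enumerate(labels))               # step 505
        previous = labels
    return history


vectors = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
history = iterative_learning(vectors, {0: "low", 3: "high"})
final_labels = history[-1][1]
```

The returned `history` holds every intermediate classifier, so a step 507 selection rule of the caller's choosing can be applied to it afterwards.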

Fig. 6 is a schematic block diagram of a computer system for implementing the invention. It comprises: an application server S601, which handles the predefined formula computations and runs the overall system application service; a user interface S602, which connects to external knowledge bases and storage systems; a preprocessing middleware module S603, which preprocesses external data; a data object standardization middleware module S604, which vectorizes the preprocessed data objects for subsequent computation; a cluster analysis middleware module S605, which performs the clustering computation; and an automatic classification middleware module S606, which performs the iterative computation of automatic classification.

The document classification method and system according to the embodiments of the invention have been described above, with emphasis on automatic classifier generation. As the description shows, the invention has the following effects: by using the clustering result and multiple iterations to improve the classification of digital objects, possible errors are reduced and the accuracy of the final classification result is ensured; in addition, an externally input training data set is not a necessary condition, since the system can automatically generate a training set from an external knowledge source and keep optimizing it through the iterative process, which broadens the applicability of the system.

The above are merely embodiments of the invention, and the invention may also be implemented in other specific forms. Any modification, equivalent replacement and the like made within the spirit and principle of the invention shall fall within the scope of the invention.

Claims

1. A digital object classifier generation method, comprising:

obtaining a clustering method for digital objects;

generating a rough classification method for the clustered result, and forming an initial classifier; and

adjusting the parameters of the initial classifier with the clustering result, and forming a final classifier.

2. The method as claimed in claim 1, wherein the parameter adjustment step comprises:

calculating a parameter estimate of the initial classifier corresponding to said rough classification result;

correcting the parameters of the initial classifier using the clustering result and a maximum pseudo-likelihood estimation method, to generate a posterior probability corresponding to the result; and generating said final classifier according to said posterior probability.

3. The method as claimed in claim 2, wherein in the maximum pseudo-likelihood estimation method, a maximum pseudo-likelihood estimator is used in place of the ordinary maximum likelihood estimator, and the parameter values are corrected in combination with a first-order logic predicate method.

4. The method as claimed in claim 2, wherein said parameter estimate is obtained using a training set, and the training set is generated automatically by the following process:

obtaining the class names of the classes related to said object set;

generating related key values based on said class names;

classifying said object set using said key values to obtain an intermediate classification result; and obtaining said training set from said intermediate classification result.

5. The method as claimed in claim 4, wherein the step of generating said key values further comprises:

reclassifying the obtained class names with reference to an external knowledge source; and generating said key values based on the reclassified class names.

6. The method as claimed in claim 4, wherein said key values are representative descriptions, and said step of obtaining the intermediate classification result comprises:

searching said object set using said representative descriptions as query terms; and labeling the objects in the hit list returned as the search result with the corresponding class.

7. The method as claimed in claim 6, wherein a predetermined number of objects at the top of said hit list are labeled with the corresponding class. Said step of obtaining said training set from the intermediate classification result comprises:

adjusting said intermediate classification result with said clustering result to generate an intermediate classifier; and selecting said training set from the adjusted classification result corresponding to said intermediate classifier.

8. The method as claimed in claim 7, wherein in the step of adjusting said initial classification result with said clustering result to generate the final classifier, iterative classifier learning is performed with said training set as the initial training set, thereby learning a group of intermediate classifiers, and the optimal classifier is selected from said group of intermediate classifiers as said final classifier.

9. A final classifier system, comprising:

an acquisition device for obtaining the clustering result of an object set;

a rough classification device for generating a rough classification result of said object set to obtain a rough classifier; and an adjustment device for adjusting said rough classification result with said clustering result to generate a final classifier.

10. The system as claimed in claim 13, wherein said adjustment device comprises:

a prior probability calculation unit for calculating the prior probability corresponding to said rough classification result; and

an alignment unit that uses a maximum pseudo-likelihood estimation method and a first-order logic predicate method to align said rough classification result with said clustering result, so as to generate a posterior probability corresponding to said alignment result, and that generates said final classifier according to said posterior probability.