CN102411636A

CN102411636A - Cross-language text classifying method aiming at topic drift problem

Info

Publication number: CN102411636A
Application number: CN2011104532367A
Authority: CN
Inventors: 戴林; 孙守成
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2011-12-30
Filing date: 2011-12-30
Publication date: 2012-04-11

Abstract

The invention relates to a cross-language text classification method that addresses the topic drift problem. Its purpose is to assign a C-language document to be classified to a class of a target language E. The method comprises the following steps: 1, training a C-language text classifier; 2, training an E-language text classifier; 3, computing a correlation matrix between the C-language classes and the E-language classes; 4, translating the C-language document to be classified into language E by machine translation and calculating the probability that the translated document belongs to each E-language class; 5, correcting the result of step 4 using the class correlation matrix; and 6, assigning the document to the E-language class with the highest probability. The method corrects the classification result using class correlation, agrees with intuition, is readily interpretable, and solves the topic drift problem in cross-language text classification.

Description

A cross-language text classification method addressing the topic drift problem

Technical field

The present invention relates to a text classification method, in particular to a cross-language text classification method addressing the topic drift problem, and belongs to the technical field of information retrieval.

Background art

The rapid development of the Internet has produced massive amounts of text. Online information is written in many languages, and users sometimes want documents in different languages to be classified under a single scheme. Cross-language text classification arose to meet this need.

Because countries differ in economy, politics and culture, people in different countries pay attention to different things, so the content of web pages written in different languages also differs. In cross-language text classification this shows up as the topic drift problem: for documents of the same class in different languages, the features obtained by feature extraction are not the same. For example, the golfer Tiger Woods is very popular in the United States and often appears on English web pages in the "sports" class, whereas the stars who appear most often on Chinese "sports" pages are Liu Xiang and Yao Ming. During feature extraction, Tiger Woods may therefore appear among the features that characterize the English sports class, while Liu Xiang and Yao Ming appear among the features that characterize the Chinese sports class.

The topic drift problem makes cross-language text classification harder, and conventional classification methods have generally ignored it.

Summary of the invention

The object of the invention is to address this shortcoming of the prior art by taking the topic drift problem into account in cross-language text classification, so that the classification results are more accurate and reasonable.

The idea of the invention is a solution to topic drift based on class correlation. Class correlation measures how related two classes are: the larger its value, the more related the two classes. This correlation is used to correct the result obtained by a monolingual classifier and so improve classification performance.

The object of the invention is achieved by the following technical solution:

A cross-language text classification method addressing the topic drift problem, whose purpose is to assign a C-language document to be classified to a class of a target language E, comprising the following steps:

Step 1, training a C-language text classifier;

Step 2, training an E-language text classifier;

Step 3, computing the correlation matrix between the C-language classes and the E-language classes, the correlation matrix being expressed as $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ represents the correlation between C-language class $CC_i$ and E-language class $CE_j$, and m and n are the numbers of C-language classes and E-language classes respectively;

Step 4, translating the C-language document to be classified into language E by machine translation, and calculating the probability that the translated document belongs to each E-language class;

Step 5, correcting the result of step 4 using the class correlation matrix;

Step 6, assigning the document to be classified to the E-language class with the highest probability.

Beneficial effects

The method provided by the invention uses class correlation to correct the classification result. It agrees with intuition, is readily interpretable, and solves the topic drift problem in cross-language text classification.

Brief description of the drawings

Fig. 1 is a schematic diagram of the basic principle of the invention.

Detailed description of the embodiments

The preferred embodiment of the invention is described in detail below with reference to the accompanying drawing, to ensure a thorough understanding of this example of the invention.

We assume that the C-language classes are $CC_1, CC_2, \ldots, CC_m$ and the E-language classes are $CE_1, CE_2, \ldots, CE_n$. Depending on the need, C-language documents may be assigned to E-language classes, or E-language documents to C-language classes. Since the method is the same in both cases, we discuss only how to classify a C-language document into the E-language classes.

As shown in Fig. 1, the correlation between class $CC_i$ and class $CE_j$ is quantified by $a_{ij}$. The document D to be classified is assigned to class $CC_i$ with probability $p(CC_i \mid D)$, and its translation D' is assigned to class $CE_j$ with probability $p(CE_j \mid D')$. Our task is to express the probability that document D belongs to class $CE_j$ as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$ and $a_{ij}$.

The specific classification steps are:

Step 1, training the C-language classifier. This step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier. Commonly used classification algorithms include naive Bayes, k-nearest neighbors (kNN) and support vector machines (SVM). In the invention, training of the C-language classifier is not limited to any particular classification algorithm; all of the above are applicable.

Step 2, training the E-language classifier. As in the previous step, this step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier. Training of the E-language classifier is likewise not limited to any particular classification algorithm.
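As an illustration of what steps 1 and 2 must deliver, namely a classifier that returns a posterior probability for every class, the following is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing. The patent does not prescribe this algorithm or any implementation; the function names `train_naive_bayes` and `posterior`, the toy training data and the token strings are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Train a multinomial naive Bayes model.

    labeled_docs: list of (tokens, class_label) pairs.
    Returns (class document counts, per-class token counts, vocabulary).
    """
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, token_counts, vocabulary


def posterior(model, tokens):
    """Return {class_label: p(class | document)} using add-one smoothing."""
    class_counts, token_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_tokens = sum(token_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            if tok not in vocabulary:
                continue  # ignore tokens never seen in training
            score += math.log(
                (token_counts[label][tok] + 1) / (total_tokens + len(vocabulary))
            )
        log_scores[label] = score
    # Normalize in log space to avoid numerical underflow.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


# Toy C-language training set (tokens written in English for readability).
c_model = train_naive_bayes([
    (["liu_xiang", "hurdles", "race"], "sports"),
    (["yao_ming", "basketball", "game"], "sports"),
    (["bank", "interest", "rate"], "finance"),
    (["stock", "market", "bank"], "finance"),
])

# Posterior over the C-language classes for a new document.
alpha = posterior(c_model, ["yao_ming", "race"])
```

The same two functions could be applied to an E-language corpus to obtain the E-language classifier; any other algorithm that outputs class probabilities would serve equally well.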

Step 3, computing the correlation matrix between the C-language classes and the E-language classes. The correlation matrix is expressed as $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ represents the correlation between Chinese class $CC_i$ and English class $CE_j$.

The correlation matrix can be obtained in several ways, for example:

1) Manually annotated binary matrix

For each class in language C, the correlation between it and each class of language E is annotated manually. The simplest annotation scheme makes the matrix binary: 1 for related, 0 for unrelated. This method is simple, but manual annotation of the correlation between classes is strongly influenced by subjective factors.

2) Maximum likelihood estimation

A background document collection in language C is annotated so that each document is labeled with both a C-language class and an E-language class. The annotated collection has the form:

$χ = \{x^{t}, r_{c}^{t}, r_{a}^{t}\}_{t = 1}^{N}$

where $x^{t}$ is the feature vector extracted from the training set;

$r_{c}^{t}$ is an m-dimensional vector giving the training document's label with respect to the C-language classes: if a document belongs to C-language class $CC_i$, the i-th component of its $r_{c}^{t}$ is 1 and all other components are 0. $r_{a}^{t}$ is an n-dimensional vector giving the training document's label with respect to the E-language classes: if a document belongs to E-language class $CE_j$, the j-th component of its $r_{a}^{t}$ is 1 and all other components are 0.

Let M be the number of documents in the annotated collection χ that are labeled with C-language class $CC_i$, and let M' be the number of those M documents that are also labeled with E-language class $CE_j$; then $a_{ij} = M' / M$.

The advantage of this method is that the correlation matrix is computed more accurately; its drawback is the large workload, since a large background document collection must be annotated manually.

3) Cluster-based annotation

Given a background corpus in language C, a clustering algorithm (such as k-means) is used to cluster it, with the cluster granularity chosen to ensure the purity of the resulting clusters. The correlation of each cluster in the result to the E-language classes is then annotated manually. This correlation is binary (1 for related, 0 for unrelated), which yields the correlation matrix A. The advantage of this approach is that the correlation matrix is computed more accurately; its drawback is that the manual annotation workload is large.
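Of the three options, the maximum likelihood variant is the most mechanical, so a small sketch may help. It assumes that the estimate is the fraction M'/M, which is the reading that the definitions of M and M' suggest, and that each background document carries exactly one C-language label and one E-language label. The function name `estimate_correlation_matrix`, the class names and the toy labels are hypothetical.

```python
def estimate_correlation_matrix(labeled_docs, c_classes, e_classes):
    """Estimate A = (a_ij) from documents labeled in both class systems.

    labeled_docs: list of (c_label, e_label) pairs, one per background document.
    a_ij = M' / M, where M counts the documents labeled CC_i and M' counts
    those among them that are also labeled CE_j.
    """
    c_index = {c: i for i, c in enumerate(c_classes)}
    e_index = {e: j for j, e in enumerate(e_classes)}
    counts = [[0] * len(e_classes) for _ in c_classes]
    for c_label, e_label in labeled_docs:
        counts[c_index[c_label]][e_index[e_label]] += 1
    matrix = []
    for row in counts:
        m = sum(row)  # M for this C-language class
        # Assumption: a C class with no labeled documents gets an all-zero row.
        matrix.append([cnt / m if m else 0.0 for cnt in row])
    return matrix


c_classes = ["C_sports", "C_finance"]
e_classes = ["E_sports", "E_business", "E_entertainment"]

# Toy doubly-labeled background collection.
background = [
    ("C_sports", "E_sports"),
    ("C_sports", "E_sports"),
    ("C_sports", "E_sports"),
    ("C_sports", "E_entertainment"),
    ("C_finance", "E_business"),
    ("C_finance", "E_business"),
]

A = estimate_correlation_matrix(background, c_classes, e_classes)
# A[0] is the row for C_sports, A[1] the row for C_finance.
```

A manually annotated binary matrix, as in option 1), would simply be written out by hand in the same row-per-C-class layout, with entries restricted to 0 and 1.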

Step 4, calculating the probability that the C-language document belongs to each E-language class. Given a C-language document D, the C-language classifier yields the posterior probability vector $α = (p(CC_1 \mid D), p(CC_2 \mid D), \ldots, p(CC_m \mid D))$, where $p(CC_i \mid D)$ is the probability that document D belongs to class $CC_i$. Document D is then translated by machine translation into the E-language document D'. Likewise, for any class $CE_j$, the E-language classifier yields the posterior probability $p(CE_j \mid D')$, that is, the probability that document D' belongs to class $CE_j$.

Step 5, correcting the result of step 4 using the class correlation matrix, by expressing the probability that document D belongs to class $CE_j$ as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$ and $a_{ij}$.

The correction can also be done in several ways, and the user can define the weight of each part according to the application, for example:

1) Define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = λ\, p(CE_j \mid D') + (1 - λ) \max_i p(CC_i \mid D)\, a_{ij}$, where the parameter λ satisfies $0 < λ < 1$ and controls the strength of the correction. In this definition, $p(CE_j \mid D')$ represents the result of the monolingual classifier, and $\max_i p(CC_i \mid D)\, a_{ij}$ is the correction applied to the monolingual classifier according to class correlation.

2) Define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = \max\{p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij}\}$. This method does not require the correction factor λ to be considered when training the classifier, but its classification performance may be less good than that of option 1).
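The two correction rules can be written out directly. The sketch below is one possible transcription, not a reference implementation: the function names `correct_interpolated` and `correct_max`, the list-of-lists matrix layout (rows for C-language classes, columns for E-language classes) and the numeric values are assumptions chosen for illustration.

```python
def correct_interpolated(alpha, beta, A, lam):
    """Option 1: lam * p(CE_j|D') + (1 - lam) * max_i p(CC_i|D) * a_ij."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return [
        lam * beta[j]
        + (1.0 - lam) * max(alpha[i] * A[i][j] for i in range(len(alpha)))
        for j in range(len(beta))
    ]


def correct_max(alpha, beta, A):
    """Option 2: max(p(CE_j|D'), max_i p(CC_i|D) * a_ij)."""
    return [
        max(beta[j], max(alpha[i] * A[i][j] for i in range(len(alpha))))
        for j in range(len(beta))
    ]


alpha = [0.9, 0.1]        # p(CC_i | D) from the C-language classifier
beta = [0.3, 0.5, 0.2]    # p(CE_j | D') from the E-language classifier
A = [[1, 0, 0],           # binary correlation matrix, one row per C class
     [0, 1, 0]]

scores_v1 = correct_interpolated(alpha, beta, A, lam=0.5)
scores_v2 = correct_max(alpha, beta, A)
```

With these inputs the first E-language class receives the highest corrected value under both rules, although the E-language classifier alone preferred the second. Note that the corrected values are not guaranteed to sum to 1, so they are best read as scores for ranking the E-language classes.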

Step 6, classification. Document D is assigned to the class with the largest posterior probability $p(CE_j \mid D)$, which completes the cross-language classification of the document.
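To show how steps 4 to 6 fit together, the following sketch wires the pieces into a single function using the interpolated correction with parameter λ. The trained classifiers and the machine translation system are represented by hypothetical stand-in functions (`fake_classify_c`, `fake_translate`, `fake_classify_e`) that return fixed values; in practice they would be replaced by real components.

```python
def classify_cross_language(doc, classify_c, translate, classify_e,
                            A, e_classes, lam=0.5):
    """Return (best E class, corrected scores) for a C-language document."""
    alpha = classify_c(doc)              # step 4: p(CC_i | D), length m
    beta = classify_e(translate(doc))    # step 4: p(CE_j | D'), length n
    # Step 5: interpolated correction using the class correlation matrix.
    scores = [
        lam * beta[j]
        + (1.0 - lam) * max(alpha[i] * A[i][j] for i in range(len(alpha)))
        for j in range(len(beta))
    ]
    # Step 6: pick the E class with the highest corrected score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return e_classes[best], scores


# Hypothetical stand-ins for the trained classifiers and the translator.
def fake_classify_c(doc):
    return [0.9, 0.1]          # strongly "sports" in the C-language scheme


def fake_translate(doc):
    return "translated: " + doc


def fake_classify_e(doc):
    return [0.3, 0.5, 0.2]     # topic drift: E classifier prefers "business"


e_classes = ["sports", "business", "entertainment"]
A = [[1, 0, 0],
     [0, 1, 0]]

label, scores = classify_cross_language(
    "刘翔夺冠", fake_classify_c, fake_translate, fake_classify_e, A, e_classes
)
```

In this toy setting the E-language classifier alone would choose "business", while the corrected scores select "sports", which is the kind of repair the class correlation is meant to provide.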

It should be understood that this embodiment is a specific example of implementing the invention and should not be taken as limiting its scope of protection. Equivalent modifications or changes made to the foregoing without departing from the spirit and scope of the invention shall all fall within the scope claimed by the invention.

Claims

1. A cross-language text classification method addressing the topic drift problem, whose purpose is to assign a C-language document to be classified to a class of a target language E, comprising the following steps:

Step 1, training a C-language text classifier;

Step 2, training an E-language text classifier;

Step 3, computing the correlation matrix between the C-language classes and the E-language classes, the correlation matrix being expressed as $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ represents the correlation between C-language class $CC_i$ and E-language class $CE_j$, and m and n are the numbers of C-language classes and E-language classes respectively;

Step 4, translating the C-language document to be classified into language E by machine translation, and calculating the probability that the translated document belongs to each E-language class;

Step 5, correcting the result of step 4 using the class correlation matrix;

Step 6, assigning the document to be classified to the E-language class with the highest probability.
2. The cross-language text classification method according to claim 1, characterized in that the correlation matrix in step 3 is computed as follows: for each class in language C, the correlation between it and each class of language E is annotated manually, and the matrix is annotated as a binary matrix, with 1 for related and 0 for unrelated.
3. The cross-language text classification method according to claim 1, characterized in that in step 3 the correlation matrix is obtained by maximum likelihood estimation, specifically:

A background document collection in language C is annotated so that each document is labeled with both a C-language class and an E-language class, the annotated collection having the form:

$χ = \{x^{t}, r_{c}^{t}, r_{a}^{t}\}_{t = 1}^{N}$

where $x^{t}$ is the feature vector extracted from the training set; $r_{c}^{t}$ is an m-dimensional vector giving the training document's label with respect to the C-language classes, such that if a document belongs to C-language class $CC_i$, the i-th component of its $r_{c}^{t}$ is 1 and all other components are 0; $r_{a}^{t}$ is an n-dimensional vector giving the training document's label with respect to the E-language classes, such that if a document belongs to E-language class $CE_j$, the j-th component of its $r_{a}^{t}$ is 1 and all other components are 0;

Let M be the number of documents in the annotated collection χ that are labeled with C-language class $CC_i$, and let M' be the number of those M documents that are also labeled with E-language class $CE_j$; then $a_{ij} = M' / M$.
4. The cross-language text classification method according to claim 1, characterized in that in step 3 the correlation matrix is obtained by a cluster-based annotation method, specifically:

Given a background corpus in language C, a clustering algorithm (such as k-means) is used to cluster it, with the cluster granularity chosen to ensure the purity of the resulting clusters; the correlation of each cluster in the result to the E-language classes is annotated manually; this correlation is binary, with 1 for related and 0 for unrelated, which yields the correlation matrix A.
5. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = λ\, p(CE_j \mid D') + (1 - λ) \max_i p(CC_i \mid D)\, a_{ij}$, where the parameter λ satisfies $0 < λ < 1$ and controls the strength of the correction.
6. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = \max\{p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij}\}$.