CN102411636A - Cross-language text classifying method aiming at topic drift problem - Google Patents

Info

Publication number
CN102411636A
Authority
CN
China
Prior art keywords
class
language
document
languages
language text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011104532367A
Other languages
Chinese (zh)
Inventor
戴林
孙守成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN2011104532367A priority Critical patent/CN102411636A/en
Publication of CN102411636A publication Critical patent/CN102411636A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a cross-language text classification method that addresses the topic drift problem. The method classifies a document written in language C into a class of the target language E and comprises the following steps: 1, training a C-language text classifier; 2, training an E-language text classifier; 3, computing a correlation matrix between the C-language classes and the E-language classes; 4, translating the C-language document to be classified into language E by machine translation and calculating the probability that the translated document belongs to each E-language class; 5, correcting the result of step 4 using the class correlation matrix; and 6, assigning the document to the E-language class with the highest probability. The method corrects the classification result using class correlation, is intuitive and readily interpretable, and addresses the topic drift problem in cross-language text classification.

Description

A cross-language text classification method addressing the topic drift problem
Technical field
The present invention relates to a text classification method, in particular a cross-language text classification method addressing the topic drift problem, and belongs to the technical field of information retrieval.
Background art
The rapid development of the Internet has produced a massive amount of text. Online information is written in many languages, and users sometimes want documents in different languages to be classified under a unified scheme. Cross-language text classification arose to meet this need.
Because countries differ in economy, politics, and culture, the concerns of their people differ too, so web pages written in different languages also differ in content. In cross-language text classification this shows up as the topic drift problem: for documents of the same category in different languages, the features obtained by feature extraction are not the same. For example, the golfer Tiger Woods is very popular in the United States and often appears on English web pages in the "sports" category, whereas the stars who appear most often on Chinese "sports" pages are Liu Xiang and Yao Ming. As a result, feature extraction may yield Tiger Woods among the features that characterize the English sports class, while Liu Xiang and Yao Ming appear among the features that characterize the Chinese sports class.
The topic drift problem makes cross-language text classification harder, and conventional classification methods generally ignore it.
Summary of the invention
The object of the invention is to address this shortcoming of the prior art by taking topic drift into account in cross-language text classification, so that the classification results are more accurate and reasonable.
The idea of the invention is a solution to topic drift based on class correlation. Class correlation measures how related two classes are: the larger its value, the more related the two classes. This correlation is used to correct the results obtained by the monolingual classifiers and thereby improve classification performance.
The object of the invention is achieved through the following technical scheme:
A cross-language text classification method addressing the topic drift problem, whose purpose is to assign a C-language document to be classified to a class of the target language E, comprising the following steps:
Step 1, training a C-language text classifier;
Step 2, training an E-language text classifier;
Step 3, computing the correlation matrix between the C-language classes and the E-language classes. The correlation matrix is written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between C-language class $CC_i$ and E-language class $CE_j$, and m and n are the numbers of C-language classes and E-language classes respectively;
Step 4, translating the C-language document to be classified into language E by machine translation, and calculating the probability that the translated document belongs to each E-language class;
Step 5, correcting the result of step 4 using the class correlation matrix;
Step 6, assigning the document to be classified to the E-language class with the highest probability.
Beneficial effects
The method provided by the invention corrects the classification result using class correlation. It is intuitive and readily interpretable, and it addresses the topic drift problem in cross-language text classification.
Description of drawings
Fig. 1 is a schematic diagram of the basic principle of the invention.
Embodiment
The preferred embodiment of the invention is described in detail below with reference to the accompanying drawing, to ensure a thorough understanding of the example.
Suppose the C-language classes are $CC_1, CC_2, \ldots, CC_m$ and the E-language classes are $CE_1, CE_2, \ldots, CE_n$. Depending on the need, C-language documents may be assigned to E-language classes, or E-language documents to C-language classes. Since the method is the same in both cases, only the classification of C-language documents into E-language classes is discussed here.
As shown in Fig. 1, the relevance between class $CC_i$ and class $CE_j$ is quantified by $a_{ij}$. The document D to be classified is assigned to class $CC_i$ with probability $p(CC_i \mid D)$, and its translation D' is assigned to class $CE_j$ with probability $p(CE_j \mid D')$. The task is to express the probability that document D belongs to class $CE_j$ as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$ and $a_{ij}$.
The concrete classification steps are:
Step 1, training the C-language classifier. This step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier. Commonly used classification algorithms include naive Bayes, nearest neighbor (kNN), and the support vector machine (SVM). In the invention, training of the C-language classifier is not limited to any particular classification algorithm; all of the above algorithms are applicable.
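As an illustration of step 1, the following is a minimal sketch of a multinomial naive Bayes classifier in plain Python. The patent does not prescribe a specific algorithm, so the add-one smoothing, the function names `train_naive_bayes` and `posterior`, and the toy tokenized corpus are all assumptions made for this example only.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Train a multinomial naive Bayes model with add-one smoothing (assumed).

    docs   -- list of token lists (documents are assumed already tokenized)
    labels -- list of class names, one per document
    """
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {"classes": sorted(class_doc_counts), "vocab": vocab,
             "log_prior": {}, "log_like": {}}
    for c in model["classes"]:
        model["log_prior"][c] = math.log(class_doc_counts[c] / len(docs))
        total = sum(word_counts[c].values()) + len(vocab)
        model["log_like"][c] = {
            w: math.log((word_counts[c][w] + 1) / total) for w in vocab
        }
    return model

def posterior(model, tokens):
    """Return {class: p(class | document)}; unseen tokens are ignored."""
    scores = {}
    for c in model["classes"]:
        s = model["log_prior"][c]
        for w in tokens:
            if w in model["vocab"]:
                s += model["log_like"][c][w]
        scores[c] = s
    # Normalize in a numerically stable way so the values sum to 1.
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

# Toy corpus (hypothetical): two "sports" documents and one "finance" document.
docs = [["liu", "xiang", "hurdles"], ["yao", "ming", "basketball"],
        ["stock", "market", "bank"]]
labels = ["sports", "sports", "finance"]
model = train_naive_bayes(docs, labels)
post = posterior(model, ["yao", "ming"])
```

The dictionary returned by `posterior` plays the role of the posterior probability vector that the later steps consume; any classifier that outputs class probabilities could be substituted.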
Step 2, training the E-language classifier. As in the previous step, this step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier, and training of the E-language classifier is likewise not limited to any particular classification algorithm.
Step 3, computing the correlation matrix between the C-language classes and the E-language classes. The correlation matrix is written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between C-language class $CC_i$ (Chinese in this example) and E-language class $CE_j$ (English in this example);
The correlation matrix can be obtained in several ways, for example:
1) Manually annotated binary matrix
For each class in language C, an annotator marks its relevance to each class of language E. The simplest annotation scheme is a binary matrix: 1 for related, 0 for unrelated. This method is simple, but manual annotation of class-to-class correlation is strongly affected by subjective factors.
2) Maximum likelihood estimation
A background document collection in language C is annotated so that each document carries both a C-language class label and an E-language class label. The annotated collection has the form:
$\chi = \{x^t, r_c^t, r_a^t\}_{t=1}^{N}$
where $x^t$ is the feature vector extracted from the t-th training document;
$r_c^t$ is an m-dimensional vector giving the training document's C-language class label: if the document belongs to C-language class $CC_i$, the i-th component of its $r_c^t$ is 1 and all other components are 0; $r_a^t$ is an n-dimensional vector giving the training document's E-language class label: if the document belongs to E-language class $CE_j$, the j-th component of its $r_a^t$ is 1 and all other components are 0.
Let M be the number of documents in the annotated collection $\chi$ that are labeled with C-language class $CC_i$, and let M' be the number of those M documents that are also labeled with E-language class $CE_j$. Then
$a_{ij} = M' / M$
The advantage of this method is that the correlation matrix is computed more accurately; its drawback is the larger workload, since a large background document collection must be annotated manually.
3) Cluster-based annotation
Given a background corpus in language C, a clustering algorithm (such as k-means) is applied to it, with the cluster granularity chosen to guarantee the purity of the resulting clusters. Each cluster in the result is then manually annotated with its correlation to the E-language classes. This correlation is binary, 1 for related and 0 for unrelated, which yields the correlation matrix A. The advantage of this approach is that the correlation matrix is computed more accurately; its drawback is that the manual annotation workload is larger.
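Methods 1) and 2) above can be sketched in Python as follows. This is an illustrative sketch rather than the patent's implementation: the function names, the class names, and the toy annotations are hypothetical, and returning a row of zeros when no document carries a given C-language label is an assumption the text does not address.

```python
def binary_correlation(c_classes, e_classes, related_pairs):
    """Method 1: a_ij = 1 if an annotator marked (CC_i, CE_j) as related, else 0."""
    return [[1.0 if (c, e) in related_pairs else 0.0 for e in e_classes]
            for c in c_classes]

def mle_correlation(c_classes, e_classes, labeled_docs):
    """Method 2: a_ij = M' / M.

    M  -- number of background documents labeled with CC_i
    M' -- number of those M documents that are also labeled with CE_j
    labeled_docs -- list of (c_label, e_label) pairs, one per document
    """
    A = []
    for c in c_classes:
        e_labels = [e_lab for c_lab, e_lab in labeled_docs if c_lab == c]
        M = len(e_labels)
        # Assumption: a class with no labeled documents gets an all-zero row.
        A.append([e_labels.count(e) / M if M else 0.0 for e in e_classes])
    return A

# Hypothetical class names and annotations.
c_classes = ["C_sports", "C_finance"]
e_classes = ["E_sports", "E_business"]

A_binary = binary_correlation(
    c_classes, e_classes,
    {("C_sports", "E_sports"), ("C_finance", "E_business")},
)

labeled_docs = ([("C_sports", "E_sports")] * 3
                + [("C_sports", "E_business")]
                + [("C_finance", "E_business")] * 2)
A_mle = mle_correlation(c_classes, e_classes, labeled_docs)
```

With these toy annotations, three of the four documents labeled `C_sports` are also labeled `E_sports`, so the first row of `A_mle` is 0.75 and 0.25. Each row of the estimated matrix sums to 1 whenever at least one document carries that C-language label.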
Step 4, calculating the probability that the C-language document belongs to each E-language class. Given a C-language document D, the C-language classifier yields the posterior probability vector $\alpha = (p(CC_1 \mid D), p(CC_2 \mid D), \ldots, p(CC_m \mid D))$, where $p(CC_i \mid D)$ is the probability that document D belongs to class $CC_i$. Document D is then translated by machine translation into the E-language document D'. Likewise, for any class $CE_j$, the E-language classifier yields the posterior probability $p(CE_j \mid D')$, that is, the probability that document D' belongs to class $CE_j$.
Step 5, correcting the result of step 4 using the class correlation matrix, so that the probability that document D belongs to class $CE_j$ is expressed as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$ and $a_{ij}$.
The correction can also be done in several ways, and the user can define the weight of each part according to the application, for example:
1) Define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = \lambda\, p(CE_j \mid D') + (1-\lambda) \max_i p(CC_i \mid D)\, a_{ij}$, where the parameter $\lambda$ satisfies $0 < \lambda < 1$ and controls the strength of the correction. In this definition, $p(CE_j \mid D')$ is the result of the monolingual classifier, and $\max_i p(CC_i \mid D)\, a_{ij}$ is the correction applied to the monolingual classifier according to class correlation.
2) Define the probability that document D belongs to class $CE_j$ as $p(CE_j \mid D) = \max\{p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij}\}$. This method has no correction factor $\lambda$ to tune, but its classification performance may be less ideal than that of method 1.
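The two correction rules can be written directly from their definitions. The sketch below uses hypothetical function names and toy numbers: `alpha` stands for the vector of $p(CC_i \mid D)$, `beta` for the vector of $p(CE_j \mid D')$, and `A` for the correlation matrix as a list of rows.

```python
def correction_term(alpha, A, j):
    """max_i p(CC_i | D) * a_ij for a fixed E-language class index j."""
    return max(alpha[i] * A[i][j] for i in range(len(alpha)))

def correct_linear(alpha, beta, A, lam):
    """Rule 1: lam * p(CE_j | D') + (1 - lam) * max_i p(CC_i | D) * a_ij."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return [lam * beta[j] + (1.0 - lam) * correction_term(alpha, A, j)
            for j in range(len(beta))]

def correct_max(alpha, beta, A):
    """Rule 2: max{ p(CE_j | D'), max_i p(CC_i | D) * a_ij }."""
    return [max(beta[j], correction_term(alpha, A, j))
            for j in range(len(beta))]

# Toy values (hypothetical): the C-language classifier is confident about
# class 0, while the classifier on the translated text leans towards class 1.
alpha = [0.9, 0.1]
beta = [0.4, 0.6]
A = [[1.0, 0.0], [0.0, 1.0]]

linear_scores = correct_linear(alpha, beta, A, lam=0.5)
max_scores = correct_max(alpha, beta, A)
```

In this toy case both rules move the top score from class 1 to class 0. The corrected values are scores for ranking; under rule 2 in particular they need not sum to 1.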
Step 6, classification. Document D is assigned to the class with the largest posterior probability $p(CE_j \mid D)$, which completes the cross-language classification of the document.
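Steps 4 to 6 can be combined into a single routine, sketched here under stated assumptions. The callables `c_posterior`, `e_posterior`, and `translate` are hypothetical stand-ins for the trained classifiers and the machine translation system, the default `lam=0.5` is an arbitrary choice, and the first correction rule is used.

```python
def classify_cross_language(doc, c_posterior, e_posterior, translate,
                            A, e_classes, lam=0.5):
    """Return (best E-language class, corrected scores) for a C-language doc."""
    alpha = c_posterior(doc)             # p(CC_i | D), length m
    beta = e_posterior(translate(doc))   # p(CE_j | D'), length n
    scores = []
    for j in range(len(e_classes)):
        corr = max(alpha[i] * A[i][j] for i in range(len(alpha)))
        scores.append(lam * beta[j] + (1.0 - lam) * corr)
    # Step 6: pick the class with the largest corrected score.
    best = max(range(len(e_classes)), key=scores.__getitem__)
    return e_classes[best], scores

# Stub components with fixed outputs, for illustration only.
e_classes = ["E_sports", "E_business"]
A = [[1.0, 0.0], [0.0, 1.0]]
label, scores = classify_cross_language(
    "document text in language C",
    c_posterior=lambda d: [0.9, 0.1],
    e_posterior=lambda d: [0.4, 0.6],
    translate=lambda d: d,  # placeholder: a real system would call an MT engine
    A=A,
    e_classes=e_classes,
)
```

With these stub outputs the classifier on the translated text alone would choose `E_business`, whereas the corrected scores select `E_sports`, which is the kind of adjustment the class correlation is meant to provide.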
It should be understood that this embodiment is a specific example of carrying out the invention and should not limit the scope of protection of the invention. Equivalent modifications or changes made to the above without departing from the spirit and scope of the invention shall all fall within the scope of protection claimed.

Claims (6)

  1. A cross-language text classification method addressing the topic drift problem, whose purpose is to assign a C-language document to be classified to a class of the target language E, comprising the following steps:
    Step 1, training a C-language text classifier;
    Step 2, training an E-language text classifier;
    Step 3, computing the correlation matrix between the C-language classes and the E-language classes, the correlation matrix being written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between C-language class $CC_i$ and E-language class $CE_j$, and m and n are the numbers of C-language classes and E-language classes respectively;
    Step 4, translating the C-language document to be classified into language E by machine translation, and calculating the probability that the translated document belongs to each E-language class;
    Step 5, correcting the result of step 4 using the class correlation matrix;
    Step 6, assigning the document to be classified to the E-language class with the highest probability.
  2. The cross-language text classification method according to claim 1, characterized in that the correlation matrix in step 3 is computed as follows: for each class in language C, its relevance to each class of language E is annotated manually, and the matrix is annotated as a binary matrix, with 1 for related and 0 for unrelated.
  3. The cross-language text classification method according to claim 1, characterized in that, in step 3, the correlation matrix is obtained by maximum likelihood estimation, specifically:
    A background document collection in language C is annotated so that each document carries both a C-language class label and an E-language class label, the annotated collection having the form:
    $\chi = \{x^t, r_c^t, r_a^t\}_{t=1}^{N}$
    where $x^t$ is the feature vector extracted from the t-th training document;
    $r_c^t$ is an m-dimensional vector giving the training document's C-language class label: if the document belongs to C-language class $CC_i$, the i-th component of its $r_c^t$ is 1 and all other components are 0;
    $r_a^t$ is an n-dimensional vector giving the training document's E-language class label: if the document belongs to E-language class $CE_j$, the j-th component of its $r_a^t$ is 1 and all other components are 0;
    Let M be the number of documents in the annotated collection $\chi$ that are labeled with C-language class $CC_i$, and let M' be the number of those M documents that are also labeled with E-language class $CE_j$; then
    $a_{ij} = M' / M$
  4. The cross-language text classification method according to claim 1, characterized in that, in step 3, the correlation matrix is obtained by a cluster-based annotation method, specifically:
    Given a background corpus in language C, a clustering algorithm (such as k-means) is applied to it, with the cluster granularity chosen to guarantee the purity of the resulting clusters; each cluster in the result is manually annotated with its correlation to the E-language classes; this correlation is binary, 1 for related and 0 for unrelated, which yields the correlation matrix A.
  5. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: the probability that document D belongs to class $CE_j$ is defined as $p(CE_j \mid D) = \lambda\, p(CE_j \mid D') + (1-\lambda) \max_i p(CC_i \mid D)\, a_{ij}$, where the parameter $\lambda$ satisfies $0 < \lambda < 1$ and controls the strength of the correction.
  6. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: the probability that document D belongs to class $CE_j$ is defined as $p(CE_j \mid D) = \max\{p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij}\}$.
CN2011104532367A 2011-12-30 2011-12-30 Cross-language text classifying method aiming at topic drift problem Pending CN102411636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011104532367A CN102411636A (en) 2011-12-30 2011-12-30 Cross-language text classifying method aiming at topic drift problem

Publications (1)

Publication Number Publication Date
CN102411636A true CN102411636A (en) 2012-04-11

Family

ID=45913707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011104532367A Pending CN102411636A (en) 2011-12-30 2011-12-30 Cross-language text classifying method aiming at topic drift problem

Country Status (1)

Country Link
CN (1) CN102411636A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN103577498A (en) * 2012-08-09 2014-02-12 北京百度网讯科技有限公司 Method and device for automatically establishing classification rule for cross-language
CN104584005A (en) * 2012-08-22 2015-04-29 株式会社东芝 Document classification device and document classification method

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071152A1 (en) * 2003-09-29 2005-03-31 Hitachi, Ltd. Cross lingual text classification apparatus and method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071152A1 (en) * 2003-09-29 2005-03-31 Hitachi, Ltd. Cross lingual text classification apparatus and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NURIA BEL,CORNELIS H.A.KOSTER,MARTA VILLEGAS: "《Cross-Lingual Text Categorization》", 《LNCS》 *
高影繁: "《基于跨语言文本分类的多语资源组织方法研究》", 《信息***》 *
高影繁等: "《跨语言文本分类技术研究进展》", 《综述与述评》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577498A (en) * 2012-08-09 2014-02-12 北京百度网讯科技有限公司 Method and device for automatically establishing classification rule for cross-language
CN103577498B (en) * 2012-08-09 2018-09-07 北京百度网讯科技有限公司 A kind of method and apparatus building classifying rules automatically across language
CN104584005A (en) * 2012-08-22 2015-04-29 株式会社东芝 Document classification device and document classification method
CN104584005B (en) * 2012-08-22 2018-01-05 株式会社东芝 Document sorting apparatus and Document Classification Method
CN103336852A (en) * 2013-07-24 2013-10-02 清华大学 Cross-language ontology construction method and device
CN103336852B (en) * 2013-07-24 2017-04-05 清华大学 Across language ontology construction method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120411