CN102411636A - Cross-language text classifying method aiming at topic drift problem - Google Patents
Cross-language text classifying method aiming at topic drift problem
- Publication number
- CN102411636A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a cross-language text classification method that addresses the topic drift problem. Its aim is to assign a document to be classified, written in language C, to a class of the target language E. The method comprises the following steps: 1, training a text classifier for language C; 2, training a text classifier for language E; 3, computing a correlation matrix between the C-language classes and the E-language classes; 4, translating the C-language document to be classified into language E by machine translation and calculating the probability that the translated document belongs to each E-language class; 5, correcting the result of step 4 using the class correlation matrix; and 6, assigning the document to the E-language class with the highest probability. Because the method corrects the classification result using class correlation, it is intuitive, has good interpretability, and addresses the topic drift problem in cross-language text classification.
Description
Technical field
The present invention relates to a text classification method, in particular a cross-language text classification method that addresses the topic drift problem, and belongs to the technical field of information retrieval.
Background art
The rapid development of the Internet has produced massive amounts of text, and online information is written in many languages. Users sometimes want documents written in different languages to be classified under a single scheme, and cross-language text classification arose to meet this need.
Because of differences in economy, politics, and culture between countries, people in different countries pay attention to different things, so the content of web pages written in different languages also differs. In cross-language text classification this shows up as the topic drift problem: for documents of the same class in different languages, the features obtained by feature extraction are not the same. For example, the golfer Tiger Woods is very popular in the United States and often appears on English web pages in the "sports" class, whereas the stars who appear more often on Chinese "sports" web pages are Liu Xiang and Yao Ming. During feature extraction, Tiger Woods may therefore appear among the features that characterize the English sports class, while Liu Xiang and Yao Ming appear among the features that characterize the Chinese sports class.
The topic drift problem makes cross-language text classification more difficult, and general classification methods have mostly ignored it.
Summary of the invention
The object of the invention is to address this shortcoming of the prior art by taking the topic drift problem into account in cross-language text classification, so that the classification results are more accurate and reasonable.
The idea of the invention is a solution to topic drift based on class correlation. Class correlation measures the correlation between two classes: the larger its value, the more related the two classes are. This correlation is used to correct the results obtained by the monolingual classifiers and so improve classification performance.
The object of the invention is achieved through the following technical solution:
A cross-language text classification method addressing the topic drift problem, whose aim is to assign a C-language document to be classified to a class of the target language E, comprising the following steps:
Step 1, train a text classifier for language C;
Step 2, train a text classifier for language E;
Step 3, compute the correlation matrix between the C-language classes and the E-language classes; the correlation matrix is written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between C-language class $CC_i$ and E-language class $CE_j$, and $m$ and $n$ are the numbers of C-language and E-language classes respectively;
Step 4, translate the C-language document to be classified into language E by machine translation, and calculate the probability that the translated document belongs to each E-language class;
Step 5, correct the result of step 4 using the class correlation matrix;
Step 6, assign the document to be classified to the E-language class with the highest probability.
Beneficial effects
The method provided by the invention uses class correlation to correct the classification result. It is intuitive, has good interpretability, and addresses the topic drift problem in cross-language text classification.
Description of drawings
Fig. 1 is a schematic diagram of the basic principle of the invention.
Embodiments
The preferred embodiment of the invention is described in detail below with reference to the accompanying drawing, to ensure a thorough understanding of the invention.
Suppose the C-language classes are $CC_1, CC_2, \ldots, CC_m$ and the E-language classes are $CE_1, CE_2, \ldots, CE_n$. Depending on the need, we may assign C-language documents to E-language classes, or E-language documents to C-language classes. Since the method is the same in both cases, we discuss only how to classify a C-language document into the E-language classes.
As shown in Fig. 1, the correlation between class $CC_i$ and class $CE_j$ is quantified by $a_{ij}$. The document $D$ to be classified is assigned to class $CC_i$ with probability $p(CC_i \mid D)$, and its translation $D'$ is assigned to class $CE_j$ with probability $p(CE_j \mid D')$. Our task is to express the probability that document $D$ belongs to class $CE_j$ as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$, and $a_{ij}$.
The specific classification steps are:
Step 1, train the C-language classifier. This step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier. Commonly used classification algorithms include naive Bayes, k-nearest neighbors (kNN), and support vector machines (SVM). In the invention, training the C-language classifier is not limited to any particular classification algorithm; all of the above are applicable.
Step 2, train the E-language classifier. As in the previous step, this step can be subdivided into corpus collection, text representation, and training on the training set to obtain the classifier, and training the E-language classifier is likewise not limited to any particular classification algorithm.
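As one illustration of Steps 1 and 2, the sketch below trains a small multinomial naive Bayes classifier and returns class posteriors of the kind the later steps consume. It is a minimal sketch, not the patent's implementation: the function names `train_naive_bayes` and `posterior`, the add-one smoothing, and the toy documents are assumptions made for the example, and tokenization (text representation) is assumed to have been done already.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Train a multinomial naive Bayes model.

    docs   -- list of token lists (one list per training document)
    labels -- list of class names, one per document
    """
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> token frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "classes": sorted(class_counts),
        "class_counts": class_counts,
        "word_counts": word_counts,
        "vocab": vocab,
        "n_docs": len(docs),
    }


def posterior(model, tokens):
    """Return {class: p(class | tokens)} using add-one smoothing."""
    vocab_size = len(model["vocab"])
    log_scores = {}
    for c in model["classes"]:
        total = sum(model["word_counts"][c].values())
        score = math.log(model["class_counts"][c] / model["n_docs"])
        for t in tokens:
            if t in model["vocab"]:  # skip tokens never seen in training
                count = model["word_counts"][c][t]
                score += math.log((count + 1) / (total + vocab_size))
        log_scores[c] = score
    # Normalize in log space to avoid numerical underflow.
    top = max(log_scores.values())
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


# Toy training data, for illustration only.
model = train_naive_bayes(
    [["golf", "woods"], ["golf", "open"], ["stock", "market"]],
    ["sports", "sports", "finance"],
)
probs = posterior(model, ["golf"])
```

The same two functions could be used once per language, giving one model for language C and one for language E; any classifier that outputs class probabilities would serve equally well.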
Step 3, compute the correlation matrix between the C-language classes and the E-language classes. The correlation matrix is written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between the Chinese class $CC_i$ and the English class $CE_j$.
The correlation matrix can be obtained in several ways, for example:
1) Manually labeled binary matrix
For each class in language C, manually mark its correlation with each class in language E. The simplest way to mark is to make the matrix binary: related pairs take the value 1 and unrelated pairs take the value 0. This method is simple, but manual judgments of the correlation between classes are strongly affected by subjective factors.
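The manual binary labeling can be sketched as follows. This is only an illustration: the helper `binary_correlation_matrix`, the class names, and the marked pairs are hypothetical and stand in for judgments a human annotator would supply.

```python
def binary_correlation_matrix(c_classes, e_classes, related_pairs):
    """Build an m x n 0/1 matrix from manually marked (C class, E class) pairs."""
    related = set(related_pairs)
    return [[1 if (c, e) in related else 0 for e in e_classes]
            for c in c_classes]


# Hypothetical class names and manual judgments, for illustration only.
c_classes = ["CC_sports", "CC_finance"]
e_classes = ["CE_sports", "CE_business", "CE_politics"]
marked = [("CC_sports", "CE_sports"), ("CC_finance", "CE_business")]

A = binary_correlation_matrix(c_classes, e_classes, marked)
print(A)  # [[1, 0, 0], [0, 1, 0]]
```

Rows follow the order of the C-language classes and columns the order of the E-language classes, so `A[i][j]` plays the role of the element in row i, column j of the correlation matrix.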
2) Maximum likelihood estimation
Annotate a background document set in language C so that each document is labeled with both a C-language class and an E-language class. Each element of the labeled document set consists of a feature vector and two label vectors: $x_t$ is the feature vector extracted from the training set; the C-language label is an $m$-dimensional vector, and if a document belongs to C-language class $CC_i$, the $i$-th component of its C-language label is 1 and all other components are 0; the E-language label is an $n$-dimensional vector, and if a document belongs to E-language class $CE_j$, the $j$-th component of its E-language label is 1 and all other components are 0.
Let $M$ be the number of documents in the labeled set that are labeled with C-language class $CC_i$, and let $M'$ be the number of those $M$ documents that are also labeled with E-language class $CE_j$. Then $a_{ij} = M'/M$.
The advantage of this method is that the correlation matrix is computed more accurately; its drawback is the heavy workload, since a large background document set must be annotated manually.
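The counting behind the maximum likelihood estimate can be sketched as below, assuming the estimate takes the usual relative-frequency form, i.e. the entry for a pair of classes is M'/M with M and M' as defined above. The function name `mle_correlation_matrix` is hypothetical, and class indices are used in place of the one-hot label vectors purely for brevity.

```python
def mle_correlation_matrix(c_labels, e_labels, m, n):
    """Estimate the m x n correlation matrix from doubly labeled documents.

    c_labels[t] -- index (0..m-1) of the C-language class of document t
    e_labels[t] -- index (0..n-1) of the E-language class of document t
    """
    counts = [[0] * n for _ in range(m)]
    for i, j in zip(c_labels, e_labels):
        counts[i][j] += 1  # document labeled with both CC_i and CE_j
    matrix = []
    for row in counts:
        total = sum(row)  # M: documents labeled with this C-language class
        # Each entry is M' / M; a class with no documents gets a zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Six toy documents: four in C class 0, two in C class 1.
c_labels = [0, 0, 0, 0, 1, 1]
e_labels = [0, 0, 0, 1, 1, 1]
A = mle_correlation_matrix(c_labels, e_labels, m=2, n=2)
print(A)  # [[0.75, 0.25], [0.0, 1.0]]
```

Under this assumption each non-empty row sums to 1, so a row can be read as the distribution of E-language classes among documents of one C-language class.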
3) Cluster-based labeling
Given a background corpus in language C, cluster it with a clustering algorithm (such as k-means), choosing the cluster granularity so that the resulting clusters are pure. Then manually mark the correlation of each cluster in the result with the E-language classes. This correlation is binary: related takes the value 1 and unrelated takes the value 0. The correlation matrix A is obtained in this way. The advantage of this approach is that the correlation matrix is computed more accurately; its drawback is the heavy manual labeling workload.
Step 4, calculate the probability that the C-language document belongs to each E-language class. Given a C-language document $D$, the C-language classifier yields the posterior probability vector $\alpha = (p(CC_1 \mid D), p(CC_2 \mid D), \ldots, p(CC_m \mid D))$, where $p(CC_i \mid D)$ is the probability that document $D$ belongs to class $CC_i$. Document $D$ is then translated into the E-language document $D'$ by machine translation. Likewise, for any class $CE_j$, the E-language classifier yields the posterior probability $p(CE_j \mid D')$, i.e. the probability that document $D'$ belongs to class $CE_j$.
Step 5, correct the result of step 4 using the class correlation matrix, expressing the probability that document $D$ belongs to class $CE_j$ as a function of $p(CC_i \mid D)$, $p(CE_j \mid D')$, and $a_{ij}$.
The correction can be done in several ways, and the user can define the weight of each part according to the application, for example:
1) Define the probability that document $D$ belongs to class $CE_j$ as $p(CE_j \mid D) = \lambda\, p(CE_j \mid D') + (1 - \lambda) \max_i p(CC_i \mid D)\, a_{ij}$, where $0 < \lambda < 1$ and the parameter $\lambda$ adjusts the strength of the correction. In this definition, $p(CE_j \mid D')$ is the output of the monolingual classifier, and $\max_i p(CC_i \mid D)\, a_{ij}$ is the correction applied to the monolingual classifier according to the class correlation.
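The weighted correction in 1) can be sketched as follows. The function name `correct_linear` and the numbers are assumptions for illustration: `alpha` holds the C-language posteriors for the original document, `beta` holds the E-language posteriors for its translation, `A` is the class correlation matrix, and `lam` is the parameter λ.

```python
def correct_linear(alpha, beta, A, lam):
    """Return corrected scores: lam * beta[j] + (1 - lam) * max_i alpha[i] * A[i][j]."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    corrected = []
    for j, p_e in enumerate(beta):
        # Strongest support for E class j coming from any C class.
        support = max(alpha[i] * A[i][j] for i in range(len(alpha)))
        corrected.append(lam * p_e + (1.0 - lam) * support)
    return corrected


# Toy values: the C classifier is confident in class 0,
# while the E classifier slightly prefers class 1 for the translation.
alpha = [0.9, 0.1]
beta = [0.4, 0.6]
A = [[1, 0],
     [0, 1]]
scores = correct_linear(alpha, beta, A, lam=0.5)
```

With these toy values the corrected score of E class 0 is about 0.65 and that of E class 1 about 0.35, so the correction overturns the preference of the E-language classifier alone.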
2) Define the probability that document $D$ belongs to class $CE_j$ as $p(CE_j \mid D) = \max\{\, p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij} \,\}$. With this method the correction factor $\lambda$ need not be considered when training the classifier, but the classification performance may not be as good as with method 1).
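The max-based correction in 2) can be sketched in the same style. The function name `correct_max` and the values are again hypothetical: `alpha` holds the C-language posteriors, `beta` the E-language posteriors of the translation, and `A` the class correlation matrix.

```python
def correct_max(alpha, beta, A):
    """Return corrected scores: max(beta[j], max_i alpha[i] * A[i][j])."""
    corrected = []
    for j, p_e in enumerate(beta):
        # Strongest support for E class j coming from any C class.
        support = max(alpha[i] * A[i][j] for i in range(len(alpha)))
        corrected.append(max(p_e, support))
    return corrected


# Same toy values as a weighted-correction example would use.
alpha = [0.9, 0.1]
beta = [0.4, 0.6]
A = [[1, 0],
     [0, 1]]
scores = correct_max(alpha, beta, A)
```

Note that the corrected scores need not sum to 1; since only the largest score is used for the final decision, that does not affect which class is chosen.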
Step 6, classification. Document $D$ is assigned to the class with the largest posterior probability $p(CE_j \mid D)$, which completes the cross-language classification of the document.
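Steps 4 to 6 can be put together in a short sketch. Everything named here is an assumption for illustration: `classify_cross_language` is a hypothetical helper, the two classifiers and the translator are passed in as plain callables (stubbed with fixed outputs below), and the weighted correction rule with a default λ of 0.5 is used.

```python
def classify_cross_language(doc, c_classifier, e_classifier, translate,
                            A, e_classes, lam=0.5):
    """Assign a C-language document to an E-language class.

    c_classifier(doc) -- posterior list over the m C-language classes
    e_classifier(doc) -- posterior list over the n E-language classes
    translate(doc)    -- machine translation from language C to language E
    """
    alpha = c_classifier(doc)             # step 4: posteriors for D
    beta = e_classifier(translate(doc))   # step 4: posteriors for D'
    scores = []
    for j in range(len(e_classes)):       # step 5: correction
        support = max(a * row[j] for a, row in zip(alpha, A))
        scores.append(lam * beta[j] + (1.0 - lam) * support)
    best = max(range(len(scores)), key=scores.__getitem__)
    return e_classes[best], scores        # step 6: highest corrected score


# Stub components with fixed outputs, for illustration only.
label, scores = classify_cross_language(
    doc="(a C-language document)",
    c_classifier=lambda d: [0.9, 0.1],
    e_classifier=lambda d: [0.4, 0.6],
    translate=lambda d: d,
    A=[[1, 0], [0, 1]],
    e_classes=["CE_sports", "CE_finance"],
)
```

In this toy run the E-language classifier alone would pick the second class, while the corrected scores favor the first, which is the kind of adjustment the class correlation is meant to provide.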
It should be understood that this embodiment is a specific example of how the invention may be implemented and should not limit the scope of protection of the invention. Equivalent modifications or changes made to the above without departing from the spirit and scope of the invention shall all fall within the scope claimed by the invention.
Claims (6)
- 1. A cross-language text classification method addressing the topic drift problem, whose aim is to assign a C-language document to be classified to a class of the target language E, comprising the following steps: Step 1, train a text classifier for language C; Step 2, train a text classifier for language E; Step 3, compute the correlation matrix between the C-language classes and the E-language classes, the correlation matrix being written $A = (a_{ij})_{m \times n}$, where element $a_{ij}$ is the correlation between C-language class $CC_i$ and E-language class $CE_j$, and $m$ and $n$ are the numbers of C-language and E-language classes respectively; Step 4, translate the C-language document to be classified into language E by machine translation, and calculate the probability that the translated document belongs to each E-language class; Step 5, correct the result of step 4 using the class correlation matrix; Step 6, assign the document to be classified to the E-language class with the highest probability.
- 2. The cross-language text classification method according to claim 1, characterized in that the correlation matrix in step 3 is computed as follows: for each class in language C, the correlation between it and each class in language E is marked manually, and the matrix is marked as a binary matrix, in which related pairs take the value 1 and unrelated pairs take the value 0.
- 3. The cross-language text classification method according to claim 1, characterized in that in step 3 the correlation matrix is obtained by maximum likelihood estimation, specifically: a background document set in language C is annotated so that each document is labeled with both a C-language class and an E-language class; each element of the labeled document set consists of a feature vector and two label vectors, where $x_t$ is the feature vector extracted from the training set; the C-language label is an $m$-dimensional vector, and if a document belongs to C-language class $CC_i$, the $i$-th component of its C-language label is 1 and all other components are 0; the E-language label is an $n$-dimensional vector, and if a document belongs to E-language class $CE_j$, the $j$-th component of its E-language label is 1 and all other components are 0;
- 4. The cross-language text classification method according to claim 1, characterized in that in step 3 the correlation matrix is obtained by cluster-based labeling, specifically: given a background corpus in language C, it is clustered with a clustering algorithm (such as k-means), the cluster granularity being chosen so that the resulting clusters are pure; the correlation of each cluster in the result with the E-language classes is marked manually; this correlation is binary, related taking the value 1 and unrelated taking the value 0, whereby the correlation matrix A is obtained.
- 5. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: the probability that document $D$ belongs to class $CE_j$ is defined as $p(CE_j \mid D) = \lambda\, p(CE_j \mid D') + (1 - \lambda) \max_i p(CC_i \mid D)\, a_{ij}$, where $0 < \lambda < 1$ and the parameter $\lambda$ adjusts the strength of the correction.
- 6. The cross-language text classification method according to any one of claims 1 to 4, characterized in that the correction method in step 5 is: the probability that document $D$ belongs to class $CE_j$ is defined as $p(CE_j \mid D) = \max\{\, p(CE_j \mid D'),\ \max_i p(CC_i \mid D)\, a_{ij} \,\}$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104532367A CN102411636A (en) | 2011-12-30 | 2011-12-30 | Cross-language text classifying method aiming at topic drift problem |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104532367A CN102411636A (en) | 2011-12-30 | 2011-12-30 | Cross-language text classifying method aiming at topic drift problem |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102411636A (en) | 2012-04-11 |
Family
ID=45913707
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011104532367A Pending CN102411636A (en) | 2011-12-30 | 2011-12-30 | Cross-language text classifying method aiming at topic drift problem |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102411636A (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050071152A1 (en) * | 2003-09-29 | 2005-03-31 | Hitachi, Ltd. | Cross lingual text classification apparatus and method |
Non-Patent Citations (3)
Title |
---|
NURIA BEL,CORNELIS H.A.KOSTER,MARTA VILLEGAS: "《Cross-Lingual Text Categorization》", 《LNCS》 * |
高影繁: "《基于跨语言文本分类的多语资源组织方法研究》", 《信息***》 * |
高影繁等: "《跨语言文本分类技术研究进展》", 《综述与述评》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577498A (en) * | 2012-08-09 | 2014-02-12 | 北京百度网讯科技有限公司 | Method and device for automatically establishing classification rule for cross-language |
CN103577498B (en) * | 2012-08-09 | 2018-09-07 | 北京百度网讯科技有限公司 | A kind of method and apparatus building classifying rules automatically across language |
CN104584005A (en) * | 2012-08-22 | 2015-04-29 | 株式会社东芝 | Document classification device and document classification method |
CN104584005B (en) * | 2012-08-22 | 2018-01-05 | 株式会社东芝 | Document sorting apparatus and Document Classification Method |
CN103336852A (en) * | 2013-07-24 | 2013-10-02 | 清华大学 | Cross-language ontology construction method and device |
CN103336852B (en) * | 2013-07-24 | 2017-04-05 | 清华大学 | Across language ontology construction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20120411 |