CN102651003A - Cross-language searching method and device - Google Patents

Publication number
CN102651003A
Authority
CN
China
Prior art keywords
query
source language
translation
search
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011100478927A
Other languages
Chinese (zh)
Other versions
CN102651003B (en
Inventor
赵世奇
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201110047892.7A (CN102651003B)
Priority to PCT/CN2011/083420 (WO2012116562A1)
Publication of CN102651003A
Application granted
Publication of CN102651003B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3332: Query translation
    • G06F16/3337: Translation of the query language, e.g. Chinese to English

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-language searching method and a cross-language searching device, wherein the method includes the steps as follows: A, receiving a source language search query input by a user; B, translating the source language search query into N types of target language queries, wherein N is an integer larger than 1; C, obtaining corresponding searching results of the N types of target language queries respectively; and D, integrating the searching result obtained in the step C, forming the final searching result set, and providing the final searching result set for the user, wherein, in the final searching result set, all searching results are sequenced according to ranks of all searching results in respective classifications and rank weights of the classifications. Through the method and the device, the searching results can contain multi-language files, so that better and more searching results can be provided for the user.

Description

Method and apparatus for cross-language search
[Technical field]
The present invention relates to the field of Internet technology, and in particular to a method and apparatus for cross-language search.
[Background art]
With the continuous growth of information on the Internet, people place higher demands on information search: they are no longer satisfied with searching a document set in a single language and want to obtain documents in multiple languages. For example, if the search term (query) entered by a user is "Beckham's picture", a search within the Chinese document set may not satisfy the user's needs to the greatest extent, whereas the English document sets of European and American websites may hold better and more numerous search results.
As the demand for searching multilingual document sets keeps rising, people hope, in order to obtain more, more comprehensive and more accurate information and at the same time to overcome the language barrier, to describe a query in a language they are familiar with and to receive search results that contain documents in multiple languages, that is, to perform cross-language search between languages.
[Summary of the invention]
In view of this, the invention provides a method and apparatus for cross-language search that deliver search results containing documents in multiple languages, giving the user better and more numerous search results.
The specific technical scheme is as follows:
A method for cross-language search, comprising:
A. Receiving a source language search request (query) input by a user;
B. Translating the source language query into target language queries in N target languages, N being an integer greater than 1;
C. Obtaining the search results corresponding to each of the N target language queries;
D. Integrating the search results obtained in step C to form a final search result set, and providing it to the user;
Wherein, in the final search result set, the search results are sorted according to the rank of each search result within its category and the ranking weight of that category.
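Steps A to C can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `translate` and `search` are hypothetical callables supplied by the caller, and step D (integration and ranking) is left to whatever consumes the returned mapping.

```python
from typing import Callable, Dict, List


def cross_language_search(
    source_query: str,
    target_langs: List[str],
    translate: Callable[[str, str], str],     # hypothetical: (query, lang) -> translated query
    search: Callable[[str, str], List[str]],  # hypothetical: (query, lang) -> result list
) -> Dict[str, List[str]]:
    """Steps A-C: translate the source query into each target language and search."""
    if len(target_langs) < 2:
        raise ValueError("N must be an integer greater than 1")
    results: Dict[str, List[str]] = {}
    for lang in target_langs:
        target_query = translate(source_query, lang)  # step B
        results[lang] = search(target_query, lang)    # step C
    return results  # input to step D (integration and ranking)
```

Passing the translator and the search backend as parameters keeps the sketch independent of any particular engine.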
In step B, for each target language, the translation result with the highest translation score among the translation results of the source language query in that target language is taken as the target language query;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
Preferably, step B specifically comprises:
B1. Performing optimization processing on the source language query, the optimization processing comprising either or both of query error correction and query expansion;
B2. Translating the optimized source language query into target language queries in N target languages.
If the optimization processing comprises only query error correction, performing query error correction on the source language query input by the user yields a source language query set Q1 containing n1 queries, n1 being a preset positive integer;
Step B2 is then specifically: for each target language, translating each query in Q1 separately and taking the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is
$\sum_{i=1}^{n1} P(e \mid q_i)$
Where P(e|q_i) is the translation score of q_i in Q1 being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
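The selection in step B2 can be sketched as below. The sketch assumes that a translation component (not shown, and not specified here) has already produced, for each source query q_i, a mapping from candidate translation e to its score P(e|q_i); the function name and data layout are illustrative only.

```python
from collections import defaultdict
from typing import Dict, List


def pick_target_query(candidates_per_query: List[Dict[str, float]]) -> str:
    """candidates_per_query[i] maps each translation e to P(e | q_i).

    Returns the translation whose scores, summed over all q_i, are highest.
    Raises ValueError if no candidate is supplied.
    """
    totals: Dict[str, float] = defaultdict(float)
    for scores in candidates_per_query:
        for e, p in scores.items():
            totals[e] += p  # accumulate the sum over i of P(e | q_i)
    return max(totals, key=totals.get)
```

A translation that is only second best for each individual query can still win if several queries in the set agree on it.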
If the optimization processing comprises only query expansion, performing query expansion on the source language query input by the user yields a source language query set Q2 containing n2 queries, n2 being a preset positive integer;
Step B2 is then specifically: for each target language, translating each query in Q2 separately and taking the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is
$\sum_{i=1}^{n2} P(e \mid q_i)$
Where P(e|q_i) is the translation score of q_i in Q2 being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
If the optimization processing comprises both query error correction and query expansion, performing query error correction and query expansion on the source language query input by the user yields a source language query set Q containing n queries, n being a preset positive integer;
Step B2 is then specifically: for each target language, translating each query in Q separately and taking the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is
$\sum_{i=1}^{n} P(e \mid q_i)$
Where P(e|q_i) is the translation score of q_i in Q being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
Wherein, performing query error correction and query expansion on the source language query input by the user to obtain the source language query set Q containing n queries specifically comprises:
Performing query error correction on the source language query input by the user to obtain a source language query set Q1 containing n1 queries, n1 being a preset positive integer, and performing query expansion on each query in Q1 to obtain the source language query set Q containing n queries; or,
Performing query expansion on the source language query input by the user to obtain a source language query set Q2 containing n2 queries, n2 being a preset positive integer, and performing query error correction on each query in Q2 to obtain the source language query set Q containing n queries; or,
Performing query error correction and query expansion on the source language query input by the user at the same time, obtaining respectively a source language query set Q1 containing n1 queries and a source language query set Q2 containing n2 queries, and taking the union of Q1 and Q2 to obtain the source language query set Q containing n queries.
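The three ways of building Q can be sketched as follows. `correct` and `expand` are hypothetical callables standing in for query error correction and query expansion; the text does not say how Q is limited to n queries, so keeping the first n distinct queries in generation order is an assumption of this sketch.

```python
from typing import Callable, Iterable, List

QueryFn = Callable[[str], Iterable[str]]  # hypothetical: query -> candidate queries


def _dedupe(queries: Iterable[str], n: int) -> List[str]:
    """Keep the first n distinct queries, preserving order (assumed policy)."""
    out: List[str] = []
    for q in queries:
        if q not in out:
            out.append(q)
    return out[:n]


def correct_then_expand(q: str, correct: QueryFn, expand: QueryFn, n: int) -> List[str]:
    # Q1 = correct(q); Q is obtained by expanding every query in Q1.
    return _dedupe((e for c in correct(q) for e in expand(c)), n)


def expand_then_correct(q: str, correct: QueryFn, expand: QueryFn, n: int) -> List[str]:
    # Q2 = expand(q); Q is obtained by correcting every query in Q2.
    return _dedupe((c for e in expand(q) for c in correct(e)), n)


def union_of_both(q: str, correct: QueryFn, expand: QueryFn, n: int) -> List[str]:
    # Q1 and Q2 are both computed from the user's query; Q is their union.
    return _dedupe(list(correct(q)) + list(expand(q)), n)
```

Note that the union variant can keep uncorrected expansions, because expansion runs on the user's original query rather than on a corrected one.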
Performing query error correction on the source language query input by the user specifically comprises:
Looking up an error correction corpus with the source language query input by the user and judging whether the corpus contains an erroneous query identical to the source language query; if so, determining all correct queries corresponding to that erroneous query and selecting, from them, the n1 correct queries with the highest error correction probabilities to form the source language query set Q1; otherwise, Q1 contains only the source language query input by the user;
Wherein the error correction corpus comprises: query pairs, collected in advance from search logs, each consisting of an erroneous query and a corresponding correct query, and the error correction probability that the erroneous query is corrected to the corresponding correct query.
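The corpus lookup described above can be sketched as a dictionary lookup followed by a top-n1 selection. The corpus layout used here (erroneous query mapped to its correct queries and their error correction probabilities) is a hypothetical in-memory representation chosen for illustration.

```python
from typing import Dict, List

# Hypothetical layout: erroneous query -> {correct query: error correction probability}
CorrectionCorpus = Dict[str, Dict[str, float]]


def correct_query(query: str, corpus: CorrectionCorpus, n1: int) -> List[str]:
    """Build Q1 for the user's query."""
    candidates = corpus.get(query)
    if not candidates:
        # No matching erroneous query: Q1 contains only the user's query.
        return [query]
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:n1]  # the n1 correct queries with the highest probabilities
```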
Performing query expansion on the source language query input by the user specifically comprises:
Segmenting the source language query input by the user into words; determining the synonyms of each resulting word by looking up a paraphrase resource of the source language; combining the resulting words and their synonyms; and taking, from the queries obtained by combination, the n2 queries with the highest expansion scores to form Q2;
The expansion score of a query is determined by the number of occurrences of that query counted in the paraphrase resource.
In addition, step C further comprises:
Obtaining the search results corresponding to the source language query.
Where optimization processing is performed on the source language query, step C further comprises: selecting one query from the source language query set obtained after optimization processing, and obtaining the search results corresponding to the selected query.
When selecting one query from the source language query set obtained after optimization processing, the selection strategy used comprises:
Searching with the queries in the source language query set obtained after optimization processing one by one until a query whose search effect satisfies a preset requirement is found, and selecting that query; or,
Searching with every query in the source language query set obtained after optimization processing, and selecting the query with the best search effect.
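The two selection strategies can be contrasted in a short sketch. How "search effect" is measured is not defined in the text; here it is assumed to be a numeric quality returned by a hypothetical `search_quality` callable, with the preset requirement expressed as a threshold.

```python
from typing import Callable, Iterable, Optional


def first_satisfying(
    queries: Iterable[str],
    search_quality: Callable[[str], float],  # hypothetical quality measure
    threshold: float,
) -> Optional[str]:
    """Strategy 1: search one by one, stop at the first query that is good enough."""
    for q in queries:
        if search_quality(q) >= threshold:
            return q
    return None  # no query met the preset requirement


def best_overall(queries: Iterable[str], search_quality: Callable[[str], float]) -> str:
    """Strategy 2: search with every query and keep the best one."""
    return max(queries, key=search_quality)
```

The first strategy may issue fewer searches; the second always issues one search per query but returns the best query in the set.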
Specifically, the integrating in step D comprises: merging and deduplicating the search results obtained in step C.
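Merging and deduplication can be sketched as below. The text does not say what makes two results duplicates; this sketch assumes each result is identified by a string such as its URL.

```python
from typing import Iterable, List


def merge_and_dedupe(result_lists: Iterable[Iterable[str]]) -> List[str]:
    """Merge per-query result lists and drop repeats, keeping first occurrences."""
    seen = set()
    merged: List[str] = []
    for results in result_lists:
        for url in results:  # assumption: a result is identified by its URL
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```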
Sorting the search results according to the rank of each search result within its category and the ranking weight of that category specifically comprises:
Scoring each search result using its rank within its category and the ranking weight of that category, and sorting the search results in descending order of score;
Wherein the score, score(Rst), of a search result Rst is:
[Formula image BDA0000048313580000051: score(Rst) expressed in terms of w_i and rank_i(Rst)]
Where M is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst in the i-th category.
In particular, the category of a search result is the language of the search result;
The ranking weight of the i-th category is determined as follows:
S1. Extracting features of the source language query input by the user;
S2. Computing the similarity between the features extracted in step S1 and the feature vector of each language, and determining each language whose similarity exceeds a preset similarity threshold to be a mapping language of the source language query input by the user;
S3. For a search result Rst: if the category of Rst is a mapping language, the ranking weight w_i of that category is a first set value a; if the category of Rst is the source language and the source language is not a mapping language, the ranking weight w_i of that category is a second set value b; if the category of Rst is neither a mapping language nor the source language, the ranking weight w_i of that category is a third set value c;
Wherein a > b > c, and the feature vector of each language is mined and trained in advance from existing resources of that language.
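Steps S1 to S3 can be sketched as follows. Several points are assumptions of this sketch rather than statements of the text: features are represented as sparse term-weight dictionaries, cosine similarity is used as the similarity measure, and the default values of a, b and c are arbitrary numbers that merely satisfy a > b > c.

```python
import math
from typing import Dict, Set

Features = Dict[str, float]  # hypothetical sparse feature vector


def cosine(u: Features, v: Features) -> float:
    """Cosine similarity of two sparse vectors (assumed similarity measure)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def mapping_languages(query: Features, language_vectors: Dict[str, Features],
                      threshold: float) -> Set[str]:
    """S2: languages whose similarity to the query exceeds the preset threshold."""
    return {lang for lang, vec in language_vectors.items()
            if cosine(query, vec) > threshold}


def ranking_weight(result_lang: str, source_lang: str, mapped: Set[str],
                   a: float = 3.0, b: float = 2.0, c: float = 1.0) -> float:
    """S3: weight of a result's category (its language); requires a > b > c."""
    if result_lang in mapped:
        return a  # mapping language
    if result_lang == source_lang:
        return b  # source language that is not a mapping language
    return c      # neither mapping language nor source language
```

Checking the mapping languages first means a source language that is itself a mapping language receives a, which matches the condition stated for b.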
An apparatus for cross-language search, comprising: a user-side interaction unit, a translation processing unit, a search processing unit and a result integration unit;
The user-side interaction unit is configured to receive a source language search request (query) input by a user, and to provide the user with the search result set formed by the result integration unit;
The translation processing unit is configured to translate the source language query into target language queries in N target languages, N being an integer greater than 1;
The search processing unit is configured to obtain the search results corresponding to each of the N target language queries;
The result integration unit is configured to integrate the search results obtained by the search processing unit to form a final search result set; wherein, in the final search result set, the search results are sorted according to the rank of each search result within its category and the ranking weight of that category.
For each target language, the translation processing unit takes, as the target language query, the translation result with the highest translation score among the translation results of the source language query in that target language;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
Preferably, the apparatus further comprises an optimization processing unit;
The optimization processing unit is configured to perform optimization processing on the source language query input by the user and to provide the result to the translation processing unit, the optimization processing comprising either or both of query error correction and query expansion;
The translation processing unit translates the source language query optimized by the optimization processing unit into target language queries in N target languages.
If the optimization processing comprises only query error correction, the optimization processing unit performs query error correction on the source language query input by the user to obtain a source language query set Q1 containing n1 queries, n1 being a preset positive integer;
For each target language, the translation processing unit translates each query in Q1 separately and takes the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is
$\sum_{i=1}^{n1} P(e \mid q_i)$
Where P(e|q_i) is the translation score of q_i in Q1 being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
If the optimization processing comprises only query expansion, the optimization processing unit performs query expansion on the source language query input by the user to obtain a source language query set Q2 containing n2 queries, n2 being a preset positive integer;
For each target language, the translation processing unit translates each query in Q2 separately and takes the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is
$\sum_{i=1}^{n2} P(e \mid q_i)$
Where P(e|q_i) is the translation score of q_i in Q2 being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
If the optimization processing comprises both query error correction and query expansion, the optimization processing unit performs query error correction and query expansion on the source language query input by the user to obtain a source language query set Q containing n queries, n being a preset positive integer;
For each target language, the translation processing unit translates each query in Q separately and takes the translation result with the highest summed translation score as the target language query; wherein the summed translation score of a translation result e is $\sum_{i=1}^{n} P(e \mid q_i)$, where P(e|q_i) is the translation score of q_i in Q being translated into e;
The translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the collocation probability of the words in e.
Specifically, the optimization processing unit may comprise: a first error correction module and a first expansion module;
The first error correction module is configured to perform query error correction on the source language query input by the user to obtain a source language query set Q1 containing n1 queries, n1 being a preset positive integer;
The first expansion module is configured to perform query expansion on each query in Q1 to obtain a source language query set Q containing n queries.
Alternatively, the optimization processing unit specifically comprises: a second expansion module and a second error correction module;
The second expansion module is configured to perform query expansion on the source language query input by the user to obtain a source language query set Q2 containing n2 queries, n2 being a preset positive integer;
The second error correction module is configured to perform query error correction on each query in Q2 to obtain a source language query set Q containing n queries.
Alternatively, the optimization processing unit specifically comprises: a third error correction module, a third expansion module and a merging module;
The third error correction module is configured to perform query error correction on the source language query input by the user to obtain a source language query set Q1 containing n1 queries;
The third expansion module is configured to perform query expansion on the source language query input by the user to obtain a source language query set Q2 containing n2 queries;
The merging module is configured to take the union of Q1 and Q2 to obtain a source language query set Q containing n queries.
The optimization processing unit specifically looks up an error correction corpus with the source language query input by the user and judges whether the corpus contains an erroneous query identical to the source language query; if so, it determines all correct queries corresponding to that erroneous query and selects, from them, the n1 correct queries with the highest error correction probabilities to form the source language query set Q1; otherwise, Q1 contains only the source language query input by the user;
Wherein the error correction corpus comprises: query pairs, collected in advance from search logs, each consisting of an erroneous query and a corresponding correct query, and the error correction probability that the erroneous query is corrected to the corresponding correct query.
The optimization processing unit specifically segments the source language query input by the user into words; determines the synonyms of each resulting word by looking up a paraphrase resource of the source language; combines the resulting words and their synonyms; and takes, from the queries obtained by combination, the n2 queries with the highest expansion scores to form Q2;
The expansion score of a query is determined by the number of occurrences of that query counted in the paraphrase resource.
In addition, the search processing unit is further configured to obtain the search results corresponding to the source language query.
Where the apparatus comprises the optimization processing unit, it further comprises:
A source request selection unit, configured to select one query from the source language query set obtained after optimization processing by the optimization processing unit;
The search processing unit is further configured to obtain the search results corresponding to the query selected by the source request selection unit.
Wherein the selection strategy adopted by the source request selection unit comprises:
Searching with the queries in the source language query set obtained after optimization processing by the optimization processing unit one by one until a query whose search effect satisfies a preset requirement is found, and selecting that query; or,
Searching with every query in the source language query set obtained after optimization processing by the optimization processing unit, and selecting the query with the best search effect.
Specifically, the result integration unit comprises: a merging module, a deduplication module and a sorting module;
The merging module is configured to merge the search results obtained by the search processing unit;
The deduplication module is configured to deduplicate the search results merged by the merging module to obtain a search result set;
The sorting module is configured to sort the search results in the search result set according to the rank of each search result within its category and the ranking weight of that category.
The sorting module specifically scores each search result in the search result set using its rank within its category and the ranking weight of that category, and sorts the search results in descending order of score;
Wherein the score, score(Rst), of a search result Rst is:
[Formula image BDA0000048313580000091: score(Rst) expressed in terms of w_i and rank_i(Rst)]
Where M is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst in the i-th category.
The category of a search result is the language of the search result;
When determining the ranking weight of the i-th category, the sorting module specifically performs the following operations: extracting features of the source language query input by the user; computing the similarity between the extracted features and the feature vector of each language, and determining each language whose similarity exceeds a preset similarity threshold to be a mapping language of the source language query input by the user; for a search result Rst, if the category of Rst is a mapping language, the ranking weight w_i of that category is a first set value a; if the category of Rst is the source language and the source language is not a mapping language, the ranking weight w_i of that category is a second set value b; if the category of Rst is neither a mapping language nor the source language, the ranking weight w_i of that category is a third set value c;
Wherein a > b > c, and the feature vector of each language is mined and trained in advance from existing resources of that language.
As can be seen from the above technical scheme, the invention translates the source language query into queries in multiple target languages, obtains the search results corresponding to the N target language queries, integrates the search results into a final search result set and provides it to the user, wherein in the final search result set the search results are sorted according to the rank of each search result within its category and the ranking weight of that category. The invention thereby delivers search results containing documents in multiple languages and provides the user with better and more numerous search results.
[Description of drawings]
Fig. 1 is the method flow chart provided by Embodiment 1 of the invention;
Fig. 2 is the schematic diagram provided by Embodiment 2 of the invention;
Fig. 3 is the apparatus structure diagram provided by Embodiment 3 of the invention;
(a), (b) and (c) in Fig. 4 are three structure diagrams of the optimization processing unit provided by Embodiment 3 of the invention.
[Embodiments]
To make the objects, technical schemes and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
Embodiment 1
Fig. 1 is the method flow chart provided by Embodiment 1 of the invention. As shown in Fig. 1, the method may comprise the following steps:
Step 101: receive the source language query input by the user.
The method can be implemented at the server end of a search engine: after the user inputs the source language query, the browser sends it to the server end of the search engine.
Step 102: perform optimization processing on the source language query.
This step is optional. The optimization processing performed on the source language query may comprise either or both of query error correction and query expansion. The purpose of query error correction is to raise the likelihood that the source language query is translated correctly; the purpose of query expansion is to raise the recall of resources for the query.
Query error correction aims to correct errors in the query, mainly spelling errors, and can be implemented on the basis of a noisy channel model. Specifically: a large number of query pairs, each consisting of an erroneous query and a corresponding correct query, are collected in advance from search logs, and the probability that each erroneous query is corrected to the correct query, i.e. the error correction probability P(p'|p), is computed to form the error correction corpus. The error correction corpus is then queried with the source language query input by the user to judge whether it contains an erroneous query identical to the source language query. If so, all correct queries corresponding to that erroneous query are determined, and the n1 correct queries with the highest error correction probabilities are selected from them to form the source language query set Q1 obtained by error correction, n1 being a preset positive integer. If not, no error correction of the source language query is needed, and the resulting Q1 contains only the source language query input by the user.
Wherein $P(p' \mid p) = P(p \mid p')\,P(p') / P(p)$.
P(p'|p) is the error correction probability that the source language query p is corrected to p'; P(p|p') is the probability that p' is mistakenly written as p in the search logs; P(p') is the number of times p' occurs in the search logs; and P(p) is the number of times p occurs in the search logs.
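The formula above can be evaluated directly from search-log statistics. In this sketch the function and parameter names are illustrative; because P(p') and P(p) are both given as occurrence counts from the same logs, their ratio equals the ratio of the corresponding relative frequencies, so the raw counts are used as they are.

```python
def error_correction_probability(
    p_wrong_given_right: float,  # P(p | p'): probability that p' is mistyped as p
    count_right: int,            # occurrences of the correct query p' in the logs
    count_wrong: int,            # occurrences of the erroneous query p in the logs
) -> float:
    """P(p' | p) = P(p | p') * P(p') / P(p), using log counts for P(p') and P(p)."""
    if count_wrong <= 0:
        raise ValueError("the erroneous query must occur in the search logs")
    return p_wrong_given_right * count_right / count_wrong
```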
The purpose of query expansion is to generate synonymous queries, so as to avoid the translation difficulties caused by special terms or new words in the source language query input by the user. For example, if the source language query input by the user is "the beautiful figure of the n8 of Nokia", the term "beautiful figure" appears mostly in current Internet slang and seldom appears in a general translation model, which makes the subsequent translation of the source language query difficult; expanding it to its synonym "picture" reduces the translation difficulty.
When performing synonym expansion, the received source language query can be expanded into a query set Q2. Supposing the source language query is q, the expanded Q2 is {q_1, q_2, ..., q_n2}, where n2 is a preset positive integer and the expanded query set Q2 contains q and the synonymous queries expanded from q.
Particularly; Can at first source language query be carried out word segmentation processing; Through searching the repetition resource of source language, synonymicon for example, the synonym of each word of confirming to obtain after the word segmentation processing; After utilizing each word that obtains after the word segmentation processing and synonym thereof to make up, choose expansion score value among the query that combination obtains and constitute set Q2 at preceding n2 query.Wherein, the expansion score value of query can be confirmed by creating the statistics number of repeating query appearance in the resource.What need special instruction is; Repeat resource and be not limited to speech, also can be phrase, even be sentence; For example split and the resource that merges or obtain based on the repetition of reasoning based on the replacement of dictionary note, word order conversion, sentence structure conversion, sentence; As long as the things of describing is identical, the implication of expression is identical, can think to repeat resource.
For example, suppose n2 is 2 and the source language query input by the user is "Chinese cuisines". Word segmentation yields "China" and "cuisines"; the synonym of "China" is "China", and the synonym of "cuisines" is "delicacies". The queries obtained by combination are "Chinese cuisines", "Chinese delicacies", "Chinese cuisines" and "Chinese delicacies", and the two queries with the highest expansion scores are chosen to form Q2: {"Chinese cuisines", "Chinese cuisines"}.
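The combine-and-select step of the expansion can be sketched as follows. This is a hedged illustration only: word segmentation is assumed to have been done already, and the function name `expand_query`, the synonym table and the occurrence counts are hypothetical toy values rather than data from the invention.

```python
from itertools import product

def expand_query(words, synonyms, score, n2):
    """Combine each segmented word with its synonyms and keep the top-n2 queries.

    words    -- the source query after word segmentation
    synonyms -- word -> list of synonyms (stands in for the paraphrase resource)
    score    -- function giving the expansion score of a candidate query
    """
    options = [[w] + synonyms.get(w, []) for w in words]
    candidates = [" ".join(combo) for combo in product(*options)]
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:n2]

# Hypothetical synonym table and occurrence counts.
synonyms = {"pics": ["pictures"]}
occurrences = {"nokia n8 pictures": 40, "nokia n8 pics": 5}
q2 = expand_query(
    ["nokia", "n8", "pics"],
    synonyms,
    score=lambda q: occurrences.get(q, 0),
    n2=2,
)
```

Note that this sketch keeps the original query only if its score places it within the top n2; whether the original query is always retained is left open here.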
If the above optimization process includes both the query correction process and the query expansion process, the query correction process can be carried out first to obtain the source language query set Q1, and the query expansion process is then carried out on each query in Q1, finally yielding the set Q {q_1, q_2, ..., q_n}, where n is a preset positive integer. Alternatively, the query expansion process can be carried out first to obtain the source language query set Q2, and the query correction process is then carried out on each query in Q2, finally yielding the set Q {q_1, q_2, ..., q_n}. Alternatively, the query correction process and the query expansion process can be carried out at the same time, and the union of the resulting sets Q1 and Q2 is taken to obtain the set Q {q_1, q_2, ..., q_n}.
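The three orderings can be compared in a short sketch. The helper names, the toy `correct` and `expand` functions, and the truncation of the result to n queries are assumptions made for illustration; the text does not specify how the set is limited to n.

```python
def dedupe(items):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))

def correct_then_expand(q, correct, expand, n):
    return dedupe(e for c in correct(q) for e in expand(c))[:n]

def expand_then_correct(q, correct, expand, n):
    return dedupe(c for e in expand(q) for c in correct(e))[:n]

def correct_and_expand(q, correct, expand, n):
    # Both processes run on the original query; the results are united.
    return dedupe(correct(q) + expand(q))[:n]

# Hypothetical toy correction and expansion functions.
def correct(q):
    return {"teh cat": ["the cat"]}.get(q, [q])

def expand(q):
    return [q, q.replace("cat", "kitten")] if "cat" in q else [q]

a = correct_then_expand("teh cat", correct, expand, n=10)
b = expand_then_correct("teh cat", correct, expand, n=10)
c = correct_and_expand("teh cat", correct, expand, n=10)
```

In this toy case the three orderings yield different sets, which shows that the order of the two processes can affect which queries are passed on to translation.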
Step 103: Translate the optimized source language query into target language queries in N languages, where N is an integer greater than 1.
In the embodiment of the invention, N commonly used languages can be set as target languages in advance; for example, commonly used English, French, Japanese, German and so on are set as target languages in advance. Which target languages are set can be decided flexibly according to requirements, and the present invention places no restriction on this.
If the optimization process of step 102 is not performed on the source language query input by the user, the source language query input by the user is translated directly into target language queries in the N languages.
When the source language query is translated into the m-th target language, several different translation results may be obtained. In this case, the translation result with the highest translation score among them can be selected. That is, if the source language query is q, the translation result $\hat{e}_m$ for the m-th target language is:
$$\hat{e}_m = \arg\max_{e} P(e \mid q)$$
where P(e|q) denotes the translation score of q being translated into e.
The translation score of a translation result can be determined by at least one of the following two factors: the number of times the translation result occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result.
If the optimization process of step 102 has been performed on the source language query input by the user, the optimized source language query is translated into target language queries in the N languages. If an expanded query set Q has been obtained after the optimization process, then for each target language, each query in Q is translated into that target language, and the translation result with the highest sum of translation scores is determined as the translation result for that target language.
That is, for each target language, the translation result $\hat{e}$ is:
$$\hat{e} = \arg\max_{e} \sum_{i=1}^{n} P(e \mid q_i) \qquad (1)$$
where P(e|q_i) denotes the translation score of q_i, in the set Q expanded from q, being translated into e; if q_i cannot be translated into e, the corresponding translation score is zero. $\sum_{i=1}^{n} P(e \mid q_i)$ denotes the sum of the translation scores of all q_i in the set Q that can be translated into e. For example, suppose the source language query input by the user is "Chinese cuisines" and the set Q obtained after the optimization process is {"Chinese cuisines", "Chinese delicacies", "Chinese cuisines", "Chinese delicacies"}. When the target language is English, "Chinese cuisines", "Chinese cuisines" and "Chinese delicacies" are all translated into "Chinese cuisine"; if the sum of the translation scores of these three queries being translated into "Chinese cuisine" is the largest, "Chinese cuisine" is taken as the English translation result.
It should be noted that if only the query correction process is carried out during the optimization process, Q is the Q1 described in step 102 and n in formula (1) is n1; if only the query expansion process is carried out, Q is the Q2 described in step 102 and n in formula (1) is n2.
In addition, when the source language query input by the user, or each query in the set Q expanded from the source language query, is translated, the translation can be realized by matching against a translation model. Preferably, the translation model can include a translation model trained in advance on mined specific terms and new words.
Step 104: Obtain the search results corresponding to the target language queries in the N languages respectively.
Once the target language queries in the N languages have been obtained by translation, searches are carried out in the resource library of each corresponding target language. The search in the resource library of each target language can be a general search, that is, a search in the unstructured resource library of the corresponding target language, or a vertical search, that is, a search in the structured resource library of the corresponding target language.
In addition to translating the source language query into target language queries in the N languages and obtaining the corresponding search results, the search results of the source language query can be obtained at the same time. Specifically, if no optimization process has been performed on the source language query, the source language query input by the user is used directly to obtain search results. If the optimization process has been performed, one query is selected from the source language query set Q obtained after optimization, and the selected query is used to search the resource library of the source language to obtain the corresponding search results.
When a query is selected from the source language query set Q, the selection strategy adopted can include, but is not limited to, the following:
Searching with each query in the set Q one by one until a query whose search effect meets a preset requirement is found, and selecting that query as the query used for the source language search; or searching with every query in the set Q and selecting the query with the best search effect as the query used for the source language search.
The search effect can be embodied as the number of search results; or the number of search results whose relevance to the source language query meets a preset relevance requirement; or the number of search results published within a set time; or the number of search results whose source meets a preset source requirement; and so on.
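The two selection strategies can be sketched as below. The function names, the `search` and `effect` callables and the toy index are hypothetical; here the search effect is simply taken to be the number of search results, which is only one of the options listed above.

```python
def select_first_satisfying(queries, search, effect, threshold):
    """Search queries one by one; stop at the first whose effect meets the threshold."""
    for q in queries:
        if effect(search(q)) >= threshold:
            return q
    return None  # no query met the preset requirement

def select_best(queries, search, effect):
    """Search with every query and keep the one with the best search effect."""
    return max(queries, key=lambda q: effect(search(q)))

# Hypothetical toy index: query -> list of result identifiers.
index = {"a": [1], "b": [1, 2, 3], "c": [1, 2, 3, 4]}
search = lambda q: index.get(q, [])
first = select_first_satisfying(["a", "b", "c"], search, len, threshold=3)
best = select_best(["a", "b", "c"], search, len)
```

The first strategy can stop early and so issues fewer searches, while the second always searches with every query in Q in exchange for the best observed effect.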
Step 105: Integrate and rank the obtained search results to form a final search result set and provide it to the user; in the final search result set, the search results are ranked according to the rank of each search result within its category and the ranking weight of that category.
Because search results have been obtained from the resource libraries of the N target languages, and possibly also from the resource library of the source language, the obtained search results need to be integrated, where integration includes merging and deduplicating the search results.
When the integrated search results are ranked, each search result can be scored using the ranking weight of the category it belongs to and its rank within that category, and the search results are then ranked from high to low according to their scores.
Specifically, the score score(Rst) of a search result Rst is:
$$\mathrm{score}(Rst) = \frac{1}{m} \sum_{i=1}^{m} w_i \cdot \frac{1}{\mathrm{rank}_i(Rst)} \qquad (2)$$
where m is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst within the i-th category.
Usually, the category of a search result can be determined by its corresponding language, that is, by which language's search the result comes from. Because the search results obtained through different languages may overlap in some cases, formula (2) is in effect a weighted average. For example, if a search result is both a Chinese search result and an English search result, the reciprocal of its rank in the English search results is multiplied by the ranking weight of English, the reciprocal of its rank in the Chinese search results multiplied by the ranking weight of Chinese is added, and the sum is divided by the total number of search languages used to obtain the score of this search result.
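Merging, deduplication and ranking by formula (2) can be sketched together as follows. This is an illustrative sketch only: the function name `integrate`, the use of URLs as result identifiers, and the toy result lists and weights are assumptions, and a result missing from a language's list is treated as contributing zero for that language.

```python
def integrate(results_by_lang, weights):
    """Merge per-language ranked lists, deduplicate, and sort by formula (2).

    results_by_lang -- language -> list of result URLs, best first
    weights         -- language -> ranking weight w_i
    """
    ranks = {}  # url -> {language: rank within that language's results}
    for lang, urls in results_by_lang.items():
        for position, url in enumerate(urls, start=1):
            # setdefault keeps the best rank if a URL repeats within one list
            ranks.setdefault(url, {}).setdefault(lang, position)
    m = len(results_by_lang)  # total number of categories (languages searched)

    def score(url):
        return sum(weights[lang] / r for lang, r in ranks[url].items()) / m

    return sorted(ranks, key=score, reverse=True)

# Hypothetical toy data: "b" appears in both languages.
results_by_lang = {"zh": ["a", "b"], "en": ["b", "c"]}
weights = {"zh": 1.0, "en": 3.0}
final = integrate(results_by_lang, weights)
```

Here "b" benefits from appearing in both lists, which is the overlap case that formula (2) is designed to handle.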
In the embodiment of the invention, the existing resources of each language (including each target language and the source language) can be mined in advance to train a feature vector for each language. When the ranking weight of each language is determined, features of the source language query are extracted, the similarity between the extracted features and the feature vector of each language is calculated, and a language whose similarity exceeds a preset similarity threshold is determined to be the mapping language of the source language query.
For example, when classifying by language, mining the existing resources corresponding to Japanese yields a trained feature vector that includes: Tokyo, Japan, sumo, kimono, Junichiro Koizumi, cherry blossom, Takuya Kimura, ....
When the source language query input by the user is "which famous films does Takuya Kimura have", the features extracted from the source language query are "Takuya Kimura film". The similarity between the extracted features and the feature vector of each language is calculated; since the similarity with the feature vector of the Japanese category exceeds the preset similarity threshold, the mapping language of this source language query is determined to be Japanese.
If the category of a search result is the mapping language, the ranking weight w_i of that category is a first set value a. If the category of a search result is the source language and the source language is not the mapping language, the ranking weight w_i of that category is a second set value b. If the category of a search result is neither the mapping language nor the source language, the ranking weight w_i of that category is a third set value c. Here a > b > c; for example, b = 1, a > 1 and c < 1.
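The weight assignment can be sketched as follows. The similarity measure used here (the fraction of query features found in a language's feature list), the function names, the threshold and the toy feature lists are all assumptions for illustration; the text does not fix a particular similarity calculation. The default values a = 2, b = 1 and c = 0.5 merely satisfy a > b > c.

```python
def mapping_languages(query_features, language_features, threshold):
    """Return the languages whose similarity to the query exceeds the threshold."""
    feats = set(query_features)
    mapped = set()
    for lang, vocab in language_features.items():
        # Assumed similarity: share of query features present in the language's features.
        overlap = len(feats & set(vocab)) / len(feats) if feats else 0.0
        if overlap > threshold:
            mapped.add(lang)
    return mapped

def ranking_weight(lang, source_lang, mapped, a=2.0, b=1.0, c=0.5):
    """First set value a for a mapping language, b for the source language, c otherwise."""
    if lang in mapped:
        return a
    if lang == source_lang:
        return b
    return c

# Hypothetical feature lists per language.
language_features = {
    "ja": ["Tokyo", "Japan", "sumo", "kimono"],
    "en": ["London", "football"],
}
mapped = mapping_languages(["sumo", "film"], language_features, threshold=0.4)
weights = {lang: ranking_weight(lang, "zh", mapped) for lang in ["zh", "en", "ja"]}
```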
In addition, the rank of a search result within its category can be determined by one or any combination of the following: the relevance between the search result and the query used for the search, the weight of the website from which the search result comes, the publication time of the search result, and so on.
This concludes the flow shown in Embodiment one. The above method is described below through a concrete example in Embodiment two.
Embodiment two,
The schematic diagram corresponding to Embodiment two is shown in Fig. 2. Suppose the user inputs the Chinese query "little Bei Meitu".
The optimization process is performed on this query. The correction process determines that the query is correct and needs no correction. The synonym dictionary is then looked up: the word "Xiao Bei" can be expanded to "Beckham", and the word "beautiful figure" can be expanded to "picture". The set Q finally obtained after expansion is: {"Xiao Bei picture", "Beckham's picture", "little Bei Meitu", "the beautiful figure of Beckham"}.
Suppose the server side of the search engine has set three target languages in advance: English, Japanese and French. A search also needs to be carried out for the source language (Chinese), so four languages are ultimately used for searching. When translating into English, "Xiao Bei picture" and "little Bei Meitu" can each be translated into "Beckham picture" or "Cockles picture": the translation scores for translating "Xiao Bei picture" and "little Bei Meitu" into "Beckham picture" are 8 and 6 respectively, and the translation scores for translating them into "Cockles picture" are 1 and 2 respectively. "Beckham's picture" and "the beautiful figure of Beckham" can both be translated into "Beckham picture", with translation scores of 9 and 7 respectively.
The sum of translation scores calculated for "Beckham picture" is 8+6+9+7=30, and the sum of translation scores calculated for "Cockles picture" is 1+2=3.
The English translation result finally chosen is therefore "Beckham picture".
The translation results for the other target languages are obtained similarly and are not described again.
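The selection of the English translation result above follows formula (1) and can be reproduced with a short sketch. The translation scores are the ones given in this embodiment; the function name `select_translation` and the dictionary layout are hypothetical.

```python
from collections import defaultdict

def select_translation(scores_per_query):
    """Sum the translation scores per candidate over all queries in Q and pick the best.

    scores_per_query -- source query -> {candidate translation: translation score};
    a candidate missing for a query contributes zero, as in formula (1).
    """
    totals = defaultdict(float)
    for scores in scores_per_query.values():
        for translation, score in scores.items():
            totals[translation] += score
    best = max(totals, key=totals.get)
    return best, dict(totals)

# Translation scores into English from Embodiment two.
scores_per_query = {
    "Xiao Bei picture": {"Beckham picture": 8, "Cockles picture": 1},
    "little Bei Meitu": {"Beckham picture": 6, "Cockles picture": 2},
    "Beckham's picture": {"Beckham picture": 9},
    "the beautiful figure of Beckham": {"Beckham picture": 7},
}
best, totals = select_translation(scores_per_query)
```

The totals come out as 30 for "Beckham picture" and 3 for "Cockles picture", matching the sums stated above.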
Then, the translation result for each target language is used as the target language query to search the resource library of the corresponding target language, and the respective search results are obtained.
For the source language, the query with the best search effect, "Beckham's picture", is selected from the set Q, and the search results of the selected query are obtained.
The search results corresponding to Chinese, English, Japanese and French are integrated.
The integrated search results are ranked according to formula (2). After feature extraction and similarity matching are performed on the source language query "little Bei Meitu", the mapping language is determined to be English; the ranking weight of English is therefore set to 2, the ranking weight of Chinese to 1, and the ranking weights of Japanese and French to 0.5.
For one of the search results, Page1, suppose Page1 appears in the Chinese, English and Japanese search results, with rank 2 in the Chinese search results, rank 5 in the English search results and rank 8 in the Japanese search results. The score score(Page1) of Page1 is then:
$$\mathrm{score}(\mathrm{Page1}) = \frac{1}{4}\left(1 \times \frac{1}{2} + 2 \times \frac{1}{5} + 0.5 \times \frac{1}{8}\right) \approx 0.24$$
Embodiment three,
Fig. 3 is a structural diagram of the device provided by the embodiment of the invention. As shown in Fig. 3, the device can include: a user-side interaction unit 300, a translation processing unit 310, a search processing unit 320 and a result integration unit 330.
The user-side interaction unit 300 is used to receive the source language search request (query) input by the user, and to provide the user with the search result set formed after integration by the result integration unit 330.
The translation processing unit 310 is used to translate the source language query into target language queries in N languages, where N is an integer greater than 1.
The search processing unit 320 is used to obtain the search results corresponding to the target language queries in the N languages respectively.
The result integration unit 330 is used to integrate the search results obtained by the search processing unit 320 to form the final search result set; in the final ranking, the search results are ranked according to the rank of each search result within its category and the ranking weight of that category.
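How the four units cooperate can be sketched as a small class. This is a loose illustration, not the structure of the claimed device: the class name, the callable signatures and the toy stand-in functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CrossLanguageSearchDevice:
    # Stand-in for unit 310: source query -> {target language: translated query}
    translate: Callable[[str], Dict[str, str]]
    # Stand-in for unit 320: (language, query) -> ranked search results
    search: Callable[[str, str], List[str]]
    # Stand-in for unit 330: per-language results -> final ranked result set
    integrate: Callable[[Dict[str, List[str]]], List[str]]

    def handle(self, source_query: str) -> List[str]:
        """Stand-in for unit 300: receive the query and return the final result set."""
        target_queries = self.translate(source_query)
        results = {
            lang: self.search(lang, q) for lang, q in target_queries.items()
        }
        return self.integrate(results)

# Toy stand-ins that only demonstrate the data flow between units.
device = CrossLanguageSearchDevice(
    translate=lambda q: {"en": q.upper(), "fr": q[::-1]},
    search=lambda lang, q: [f"{lang}:{q}"],
    integrate=lambda r: sorted(x for v in r.values() for x in v),
)
output = device.handle("ab")
```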
Specifically, for each target language, the translation processing unit 310 can take, from among the translation results of the source language query in that target language, the translation result with the highest translation score as the target language query. The translation score of a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
Preferably, the device can further include an optimization processing unit 340.
The optimization processing unit 340 is used to perform optimization processing on the source language query input by the user and provide the result to the translation processing unit 310, where the optimization processing can include either or both of the query correction process and the query expansion process.
The translation processing unit 310 translates the source language query optimized by the optimization processing unit 340 into target language queries in N languages.
If the device includes the optimization processing unit 340, the following three cases can exist depending on the form of the optimization processing:
First case: if the optimization processing includes only the query correction process, the optimization processing unit 340 performs the query correction process on the source language query input by the user to obtain a source language query set Q1 containing n1 queries, where n1 is a preset positive integer.
In this case, for each target language, the translation processing unit 310 translates each query in Q1 and determines the translation result with the highest sum of translation scores as the target language query, where the sum of translation scores of a translation result e is
$$\sum_{i=1}^{n1} P(e \mid q_i)$$
and P(e|q_i) is the translation score of q_i in Q1 being translated into e.
The translation score corresponding to a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
Second case: if the optimization processing includes only the query expansion process, the optimization processing unit 340 performs the query expansion process on the source language query input by the user to obtain a source language query set Q2 containing n2 queries, where n2 is a preset positive integer.
For each target language, the translation processing unit 310 translates each query in Q2 and determines the translation result with the highest sum of translation scores as the target language query, where the sum of translation scores of a translation result e is $\sum_{i=1}^{n2} P(e \mid q_i)$, and P(e|q_i) is the translation score of q_i in Q2 being translated into e.
The translation score corresponding to a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
Third case: if the optimization processing includes both the query correction process and the query expansion process, the optimization processing unit 340 performs the query correction process and the query expansion process on the source language query input by the user to obtain a source language query set Q containing n queries, where n is a preset positive integer.
For each target language, the translation processing unit 310 translates each query in Q and determines the translation result with the highest sum of translation scores as the target language query, where the sum of translation scores of a translation result e is $\sum_{i=1}^{n} P(e \mid q_i)$, and P(e|q_i) is the translation score of q_i in Q being translated into e.
The translation score of a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
In the third case, the optimization processing unit 340 can have three structures:
First structure: as shown in (a) of Fig. 4, the optimization processing unit 340 can specifically include a first correction module 401 and a first expansion module 402.
The first correction module 401 is used to perform the query correction process on the source language query input by the user to obtain a source language query set Q1 containing n1 queries, where n1 is a preset positive integer.
The first expansion module 402 is used to perform the query expansion process on each query in Q1 to obtain a source language query set Q containing n queries.
Second structure: as shown in (b) of Fig. 4, the optimization processing unit 340 can specifically include a second expansion module 411 and a second correction module 412.
The second expansion module 411 is used to perform the query expansion process on the source language query input by the user to obtain a source language query set Q2 containing n2 queries, where n2 is a preset positive integer.
The second correction module 412 is used to perform the query correction process on each query in Q2 to obtain a source language query set Q containing n queries.
Third structure: as shown in (c) of Fig. 4, the optimization processing unit 340 can specifically include a third correction module 421, a third expansion module 422 and a merge processing module 423.
The third correction module 421 is used to perform the query correction process on the source language query input by the user to obtain a source language query set Q1 containing n1 queries.
The third expansion module 422 is used to perform the query expansion process on the source language query input by the user to obtain a source language query set Q2 containing n2 queries.
The merge processing module 423 is used to take the union of Q1 and Q2 to obtain a source language query set Q containing n queries.
Specifically, the query correction process carried out by the optimization processing unit 340 shown in Fig. 3 can be as follows: the source language query input by the user is looked up in the error correction corpus to judge whether the corpus contains a wrong query identical to the source language query input by the user. If so, all correct queries corresponding to that wrong query are determined, and the n1 correct queries with the highest error correction probabilities are selected from them to form the source language query set Q1; otherwise, Q1 contains only the source language query input by the user.
The error correction corpus includes: query pairs collected in advance from the search log, each consisting of a wrong query and its corresponding correct query, together with the error correction probability that the wrong query is corrected to the corresponding correct query.
The query expansion process carried out by the optimization processing unit 340 can be as follows: word segmentation is performed on the source language query input by the user; the synonyms of each word obtained after segmentation are determined by looking up a paraphrase resource of the source language, for example a synonym dictionary; each word obtained after segmentation is combined with its synonyms; and the n2 queries with the highest expansion scores among the queries obtained by combination are taken to form Q2.
The expansion score of a query is determined by counting the number of times the query occurs in the paraphrase resource. It should be noted that paraphrase resources are not limited to words; they can also be phrases or even sentences, for example resources obtained through replacement based on dictionary definitions, word order changes, syntactic transformations, sentence splitting and merging, or inference-based paraphrasing. As long as the thing described and the meaning expressed are the same, a resource can be regarded as a paraphrase resource.
In addition, in order to obtain source language search results at the same time, the search processing unit 320 can also be used to obtain the search results corresponding to the source language query.
If the optimization processing has been performed on the source language query, the device can further include a source request selection unit 350, which is used to select one query from the source language query set obtained after optimization by the optimization processing unit 340.
In this case, when obtaining search results for the source language, the search processing unit 320 specifically obtains the search results corresponding to the query selected by the source request selection unit 350.
The selection strategy adopted by the source request selection unit 350 can include, but is not limited to: searching with each query in the source language query set obtained after optimization by the optimization processing unit 340 one by one until a query whose search effect meets a preset requirement is found, and selecting that query; or searching with every query in that source language query set and selecting the query with the best search effect.
In addition, the result integration unit 330 includes: a merge processing module 331, a deduplication processing module 332 and a ranking processing module 333.
The merge processing module 331 is used to merge the search results obtained by the search processing unit 320.
The deduplication processing module 332 is used to deduplicate the search results merged by the merge processing module 331 to obtain a search result set.
The ranking processing module 333 is used to rank the search results in the above search result set according to the rank of each search result within its category and the ranking weight of that category.
Specifically, the ranking processing module 333 scores each search result in the search result set using its rank within its category and the ranking weight of that category, and ranks the search results from high to low according to their scores.
The score score(Rst) of a search result Rst is:
$$\mathrm{score}(Rst) = \frac{1}{m} \sum_{i=1}^{m} w_i \cdot \frac{1}{\mathrm{rank}_i(Rst)}$$
where m is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst within the i-th category.
The category of a search result can be the language corresponding to the search result; this language can be a target language and can also be the source language.
When determining the ranking weight of the i-th category, the ranking processing module 333 specifically performs the following operations: extracting features of the source language query input by the user; calculating the similarity between the extracted features and the feature vector of each language, and determining a language whose similarity exceeds a preset similarity threshold to be the mapping language of the source language query input by the user; for a search result Rst, if the category of Rst is the mapping language, the ranking weight w_i of that category is a first set value a; if the category of Rst is the source language and the source language is not the mapping language, the ranking weight w_i of that category is a second set value b; if the category of Rst is neither the mapping language nor the source language, the ranking weight w_i of that category is a third set value c. Here a > b > c, and the feature vector of each language is trained in advance by mining the existing resources of that language.
The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (32)

1. A cross-language search method, characterized in that the method comprises:
A. receiving a source language search request (query) input by a user;
B. translating said source language query into target language queries in N languages, N being an integer greater than 1;
C. obtaining the search results corresponding to said target language queries in the N languages respectively;
D. integrating the search results obtained in step C to form a final search result set and providing it to the user;
wherein, in said final search result set, the search results are ranked according to the rank of each search result within its category and the ranking weight of that category.
2. The method according to claim 1, characterized in that, in step B, for each target language, the translation result with the highest translation score among the translation results of said source language query in that target language is taken as the target language query;
the translation score of a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
3. The method according to claim 1, characterized in that said step B specifically comprises:
B1. performing optimization processing on said source language query, said optimization processing comprising either or both of a query correction process and a query expansion process;
B2. translating the optimized source language query into target language queries in N languages.
4. The method according to claim 3, characterized in that, if said optimization processing comprises only the query correction process, the query correction process is performed on the source language query input by said user to obtain a source language query set Q1 containing n1 queries, n1 being a preset positive integer;
said step B2 is specifically: for each target language, translating each query in said Q1 and determining the translation result with the highest sum of translation scores as the target language query; wherein the sum of translation scores of a translation result e is
$$\sum_{i=1}^{n1} P(e \mid q_i)$$
where P(e|q_i) is the translation score of q_i in said Q1 being translated into e;
the translation score corresponding to a translation result e is determined by at least one of the following factors: the number of times the translation result e occurs in the translation corpus used for translation, and the collocation probability of the words in the translation result e.
5. method according to claim 3; It is characterized in that; If said optimization process only comprises the query extension process, then the source language query of said user input is carried out obtaining comprising after the query extension process source language query set Q2 of n2 query, the positive integer of n2 for presetting;
Said step B2 is specially: to each target language, utilize each query among the said Q2 to translate respectively, confirm that the highest translation result of translation score value summation is as target language query; Wherein, the translation score value summation of translation result does
Figure FDA0000048313570000021
P (e|q i) be q among the said Q2 iBe translated into the translation score value of e;
The corresponding translation score value of translation result e is confirmed by at least a in the following factor: translate each contamination probability among statistics number and the translation result e of translation result e in the employed translated corpora.
6. method according to claim 3; It is characterized in that; If said optimization process not only comprises the query correction process but also comprise the query extension process; Then the source language query of said user input is carried out obtaining comprising after query correction process and the query extension process source language query set Q of n query, n is the positive integer of presetting;
Said step B2 is specially: to each target language, utilize each query among the said Q to translate respectively, confirm that the highest translation result of translation score value summation is as target language query; Wherein, the translation score value summation of translation result does
Figure FDA0000048313570000022
P (e|q i) be q among the Q iBe translated into the translation score value of e;
The translation score value of translation result e is confirmed by at least a in the following factor: translate each contamination probability among statistics number and the translation result e of translation result e in the employed translated corpora.
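The selection rule shared by claims 4 to 6 (sum the translation scores of each candidate over all queries in the set, then keep the candidate with the highest total) can be sketched as below. The nested-dictionary input layout is an assumption made for illustration; how the individual scores P(e|q_i) are produced is outside this sketch.

```python
from collections import defaultdict
from typing import Dict


def best_translation(scores_per_query: Dict[str, Dict[str, float]]) -> str:
    """Pick the translation e with the largest sum of P(e | q_i) over the query set.

    scores_per_query[q_i][e] is assumed to hold P(e | q_i) for one target language.
    """
    totals: Dict[str, float] = defaultdict(float)
    for candidate_scores in scores_per_query.values():
        for translation, score in candidate_scores.items():
            totals[translation] += score  # accumulate the total score of this candidate
    return max(totals, key=totals.get)
```

A candidate that is a moderately good translation of several queries in the set can therefore beat one that is the top translation of a single query only.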
7. The method according to claim 6, characterized in that performing query error correction and query expansion on the source-language query input by the user to obtain the source-language query set Q containing n queries specifically comprises:
performing query error correction on the source-language query input by the user to obtain a source-language query set Q1 containing n1 queries, n1 being a preset positive integer, and performing query expansion on each query in Q1 to obtain the source-language query set Q containing n queries; or
performing query expansion on the source-language query input by the user to obtain a source-language query set Q2 containing n2 queries, n2 being a preset positive integer, and performing query error correction on each query in Q2 to obtain the source-language query set Q containing n queries; or
performing query error correction and query expansion on the source-language query input by the user in parallel to obtain, respectively, a source-language query set Q1 containing n1 queries and a source-language query set Q2 containing n2 queries, and taking the union of Q1 and Q2 to obtain the source-language query set Q containing n queries.
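The three alternatives of claim 7 differ only in how the correction and expansion steps are composed. A minimal sketch follows; the `correct` and `expand` callables are hypothetical, and truncating the combined set to n entries is an assumption, since the claim only states that Q contains n queries.

```python
from typing import Callable, List

Step = Callable[[str], List[str]]  # hypothetical: one query in, a list of queries out


def _dedupe(queries: List[str]) -> List[str]:
    # Preserve first-seen order while dropping repeated queries.
    return list(dict.fromkeys(queries))


def correct_then_expand(query: str, correct: Step, expand: Step, n: int) -> List[str]:
    q1 = correct(query)                                      # set Q1
    return _dedupe([q for c in q1 for q in expand(c)])[:n]   # expand each member of Q1


def expand_then_correct(query: str, correct: Step, expand: Step, n: int) -> List[str]:
    q2 = expand(query)                                       # set Q2
    return _dedupe([q for x in q2 for q in correct(x)])[:n]  # correct each member of Q2


def correct_and_expand_union(query: str, correct: Step, expand: Step, n: int) -> List[str]:
    return _dedupe(correct(query) + expand(query))[:n]       # union of Q1 and Q2
```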
8. The method according to claim 3, 4 or 7, characterized in that performing query error correction on the source-language query input by the user specifically comprises:
looking up an error-correction corpus using the source-language query input by the user, and determining whether the corpus contains an erroneous query identical to that source-language query; if so, determining all correct queries corresponding to that erroneous query, and selecting from them the n1 correct queries with the highest error-correction probabilities to form the source-language query set Q1; otherwise, Q1 contains only the source-language query input by the user;
wherein the error-correction corpus comprises: query pairs collected in advance from search logs, each consisting of an erroneous query and a corresponding correct query, and the error-correction probability that the erroneous query is corrected to the corresponding correct query.
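The lookup described in claim 8 can be sketched as a dictionary query followed by a top-n1 selection. The in-memory layout of the error-correction corpus (erroneous query mapped to correct queries and their probabilities) is an assumption for illustration.

```python
from typing import Dict, List


def correct_query(
    query: str,
    correction_corpus: Dict[str, Dict[str, float]],  # assumed layout: wrong -> {correct: probability}
    n1: int,
) -> List[str]:
    """Return the set Q1: the n1 most probable corrections, or the query itself."""
    candidates = correction_corpus.get(query)
    if not candidates:
        # The query is not a known erroneous query, so Q1 holds only the input.
        return [query]
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return ranked[:n1]
```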
9. The method according to claim 3, 5 or 7, characterized in that performing query expansion on the source-language query input by the user specifically comprises:
performing word segmentation on the source-language query input by the user; determining the synonyms of each word obtained by segmentation by looking up a paraphrase resource of the source language; combining the segmented words and their synonyms; and taking, from the queries obtained by combination, the n2 queries with the highest expansion scores to form Q2;
the expansion score of a query is determined by the counted number of occurrences of that query in the paraphrase resource.
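The expansion step of claim 9 amounts to taking the Cartesian product of each word's alternatives and keeping the n2 best combinations. The sketch below assumes the query has already been segmented, that synonyms come from a plain dictionary, and that words are joined with spaces; `expansion_score` is a hypothetical callable standing in for the count-based score.

```python
from itertools import product
from typing import Callable, Dict, List


def expand_query(
    words: List[str],                        # output of word segmentation (assumed done upstream)
    synonyms: Dict[str, List[str]],          # assumed view of the paraphrase resource
    expansion_score: Callable[[str], float], # hypothetical: query -> expansion score
    n2: int,
) -> List[str]:
    """Return the set Q2: the n2 highest-scoring word/synonym combinations."""
    alternatives = [[word] + synonyms.get(word, []) for word in words]
    candidates = [" ".join(combo) for combo in product(*alternatives)]
    candidates.sort(key=expansion_score, reverse=True)
    return candidates[:n2]
```

Because every word keeps itself as its first alternative, the original query is always among the candidates.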
10. The method according to any one of claims 1 to 7, characterized in that step C further comprises:
obtaining the search results corresponding to the source-language query.
11. The method according to claim 4, 5, 6 or 7, characterized in that step C further comprises: selecting one query from the source-language query set obtained after optimization processing, and obtaining the search results corresponding to the selected query.
12. The method according to claim 11, characterized in that the selection strategy used when selecting one query from the source-language query set obtained after optimization processing comprises:
searching with the queries in the source-language query set obtained after optimization processing one by one until a query whose search effect meets a preset requirement is found, and selecting that query; or
searching with every query in the source-language query set obtained after optimization processing, and selecting the query with the best search effect.
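The two selection strategies of claim 12 trade search cost against result quality. In the sketch below, `search_quality` is a hypothetical callable that runs a search and returns a numeric measure of its effect, and the "preset requirement" is modeled as a simple threshold; both are assumptions.

```python
from typing import Callable, List, Optional


def select_first_satisfactory(
    queries: List[str],
    search_quality: Callable[[str], float],  # hypothetical: query -> search-effect measure
    threshold: float,                        # assumed form of the preset requirement
) -> Optional[str]:
    """Search one by one and stop at the first query that meets the requirement."""
    for query in queries:
        if search_quality(query) >= threshold:
            return query
    return None  # no query met the requirement


def select_best(queries: List[str], search_quality: Callable[[str], float]) -> str:
    """Search with every query and keep the one with the best search effect."""
    return max(queries, key=search_quality)
```

The first strategy may issue fewer searches, while the second always evaluates the whole set.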
13. The method according to claim 1, characterized in that the integrating in step D comprises: merging and deduplicating the search results obtained in step C.
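Merging and deduplication as in claim 13 can be sketched as an order-preserving union. Treating two results as duplicates when their identifiers (for example URLs) are equal is an assumption; the claim does not define the duplicate test.

```python
from typing import Iterable, List


def merge_and_dedupe(result_lists: Iterable[List[str]]) -> List[str]:
    """Concatenate per-query result lists, keeping the first copy of each result."""
    seen = set()
    merged: List[str] = []
    for results in result_lists:
        for result_id in results:
            if result_id not in seen:  # skip results already contributed by another query
                seen.add(result_id)
                merged.append(result_id)
    return merged
```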
14. The method according to claim 1, characterized in that sorting the search results according to the rank of each search result within the category to which it belongs and the ranking weight of that category specifically comprises:
scoring each search result using its rank within the category to which it belongs and the ranking weight of that category, and sorting the search results from the highest score to the lowest;
wherein the score score(Rst) of a search result Rst is:
[formula image FDA0000048313570000041]
where M is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst within the i-th category.
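The scoring formula of claim 14 survives only as an image reference, so the sketch below uses an assumed form, score(Rst) = sum over categories of w_i / rank_i(Rst), purely to illustrate how category weights and within-category ranks could be combined. Results absent from a category are assumed to contribute nothing for that category. The actual formula of the patent may differ.

```python
from typing import Dict, List


def score(
    result: str,
    ranks_by_category: Dict[str, Dict[str, int]],  # category -> {result: rank, starting at 1}
    weights: Dict[str, float],                     # category -> ranking weight w_i
) -> float:
    """Assumed scoring rule: sum of w_i / rank_i(result) over categories containing it."""
    total = 0.0
    for category, ranking in ranks_by_category.items():
        rank = ranking.get(result)
        if rank is not None:
            total += weights[category] / rank
    return total


def rank_results(
    results: List[str],
    ranks_by_category: Dict[str, Dict[str, int]],
    weights: Dict[str, float],
) -> List[str]:
    """Sort results from the highest score to the lowest."""
    return sorted(results, key=lambda r: score(r, ranks_by_category, weights), reverse=True)
```

Under this assumed rule, a top-ranked result in a low-weight category can still outrank a lower-ranked result in a high-weight category.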
15. The method according to claim 14, characterized in that the category to which a search result belongs is the language of the search result;
the ranking weight of the i-th category is determined as follows:
S1. extracting features of the source-language query input by the user;
S2. computing the similarity between the features extracted in step S1 and the feature vector of each language, and determining each language whose similarity exceeds a preset similarity threshold to be a mapping language of the source-language query input by the user;
S3. for a search result Rst: if the category of Rst is a mapping language, the ranking weight w_i of that category is a first set value a; if the category of Rst is the source language and the source language is not a mapping language, the ranking weight w_i of that category is a second set value b; if the category of Rst is neither a mapping language nor the source language, the ranking weight w_i of that category is a third set value c;
wherein a > b > c, and the feature vector of each language is trained in advance by mining existing resources of that language.
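Steps S2 and S3 of claim 15 can be sketched as follows. Cosine similarity is used here as one possible similarity measure (the claim does not name one), and the default values of a, b and c are illustrative only, chosen to satisfy a > b > c.

```python
import math
from typing import Dict, List, Set


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mapping_languages(
    query_features: List[float],
    language_vectors: Dict[str, List[float]],  # pre-trained per-language feature vectors
    threshold: float,
) -> Set[str]:
    """Step S2: languages whose similarity to the query exceeds the threshold."""
    return {
        language
        for language, vector in language_vectors.items()
        if cosine(query_features, vector) > threshold
    }


def ranking_weight(
    category_language: str,
    source_language: str,
    mapped: Set[str],
    a: float = 3.0,  # illustrative values with a > b > c
    b: float = 2.0,
    c: float = 1.0,
) -> float:
    """Step S3: weight of the category (language) a search result belongs to."""
    if category_language in mapped:
        return a
    if category_language == source_language:
        return b  # source language that is not a mapping language
    return c
```

For example, a Chinese query about a Japanese topic could map to Japanese, so that Japanese results receive weight a, Chinese results weight b, and all other languages weight c.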
16. A cross-language search device, characterized in that the device comprises: a user-side interaction unit, a translation processing unit, a search processing unit, and a result integration unit;
the user-side interaction unit is configured to receive a source-language search request (query) input by a user, and to provide the user with the search result set formed by the result integration unit;
the translation processing unit is configured to translate the source-language query into target-language queries in N target languages, N being an integer greater than 1;
the search processing unit is configured to obtain the search results corresponding to each of the N target-language queries;
the result integration unit is configured to integrate the search results obtained by the search processing unit to form a final search result set; wherein, in the final search result set, the search results are sorted according to the rank of each search result within the category to which it belongs and the ranking weight of that category.
17. The device according to claim 16, characterized in that the translation processing unit, for each target language, takes the translation result with the highest translation score among the translation results of the source-language query in that target language as the target-language query;
the translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the combination probability of the words in e.
18. The device according to claim 16, characterized in that the device further comprises: an optimization processing unit;
the optimization processing unit is configured to perform optimization processing on the source-language query input by the user and provide the result to the translation processing unit, the optimization processing comprising query error correction, query expansion, or a combination of the two;
the translation processing unit translates the source-language query optimized by the optimization processing unit into target-language queries in N target languages.
19. The device according to claim 18, characterized in that, if the optimization processing comprises only query error correction, the optimization processing unit performs query error correction on the source-language query input by the user to obtain a source-language query set Q1 containing n1 queries, n1 being a preset positive integer;
the translation processing unit, for each target language, translates each query in Q1 separately and takes the translation result with the highest total translation score as the target-language query; wherein the total translation score of a translation result e is
$\sum_{i=1}^{n1} P(e \mid q_i)$
where $P(e \mid q_i)$ is the translation score for translating $q_i$ in Q1 into e;
the translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the combination probability of the words in e.
20. The device according to claim 18, characterized in that, if the optimization processing comprises only query expansion, the optimization processing unit performs query expansion on the source-language query input by the user to obtain a source-language query set Q2 containing n2 queries, n2 being a preset positive integer;
the translation processing unit, for each target language, translates each query in Q2 separately and takes the translation result with the highest total translation score as the target-language query; wherein the total translation score of a translation result e is
$\sum_{i=1}^{n2} P(e \mid q_i)$
where $P(e \mid q_i)$ is the translation score for translating $q_i$ in Q2 into e;
the translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the combination probability of the words in e.
21. The device according to claim 18, characterized in that, if the optimization processing comprises both query error correction and query expansion, the optimization processing unit performs query error correction and query expansion on the source-language query input by the user to obtain a source-language query set Q containing n queries, n being a preset positive integer;
the translation processing unit, for each target language, translates each query in Q separately and takes the translation result with the highest total translation score as the target-language query; wherein the total translation score of a translation result e is
$\sum_{i=1}^{n} P(e \mid q_i)$
where $P(e \mid q_i)$ is the translation score for translating $q_i$ in Q into e;
the translation score of a translation result e is determined by at least one of the following factors: the number of occurrences of e in the translation corpus used for translation, and the combination probability of the words in e.
22. The device according to claim 21, characterized in that the optimization processing unit specifically comprises: a first error-correction module and a first expansion module;
the first error-correction module is configured to perform query error correction on the source-language query input by the user to obtain a source-language query set Q1 containing n1 queries, n1 being a preset positive integer;
the first expansion module is configured to perform query expansion on each query in Q1 to obtain the source-language query set Q containing n queries.
23. The device according to claim 21, characterized in that the optimization processing unit specifically comprises: a second expansion module and a second error-correction module;
the second expansion module is configured to perform query expansion on the source-language query input by the user to obtain a source-language query set Q2 containing n2 queries, n2 being a preset positive integer;
the second error-correction module is configured to perform query error correction on each query in Q2 to obtain the source-language query set Q containing n queries.
24. The device according to claim 21, characterized in that the optimization processing unit specifically comprises: a third error-correction module, a third expansion module, and a merging processing module;
the third error-correction module is configured to perform query error correction on the source-language query input by the user to obtain a source-language query set Q1 containing n1 queries;
the third expansion module is configured to perform query expansion on the source-language query input by the user to obtain a source-language query set Q2 containing n2 queries;
the merging processing module is configured to take the union of Q1 and Q2 to obtain the source-language query set Q containing n queries.
25. The device according to claim 19 or 21, characterized in that the optimization processing unit specifically looks up an error-correction corpus using the source-language query input by the user, and determines whether the corpus contains an erroneous query identical to that source-language query; if so, it determines all correct queries corresponding to that erroneous query, and selects from them the n1 correct queries with the highest error-correction probabilities to form the source-language query set Q1; otherwise, Q1 contains only the source-language query input by the user;
wherein the error-correction corpus comprises: query pairs collected in advance from search logs, each consisting of an erroneous query and a corresponding correct query, and the error-correction probability that the erroneous query is corrected to the corresponding correct query.
26. The device according to claim 20 or 21, characterized in that the optimization processing unit specifically performs word segmentation on the source-language query input by the user, determines the synonyms of each word obtained by segmentation by looking up a paraphrase resource of the source language, combines the segmented words and their synonyms, and takes, from the queries obtained by combination, the n2 queries with the highest expansion scores to form Q2;
the expansion score of a query is determined by the counted number of occurrences of that query in the paraphrase resource.
27. The device according to any one of claims 16 to 24, characterized in that the search processing unit is further configured to obtain the search results corresponding to the source-language query.
28. The device according to any one of claims 19, 20, 21 or 22, characterized in that the device further comprises:
a source request selection unit, configured to select one query from the source-language query set obtained after optimization processing by the optimization processing unit;
the search processing unit is further configured to obtain the search results corresponding to the query selected by the source request selection unit.
29. The device according to claim 28, characterized in that the selection strategy adopted by the source request selection unit comprises:
searching with the queries in the source-language query set obtained after optimization processing by the optimization processing unit one by one until a query whose search effect meets a preset requirement is found, and selecting that query; or
searching with every query in the source-language query set obtained after optimization processing by the optimization processing unit, and selecting the query with the best search effect.
30. The device according to claim 16, characterized in that the result integration unit comprises: a merging processing module, a deduplication processing module, and a ranking processing module;
the merging processing module is configured to merge the search results obtained by the search processing unit;
the deduplication processing module is configured to deduplicate the search results merged by the merging processing module to obtain a search result set;
the ranking processing module is configured to sort the search results in the search result set according to the rank of each search result within the category to which it belongs and the ranking weight of that category.
31. The device according to claim 30, characterized in that the ranking processing module specifically scores each search result in the search result set using its rank within the category to which it belongs and the ranking weight of that category, and sorts the search results from the highest score to the lowest;
wherein the score score(Rst) of a search result Rst is:
[formula image omitted in the source]
where M is the total number of search result categories, w_i is the ranking weight of the i-th category, and rank_i(Rst) is the rank of Rst within the i-th category.
32. The device according to claim 31, characterized in that the category to which a search result belongs is the language of the search result;
the ranking processing module specifically performs the following operations when determining the ranking weight of the i-th category: extracting features of the source-language query input by the user; computing the similarity between the extracted features and the feature vector of each language, and determining each language whose similarity exceeds a preset similarity threshold to be a mapping language of the source-language query input by the user; for a search result Rst, if the category of Rst is a mapping language, the ranking weight w_i of that category is a first set value a; if the category of Rst is the source language and the source language is not a mapping language, the ranking weight w_i of that category is a second set value b; if the category of Rst is neither a mapping language nor the source language, the ranking weight w_i of that category is a third set value c;
wherein a > b > c, and the feature vector of each language is trained in advance by mining existing resources of that language.
CN201110047892.7A 2011-02-28 2011-02-28 Cross-language searching method and device Active CN102651003B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110047892.7A CN102651003B (en) 2011-02-28 2011-02-28 Cross-language searching method and device
PCT/CN2011/083420 WO2012116562A1 (en) 2011-02-28 2011-12-03 Method and device for cross-language query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110047892.7A CN102651003B (en) 2011-02-28 2011-02-28 Cross-language searching method and device

Publications (2)

Publication Number Publication Date
CN102651003A true CN102651003A (en) 2012-08-29
CN102651003B CN102651003B (en) 2014-08-13

Family

ID=46693011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110047892.7A Active CN102651003B (en) 2011-02-28 2011-02-28 Cross-language searching method and device

Country Status (2)

Country Link
CN (1) CN102651003B (en)
WO (1) WO2012116562A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9678952B2 (en) 2013-06-17 2017-06-13 Ilya Ronin Cross-lingual E-commerce

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271461A (en) * 2007-03-19 2008-09-24 株式会社东芝 Cross-language retrieval request conversion and cross-language information retrieval method and system
CN101743544A (en) * 2007-05-16 2010-06-16 谷歌公司 Cross-language information retrieval
US20090193003A1 (en) * 2007-09-21 2009-07-30 Google Inc. Cross-Language Search

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103699545A (en) * 2012-09-28 2014-04-02 摩根全球购物有限公司 Network searching system and network searching method thereof
CN103268326A (en) * 2013-05-02 2013-08-28 百度在线网络技术(北京)有限公司 Personalized cross-language retrieval method and device
CN104679642A (en) * 2013-11-26 2015-06-03 国际商业机器公司 Language independent processing of logs in a log analytics system
US9852129B2 (en) 2013-11-26 2017-12-26 International Business Machines Corporation Language independent processing of logs in a log analytics system
US9881005B2 (en) 2013-11-26 2018-01-30 International Business Machines Corporation Language independent processing of logs in a log analytics system
CN104573019A (en) * 2015-01-12 2015-04-29 百度在线网络技术(北京)有限公司 Information searching method and device
CN104573019B (en) * 2015-01-12 2019-04-02 百度在线网络技术(北京)有限公司 Information retrieval method and device
CN105404688A (en) * 2015-12-11 2016-03-16 北京奇虎科技有限公司 Searching method and searching device
WO2017173827A1 (en) * 2016-04-06 2017-10-12 北京搜狗科技发展有限公司 Searching method, device and equipment
CN106777261A (en) * 2016-12-28 2017-05-31 深圳市华傲数据技术有限公司 Data query method and device based on multi-source heterogeneous data set
CN108304412A (en) * 2017-01-13 2018-07-20 北京搜狗科技发展有限公司 A kind of cross-language search method and apparatus, a kind of device for cross-language search
CN106919642A (en) * 2017-01-13 2017-07-04 北京搜狗科技发展有限公司 A kind of cross-language search method and apparatus, a kind of device for cross-language search
CN106919642B (en) * 2017-01-13 2021-04-16 北京搜狗科技发展有限公司 Cross-language search method and device for cross-language search
CN108304412B (en) * 2017-01-13 2022-09-30 北京搜狗科技发展有限公司 Cross-language search method and device for cross-language search
CN111400464A (en) * 2019-01-03 2020-07-10 百度在线网络技术(北京)有限公司 Text generation method, text generation device, server and storage medium
CN111400464B (en) * 2019-01-03 2023-05-26 百度在线网络技术(北京)有限公司 Text generation method, device, server and storage medium
CN109933724A (en) * 2019-03-07 2019-06-25 上海智臻智能网络科技股份有限公司 Knowledge searching method, system, question and answer system, electronic equipment and storage medium
CN109933724B (en) * 2019-03-07 2022-01-14 上海智臻智能网络科技股份有限公司 Knowledge search method, knowledge search system, question answering device, electronic equipment and storage medium
CN110110171A (en) * 2019-05-09 2019-08-09 上海泰豪迈能能源科技有限公司 Enterprise information searching method, device and electronic equipment
CN111666417A (en) * 2020-04-13 2020-09-15 百度在线网络技术(北京)有限公司 Method and device for generating synonyms, electronic equipment and readable storage medium
CN111666417B (en) * 2020-04-13 2023-06-23 百度在线网络技术(北京)有限公司 Method, device, electronic equipment and readable storage medium for generating synonyms

Also Published As

Publication number Publication date
WO2012116562A1 (en) 2012-09-07
CN102651003B (en) 2014-08-13

Similar Documents

Publication Publication Date Title
CN102651003B (en) Cross-language searching method and device
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN104376406B (en) A kind of enterprise innovation resource management and analysis method based on big data
CN101694670B (en) Chinese Web document online clustering method based on common substrings
CN102419778B (en) Information searching method for discovering and clustering sub-topics of query statement
CN102968465B (en) Network information service platform and the search service method based on this platform thereof
CN106537370A (en) Method and system for robust tagging of named entities in the presence of source or translation errors
CN103902652A (en) Automatic question-answering system
CN103970729A (en) Multi-subject extracting method based on semantic categories
CN103399901A (en) Keyword extraction method
CN101169780A (en) Semantic ontology retrieval system and method
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN103389988A (en) Method and device for guiding user to carry out information search
CN102682000A (en) Text clustering method, question-answering system applying same and search engine applying same
CN107066555A (en) Towards the online topic detection method of professional domain
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN106484797A (en) Accident summary abstracting method based on sparse study
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN102779135A (en) Method and device for obtaining cross-linguistic search resources and corresponding search method and device
CN105912662A (en) Coreseek-based vertical search engine research and optimization method
CN103646029A (en) Similarity calculation method for blog articles
CN103927339B (en) Knowledge Reorganizing system and method for knowledge realignment
CN105404677A (en) Tree structure based retrieval method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant