CN1707476A - Auxiliary translation searching engine system and method thereof - Google Patents


Publication number
CN1707476A
Authority
CN
China
Prior art keywords
web page
bilingual
webpage
module
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 200510018660
Other languages
Chinese (zh)
Inventor
程伟
陈智贤
贺方升
李银刚
孙上海
王沧洪
余俊
朱柳嵩
朱前线
Original Assignee
贺方升
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 贺方升 filed Critical 贺方升
Priority to CN 200510018660 priority Critical patent/CN1707476A/en
Publication of CN1707476A publication Critical patent/CN1707476A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an auxiliary translation search engine system and method, and relates to a system and method for mutual translation between multiple languages on the Internet. The method comprises the steps of: 1. a network robot fetches web pages and stores them in a source information library; 2. a web page index module builds a web page index library; 3. a web page identification and preprocessing module finds single pages or bilingual page pairs in the web page index library and preprocesses them; 4. sentence splitting and matching is performed; 5. the results are stored in a bilingual corpus; 6. an index is built over the stored, matched bilingual sentence pairs; 7. the system responds to the user's request and retrieves similar bilingual results and their source URLs; and 8. the bilingual results and source URLs are displayed on the client. The invention is applied to automatic translation in web search.

Description

Auxiliary translation searching engine system and method thereof
Technical field
The present invention relates to a system and method for mutual translation between multiple languages on the Internet. Specifically, it relates to a system and method in which a network robot (web crawler) continuously collects bilingual corpus data from the Internet, the data is processed, and computer-aided translation is realized in combination with a search engine.
Background technology
With China's accession to the WTO and its successful bid to host the Olympic Games, China's foreign exchanges are increasing, and the volume of foreign-language letters and articles that people face is growing accordingly. This requires people to master certain foreign-language skills, particularly writing and translation skills, which is extremely difficult for most people. In addition, professionals who specialise in a particular industry have to consult large amounts of foreign-language material in their work, and a limited specialised vocabulary seriously restricts their efficiency.
At present, some translation aids of this kind exist in China and abroad, but their results are unsatisfactory.
First, these translation tools take one of two forms. One uses a built-in dictionary: the user's input sentence is translated word by word, but the result often does not conform to the grammar of the foreign language and is of little use for the user's writing or reading. For example, the Chinese name of the State Intellectual Property Office of the People's Republic of China has long had an established official English translation, "State Intellectual Property Office of P.R. China", yet word-by-word translation turns it into "P.R. China Nation Knowledge Property Office", which is wrong. An effective way to find such established translations is to search online. The other form uses a corpus: the user's input sentence is looked up in a corpus, but because the corpus is accumulated by manually adding mutually translated sentence pairs, its capacity is limited. Even the better corpora at present contain only about 500,000 pairs.
Second, an important feature of translation is its repetitiveness. Studies show that, in content or sentence pattern, the repetition rate in an individual's translation work is about 30%, and across the whole Internet the rate may be higher. Building a bilingual corpus by hand therefore involves a great deal of duplicated labour without achieving the desired effect. Experts have also studied the automatic accumulation of bilingual corpora, for example Christopher C. Yang, "Mining English/Chinese Parallel Documents from the World Wide Web". However, the system studied in that paper uses only the title in the page's tags to fetch Chinese and English page pairs one by one for automatic corpus accumulation. It does not go on to use other features of page pairs, and it does not capture the large number of Internet pages that carry Chinese and English on the same page.
Usually, a user can look for an existing, appropriate translation of a word or phrase on the web by using a general-purpose search engine such as Baidu or Google. But these search engines are not designed for translation assistance; they demand considerable search skill from the user, and otherwise return thousands of results. Because the amount of information is so large, users get lost in it and cannot quickly obtain the results they actually need.
At this stage there is no good method for solving the above problems. Users can only read and write by the most primitive means of consulting a dictionary, which for people who do not know the foreign language is almost impossible.
A novelty search by the Novelty Search Centre of the Hubei Provincial Institute of Scientific and Technical Information (a national-level science and technology novelty search consulting unit) concluded that the commissioned subject, a translation search engine that supplies multiple matching translated sentences from the Web in its search results together with the corresponding links, is not covered in the domestic or foreign literature examined.
Summary of the invention
The purpose of the present invention is to overcome the problems and shortcomings of the prior art by providing an effective solution, namely an auxiliary translation search engine system and method.
The purpose is achieved as follows. A network robot continuously fetches web pages from the Internet and stores them in a database. The fetched pages are indexed, then extracted, identified, analysed and filtered; bilingual sentence-pair content that may exist is checked by matching, and only fully corresponding bilingual data is kept and stored in the database together with the source URL of the material. The bilingual data in the database is then indexed. The corpus accumulated in this way can be searched by users: when a user enters a keyword or sentence, the system responds quickly and returns reference example sentences identical or similar to the query. The source URLs and page titles of these bilingual example sentences are displayed as well, so that the user can click through to the corresponding pages for more information.
Specifically, the present invention comprises two parts, a system and a method:
1. System
As shown in Fig. 1, the system comprises the Internet A, server B, wireless network connection C, Internet connection D, mobile communication device client E, desktop computer client or browser F, mobile user G, and computer user H.
On one path, server B (connected to the Internet A), wireless network connection C, mobile communication device client E and mobile user G are connected in sequence.
On the other path, server B (connected to the Internet A), Internet connection D, desktop computer client or browser F and computer user H are connected in sequence.
Server B comprises a translation search engine server B1, a database server B2 and a retrieval server B3, connected in sequence.
The translation search engine server B1 comprises a network robot module B1.1, a web page index module B1.2, a web page identification and preprocessing module B1.3, and a sentence-splitting and matching module B1.4.
The database server B2 comprises a source information library B2.1, a web page index library B2.2 and a bilingual corpus B2.3.
The retrieval server B3 comprises an index module B3.1 and a retrieval module B3.2.
The network robot module B1.1 is a system module that fetches web page information from the Internet and enters it into the source information library B2.1. "Web page" here means any web data present on the Internet, such as HTML and XML pages.
The web page index module B1.2 is a system module that analyses the web page information kept in the source information library B2.1, builds an index that supports page identification, and enters it into the web page index library B2.2.
The web page identification and preprocessing module B1.3 is a system module that searches the web page index library B2.2 either for a single page containing bilingual information, or for a page purely in a first language that may have a counterpart in a second language, in which case it uses the web page index conditions to find the best-matching second-language page and forms a bilingual page pair. It then applies noise-removal filtering to the single page or the page pair, removes irrelevant information, and extracts the page content that may contain parallel bilingual translations.
The sentence-splitting and matching module B1.4 is a system module that applies a sentence-splitting and matching algorithm to the content extracted by module B1.3, divides it into corresponding bilingual sentence pairs, and enters them into the bilingual corpus B2.3 together with the URL and page title.
The source information library B2.1 is a database that stores the web page information fetched from the Internet.
The web page index library B2.2 is a database that stores the indexes and page text that support page identification and processing.
The bilingual corpus B2.3 is a database that stores the bilingual sentence-pair information used for translation assistance.
The three databases B2.1, B2.2 and B2.3 use a general-purpose database such as MySQL, SQL Server or Oracle.
The index module B3.1 is a system module that builds an index over the bilingual sentence pairs after they have been matched and stored.
The retrieval module B3.2 is a system module through which a user submits, from any of the clients, the sentence to be translated to server B; server B processes it and returns the similar Chinese and English results and their URLs (web page addresses) to the client interface.
2. Method
As shown in Fig. 2, the method realizes an auxiliary translation search engine and adopts the following steps:
1. the network robot automatically fetches web pages and stores them in the source information library (1);
2. the web page index module builds the web page index library (2);
3. the web page identification and preprocessing module finds single pages or bilingual page pairs in the web page index library and preprocesses them (3);
4. sentence splitting and matching is performed (4);
5. the results are stored in the bilingual corpus (5);
6. an index is built over the stored, matched bilingual sentence pairs (6);
7. the system responds to the user's request and quickly retrieves similar bilingual results and their source URLs (7);
8. the similar bilingual results and their source URLs are displayed on the various clients (8).
Step 1, automatic fetching of web pages into the source information library (1): a network robot running on server B continuously fetches web page information from the Internet A, and the extracted information and the URL of each page are stored in the database running on server B;
Step 2, building the web page index library (2): the web page information in the source information library B2.1 is read, and the web page index module B1.2 on server B builds the web page index library B2.2, which supports page identification and preprocessing;
Step 3, finding and preprocessing single pages or bilingual page pairs (3): from the web page index library B2.2, either a single page containing bilingual information is read, or a page purely in a first language that may have a second-language counterpart is read and the best-matching second-language page is found by the web page index conditions to form a bilingual page pair; noise-removal filtering is then applied to the single page or page pair, irrelevant information is removed, and the page content that may contain parallel bilingual translations is extracted;
Step 4, sentence splitting and matching (4): the sentence-splitting and matching algorithm is applied to the page content that has been identified and preprocessed, dividing it into corresponding bilingual sentence pairs;
Step 5, storing in the bilingual corpus (5): the bilingual sentence pairs produced by sentence splitting and matching are stored in the bilingual corpus B2.3 running on server B;
Step 6, indexing the stored, matched bilingual sentence pairs (6): an index is built over the bilingual sentence pairs in the bilingual corpus B2.3 to speed up query response;
Step 7, responding to the user's request (7): the query sentence entered by the user is looked up in the index, bilingual results identical or similar to the user's request are found, and their source URLs are obtained;
Step 8, displaying the results on the various clients (8): the bilingual results found and their corresponding Internet addresses are returned to the client the user is using.
Working principle of the present invention
Referring to Fig. 2, a network robot program running on server B fetches web page resources from the Internet A and stores them in the source information library B2.1. A web page index is built, and single pages or bilingual page pairs are found in the web page index library B2.2. The pages processed in these steps are then cleaned of noise and filtered, and the cleaned bilingual page content undergoes sentence splitting and matching (4), which divides it into corresponding bilingual sentence pairs that are entered into the bilingual corpus B2.3. In the bilingual corpus B2.3, an index is built over the stored, matched sentence pairs to make searching easier. Through the various clients, such as the mobile communication device client E and the desktop computer client or browser F, the user submits to server B the sentence to be translated; the matching results are found and displayed through the user interface. On the results page, the Chinese, the English, the source URL on the Internet and the corresponding page title are displayed together in a side-by-side comparison.
The present invention has the following advantages and positive effects:
1. The invention uses a network robot to fetch web page information from the Internet A, cleans and filters it, extracts the bilingual page content it contains, and verifies it by matching, thereby obtaining correctly aligned bilingual data to offer users for translation and lookup. Its advantage is that the accumulation of the bilingual corpus B2.3 is fully automatic, unlike the usual manual way of adding to a corpus, so the size limit of manually built corpora is overcome and large-scale accumulation of a bilingual corpus is achieved. In addition, when searching, the user can click the source URL of a translation to open the page that contains it.
2. The invention can also produce further positive effects. This way of accumulating a bilingual corpus breaks with the traditional manual approach and brings a technical innovation. The accumulated bilingual corpus also has multiple uses: besides the web search translation engine, it can be used for comparative language studies, translation conversion, research on translated language and on translation, automatic translation, bilingual dictionary compilation, and translation teaching.
Description of drawings
Fig. 1 - schematic diagram of the composition of the system of the invention;
Fig. 2 - flow chart of the method of the invention;
Fig. 3 - connection diagram of the Internet, the network robot module and the source information library;
Fig. 4 - detailed flow chart of the web page identification and preprocessing module;
Fig. 5 - sample web page index table;
Fig. 6 - detailed flow chart of the sentence-splitting and matching module;
Fig. 7 - flow chart of user page generation.
Wherein:
A - Internet.
B - server, comprising:
B1 - translation search engine server,
B1.1 - network robot module,
B1.2 - web page index module,
B1.3 - web page identification and preprocessing module,
B1.4 - sentence-splitting and matching module;
B2 - database server,
B2.1 - source information library,
B2.2 - web page index library,
B2.3 - bilingual corpus;
B3 - retrieval server,
B3.1 - index module,
B3.2 - retrieval module.
C - wireless network connection.
D - Internet connection.
E - mobile communication device client.
F - desktop computer client or browser.
G - mobile user.
H - computer user.
1 - the network robot automatically fetches web pages and stores them in the source information library;
2 - the web page index module builds the web page index library;
3 - the web page identification and preprocessing module finds single pages or bilingual page pairs in the web page index library and preprocesses them;
4 - sentence splitting and matching;
5 - store in the bilingual corpus;
6 - build an index over the stored, matched bilingual sentence pairs;
7 - respond to the user's request and quickly retrieve similar bilingual results and their source URLs;
8 - display the similar bilingual results and their source URLs on the various clients;
10 - read a page from the web page index library;
11 - page type identification;
12 - purification,
12.1 - preliminary filtering, 12.2 - full filtering, 12.3 - tree building, 12.4 - parse the tree to obtain the result;
13 - search for the corresponding Chinese page by the web page index conditions;
14 - compare the page pair;
15 - analyse the page pair to obtain the result;
16 - submit to the sentence-splitting and matching module for processing;
17 - sample web page index table;
18 - Chinese article paragraph;
19 - English article paragraph;
20 - sentence splitting unit;
21 - multiple Chinese sentences (queue);
22 - multiple English sentences (queue);
23 - the sentence matching unit computes the matching evaluation value of a Chinese-English sentence pair;
24 - V ≥ threshold;
25 - the user submits the content to be translated in the user interface;
26 - retrieval;
27 - return the corresponding Chinese, English and source URL and display them on the user interface.
Embodiment
The relevant steps of the method and their practical application are described further below.
For convenience of description, Chinese and English are used here as the two languages, but the invention is not limited to this language pair.
Step 1, automatic fetching of web pages into the source information library (1) (see Fig. 3): the network robot is given an Internet address and automatically fetches the page information at that address together with the content behind the other links contained in that page; the fetched page information and the corresponding Internet addresses are stored in the source information library B2.1.
For example, the network robot is given the Internet address
http://www.51education.net/Article_Show.asp?ArticleID=2402
The page at this address contains many text links. The network robot fetches all the content of the page at this address and also fetches the content behind every link on the page, that is, the pages of the whole website to which the address belongs; the fetched content and Internet addresses are saved together in the source information library B2.1.
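The crawl described in step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patent's implementation: the names `crawl`, `LinkExtractor`, `fetch` and `max_pages` are hypothetical, the `fetch` callable (which would wrap an HTTP client in practice) is injected so the sketch runs without network access, and the returned dictionary merely stands in for the source information library B2.1.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl of the seed's site; returns {url: html}.

    `fetch(url)` is a caller-supplied function returning the page HTML
    or None (assumption: the text does not say how pages are downloaded).
    """
    host = urlparse(seed_url).netloc
    source_store = {}  # stands in for source information library B2.1
    queue = deque([seed_url])
    while queue and len(source_store) < max_pages:
        url = queue.popleft()
        if url in source_store:
            continue
        html = fetch(url)
        if html is None:
            continue
        source_store[url] = html  # page content saved together with its URL
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            # Follow only links that stay on the seed's website.
            if urlparse(target).netloc == host and target not in source_store:
                queue.append(target)
    return source_store
```

Restricting the crawl to the seed's host mirrors the example above, where the robot ends up with the pages of the whole website behind the given address; the `max_pages` cap is an added safety limit.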
Step 2, building the web page index library (2): the page information fetched by the network robot is processed, and an index is built of the page's feature information (URL, domain, file name, page title, page type, etc.). The web page index module B1.2 is the module responsible for extracting the hyperlink URL of the page, analysing the language type of its text, analysing its other feature values, and determining each index entry.
For example, after the network robot has fetched the page whose URL is
http://www.snda.com/en/about/overview.htm
the web page index module builds the corresponding index for this page, as shown in Fig. 5, and stores it in the web page index library B2.2.
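One possible shape for such an index entry is sketched below. The field names follow the features listed above (URL, domain, file name, page title, page type), but the function names and the character-share heuristic for detecting the page type, including its 0.1 and 0.9 cut-offs, are assumptions made for illustration; the actual layout of the table in Fig. 5 is not reproduced here.

```python
import posixpath
import re
from urllib.parse import urlparse

TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def detect_page_type(text, low=0.1, high=0.9):
    """Classify text by the share of Chinese characters among its letters.

    The thresholds `low` and `high` are hypothetical values.
    """
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = cjk + latin
    if total == 0:
        return "unknown"
    share = cjk / total
    if share >= high:
        return "chinese"
    if share <= low:
        return "english"
    return "bilingual"


def build_index_entry(url, html):
    """Return one index record for a fetched page."""
    parsed = urlparse(url)
    match = TITLE.search(html)
    return {
        "url": url,
        "domain": parsed.netloc,
        "filename": posixpath.basename(parsed.path),
        "title": match.group(1).strip() if match else "",
        # Tags are stripped crudely before the language check.
        "page_type": detect_page_type(TAG.sub(" ", html)),
    }
```

For the URL in the example above, the entry would carry the domain `www.snda.com` and the file name `overview.htm`, which are the two fields the later page-pair search relies on.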
Step 3, finding and preprocessing single pages or bilingual page pairs (3) (see Fig. 4): pages are read from the web page index library B2.2 and classified according to the page type field recorded in the library.
If the page is of the type that carries both languages on one page, it enters purification (12).
The detailed purification (12) flow is as follows:
1. first, preliminary filtering (12.1) is applied to the page information: useless information that may exist in the page is removed, and the content left after preliminary filtering is saved in a temporary file;
2. the temporary file obtained from preliminary filtering undergoes full filtering (12.2), which keeps only the paragraphs that may contain bilingual sentence pairs;
3. an XML (Extensible Markup Language) tree is built from the identified bilingual sentence-pair paragraphs;
4. the XML tree is analysed and all redundant information is filtered out, leaving only the bilingual sentence pairs of the page.
For example, the page at the Internet address
http://www.51education.net/Article_Show.asp?ArticleID=240
first passes through preliminary filtering (12.1), which removes the link labelled "English lyrics translation", the picture labelled "Free QQ giveaway" and similar items, and keeps only the body text of the article "Study (Chinese-English parallel text)". Full filtering (12.2) then removes useless information in the body text such as "Author: Carefree Education" and "Reposted from: www.51education.org". An XML tree is then built from the body text, the tree is analysed, and redundant information is filtered out, leaving only the bilingual sentence pairs of the page, that is, a Chinese sentence meaning "Shallow knowledge is like treading on thin ice" followed by "A little learning is a dangerous thing.", a Chinese sentence meaning "The beauty of things lies in the mind of the beholder" followed by "Beauty in things exists in the mind which contemplates them.", and so on.
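The purification flow above can be approximated by a much simpler sketch. Note the substitution: instead of building and analysing an XML tree as the text describes, this version makes one flat pass over the HTML, drops script, style and link text (a stand-in for preliminary filtering), and then keeps only text blocks that contain both Chinese characters and Latin words (a stand-in for full filtering). The names `BodyTextExtractor` and `purify` are hypothetical.

```python
import re
from html.parser import HTMLParser

CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


class BodyTextExtractor(HTMLParser):
    """Drops script, style and link text; collects every other text block."""

    SKIPPED = {"script", "style", "a"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.blocks.append(text)


def purify(html):
    """Return the text blocks that look like bilingual sentence pairs."""
    extractor = BodyTextExtractor()
    extractor.feed(html)
    extractor.close()
    # Keep a block only if it mixes Chinese characters with Latin words.
    return [b for b in extractor.blocks
            if CJK.search(b) and LATIN_WORD.search(b)]
```

A block such as an author credit written only in Chinese is discarded by the final test, while a paragraph holding a Chinese sentence and its English counterpart is kept. Real pages would need more careful rules than this two-condition check.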
If the page is of the pure-English type, it goes in turn through the following preprocessing: searching for the corresponding Chinese page by the web page index conditions, comparing the page pair, and analysing the page pair to obtain the result. Searching for the corresponding Chinese page by the web page index conditions means using the URL of the English page to look, within the same domain, for a Chinese page whose file name is identical or similar to that of the English page.
For example, the index entry in the web page index library B2.2 for the pure-English "Shanda overview" page is shown in Fig. 5. Its URL is http://www.snda.com/en/about/overview.htm, its domain is www.snda.com, and its file name is overview.htm. Searching by its domain finds the Chinese page with the corresponding file name overview.htm, whose URL is
http://www.snda.com/cs/about/overview.htm
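The search for the Chinese counterpart can be sketched as a filter over index entries. The entry fields, the function name `find_counterpart`, the use of `difflib.SequenceMatcher` to judge "similar" file names, and the 0.8 similarity cut-off are all assumptions for illustration; the text only states that the domain must be the same and the file name identical or similar.

```python
from difflib import SequenceMatcher


def find_counterpart(english_entry, index, min_similarity=0.8):
    """Return the Chinese index entry that best matches an English page.

    Candidates must share the English page's domain; among those, the
    one with the most similar file name wins. Returns None if no
    candidate reaches `min_similarity` (a hypothetical cut-off).
    """
    best, best_score = None, 0.0
    for entry in index:
        if entry["page_type"] != "chinese":
            continue
        if entry["domain"] != english_entry["domain"]:
            continue
        score = SequenceMatcher(
            None, entry["filename"], english_entry["filename"]
        ).ratio()
        if score >= min_similarity and score > best_score:
            best, best_score = entry, score
    return best
```

With the two snda.com entries from the example in the index, the English page `.../en/about/overview.htm` would be paired with `.../cs/about/overview.htm`, since both share the domain and the file name `overview.htm`.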
Step 4, sentence splitting and matching (4), proceeds as follows (see Fig. 6):
1. The purified and preprocessed page (one Chinese paragraph corresponding to one English paragraph) is split by paragraph into multiple sentence units.
For example, the page at the Internet address http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467 contains the following parallel Chinese and English paragraphs.
The Chinese paragraph (given here in English translation) is: What is time? Is it something that can be saved, spent or wasted, like money? Or is it, like the weather, something we cannot control? Is time the same all over the world? You may say that is a simple question: wherever you go, a minute is 60 seconds, an hour is 60 minutes, a day is 24 hours, and so on. Well, maybe. But in the United States, time means more than that. Americans regard time as an important resource, and perhaps that is why they like the saying "Time is money".
The English paragraph is: What is time? Is it a thing to be saved or spent or wasted, like money? Or is it something we have no control over, like the weather? Is time the same all over the world? That's an easy question, you say. Wherever you go, a minute is 60 seconds, an hour is 60 minutes, a day is 24 hours, and so forth. Well, maybe. But in America, time is more than that. Americans see time as a valuable resource. Maybe that's why they are fond of the expression, "Time is money."
After sentence splitting, the Chinese paragraph above is divided into 7 Chinese sentences:
What is time?
Is it something that can be saved, spent or wasted, like money?
Or is it, like the weather, something we cannot control?
Is time the same all over the world?
You may say that is a simple question: wherever you go, a minute is 60 seconds, an hour is 60 minutes, a day is 24 hours, and so on.
Well, maybe.
But in the United States, time means more than that. Americans regard time as an important resource, and perhaps that is why they like the saying "Time is money".
After sentence splitting, the English paragraph above is divided into 10 English sentences:
What is time?
Is it a thing to be saved or spent or wasted, like money?
Or is it something we have no control over, like the weather?
Is time the same all over the world?
That’s an easy question, you say.
Wherever you go, a minute is 60 seconds, an hour is 60 minutes, a day is 24 hours, and so forth.
Well, maybe.
But in America, time is more than that.
Americans see time as a valuable resource.
Maybe that’s why they are fond of the expression, "Time is money."
2. The Chinese and English sentences produced by the splitting above are kept in their original order, and the matching verification algorithm is called to determine which sentence pairs reach a satisfactory matching rate. Seven cases are used to match sentence pairs (a sentence pair means X Chinese sentences corresponding to Y English sentences, that is, these X Chinese sentences and Y English sentences correspond and express the same meaning). The seven cases are (number of Chinese sentences to number of English sentences): 1 to 0, 0 to 1, 1 to 1, 1 to 2, 2 to 1, 1 to 3, and 3 to 1, which yields seven evaluation values.
For example, the matching rates calculated in this second step for the split sentences are as follows ("sentence 0" in the example means no sentence):
Matching rate of Chinese sentence 1 to English sentence 0: 0.0
Matching rate of Chinese sentence 0 to English sentence 1: 0.0
Matching rate of Chinese sentence 1 to English sentence 1: 0.15384615384615385
Matching rate of Chinese sentence 1 to English sentences 1, 2: 0.007692307692307693
Matching rate of Chinese sentences 1, 2 to English sentence 1: 0.010636499479268863
Matching rate of Chinese sentence 1 to English sentences 1, 2, 3: 0.0025380710659898475
Matching rate of Chinese sentences 1, 2, 3 to English sentence 1: 0.00654321287503227
The matching rates show that Chinese sentence 1 to English sentence 1 scores highest, so the two are formed into a sentence pair and saved as one record. Chinese sentence 1 and English sentence 1 are then removed, and the matching rates of the seven cases are calculated again in the same way; repeating this yields all the matched bilingual sentence pairs.
3. The highest V (evaluation value) is taken. If this highest value meets the threshold, the sentences are judged to be a matching sentence pair. (The threshold is a number obtained from extensive statistics: any sentence pair whose evaluation value is greater than this number is regarded as corresponding, and otherwise as not corresponding. The threshold obtained from extensive statistics is 0.02401435932272006.) In this way the matching rate is verified whenever bilingual sentence pairs are read whose meanings do not necessarily match completely.
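The three-part flow above (split, score the seven cases, keep pairs that meet a threshold) can be sketched as follows. The evaluation function is not given in this section, so `length_score` is a purely hypothetical stand-in based on length ratio, and its scale differs from the matching rates quoted above; for that reason the sketch uses its own illustrative threshold of 0.5 rather than the patent's 0.02401435932272006. The punctuation-based splitter is likewise a simplification that would mis-split abbreviations and decimals.

```python
import re

# The seven cases: (number of Chinese sentences, number of English sentences).
PATTERNS = [(1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (1, 3), (3, 1)]

_SENTENCE = re.compile(r'[^.?!。?!]+(?:[.?!。?!]+["”’]*|$)')


def split_sentences(paragraph):
    """Split on sentence-final punctuation (ASCII or full-width)."""
    return [s.strip() for s in _SENTENCE.findall(paragraph) if s.strip()]


def length_score(zh_group, en_group, zh_chars_per_en_word=1.6):
    """Hypothetical stand-in score: how well the lengths agree (0..1).

    The ratio 1.6 Chinese characters per English word is an assumption.
    """
    zh_len = sum(1 for s in zh_group for ch in s if "\u4e00" <= ch <= "\u9fff")
    en_len = sum(len(s.split()) for s in en_group) * zh_chars_per_en_word
    if zh_len == 0 or en_len == 0:
        return 0.0
    return min(zh_len, en_len) / max(zh_len, en_len)


def align(zh_sentences, en_sentences, score=length_score, threshold=0.5):
    """Greedy alignment: score the seven cases at the current position,
    take the best one, keep it if it meets the threshold, then advance."""
    pairs = []
    i = j = 0
    while i < len(zh_sentences) or j < len(en_sentences):
        candidates = []
        for n_zh, n_en in PATTERNS:
            if i + n_zh <= len(zh_sentences) and j + n_en <= len(en_sentences):
                value = score(zh_sentences[i:i + n_zh], en_sentences[j:j + n_en])
                candidates.append((value, n_zh, n_en))
        best, n_zh, n_en = max(candidates, key=lambda c: c[0])
        if best >= threshold:
            pairs.append(("".join(zh_sentences[i:i + n_zh]),
                          " ".join(en_sentences[j:j + n_en]),
                          best))
        i += n_zh
        j += n_en
    return pairs
```

Passing a different `score` function and `threshold` lets the same loop be reused with a scoring method closer to the one the text has in mind. The greedy choice never revisits an earlier decision, which matches the remove-and-recalculate procedure described in part 2.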
5. described step deposits bilingualism corpora 5 in, and soon the Sino-British sentence after the subordinate sentence matching treatment is to depositing among the bilingualism corpora B2.3 that operates on the server B;
For example: after the checking of subordinate sentence coupling, in bilingualism corpora B2.3, exist:
Record one
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
What is time?
What is time?
Record two
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Be a kind of as money can save, the flower with or the waste thing?
Is it a thing to be saved or spent or wasted, like money?
Record three
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Is perhaps it a kind of thing that we can't grasp as weather?
Or is it something we have no control over, like the weather?
Record four
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Whether all the same is the global time?
Is time the same all over the world?
Record five
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
You can say that is a simple question, no matter you go there, one minute all is 60 seconds, and one hour is 60 minutes, and one day is 24 hours, by that analogy.
That’s an easy question, you say. Wherever you go, a minute is 60 seconds, an hour is 60 minutes, a day is 24 hours, and so forth.
Record six
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Uh, maybe.
Well, maybe.
Record seven
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
But in the U.S., just that's what it all adds up to for the meaning of time.
Americans see time as a valuable resource.
Record eight
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Be an important resource between American's apparent time, perhaps Here it is, and why they like the cause of " Time is money ".
Maybe that’s why they are fond of the expression, "Time is money."
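Each record above consists of a source URL, a Chinese sentence, and an English sentence. A minimal sketch of how such records might be stored is given below. It assumes an SQLite database; the table name `bilingual_corpus`, the column names, and the helper names `open_corpus` and `store` are hypothetical, since the text does not specify a schema.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class BilingualRecord:
    source_url: str  # web page the sentence pair came from
    chinese: str     # Chinese sentence
    english: str     # English sentence


def open_corpus(path: str = ":memory:") -> sqlite3.Connection:
    """Open the corpus database and create the table if needed (assumed schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bilingual_corpus ("
        "id INTEGER PRIMARY KEY, "
        "source_url TEXT NOT NULL, "
        "chinese TEXT NOT NULL, "
        "english TEXT NOT NULL)"
    )
    return conn


def store(conn: sqlite3.Connection, record: BilingualRecord) -> int:
    """Insert one sentence pair and return its row id."""
    cur = conn.execute(
        "INSERT INTO bilingual_corpus (source_url, chinese, english) "
        "VALUES (?, ?, ?)",
        (record.source_url, record.chinese, record.english),
    )
    conn.commit()
    return cur.lastrowid


# Record six from the example; the Chinese sentence is shown here in the
# English rendering used by this document.
record_six = BilingualRecord(
    source_url="http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467",
    chinese="Uh, maybe.",
    english="Well, maybe.",
)
```

Keeping the source URL in the same row as the sentence pair means a later query can return the pair and its origin in a single lookup.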
Said step 6, building an index over the bilingual corpus 6, means indexing the records in the bilingual corpus B2.3 in order to speed up retrieval and querying;
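One common way to index sentence records is an inverted index that maps each token to the records containing it. The sketch below illustrates that general technique and is not necessarily the index structure used by the invention; the tokenization rule (lower-cased Latin words, and each Chinese character as its own token) and the function names are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Split text into lower-cased Latin words and single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def build_index(sentences: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every token to the set of record ids whose sentence contains it."""
    index = defaultdict(set)
    for record_id, sentence in sentences.items():
        for token in tokenize(sentence):
            index[token].add(record_id)
    return dict(index)


# English sentences of records four and six from the example above.
index = build_index({
    4: "Is time the same all over the world?",
    6: "Well, maybe.",
})
```

A lookup such as `index.get("maybe", set())` then touches only the records that contain the token instead of scanning the whole corpus.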
Said step 7, responding to the user's request and quickly retrieving similar bilingual results and their source URLs 7: the user enters the sentence or word to be looked up through the mobile communication device client E or the desktop computer client F provided by the system; once the system has obtained the user's sentence or word, it queries the index file, retrieves identical or similar bilingual results, and obtains their source URLs;
For example: the user enters "whether all the same the global time is" and runs a translation query; the result of record four in the step 5 example is returned:
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Whether all the same is the global time?
Is time the same all over the world?
For example: the user enters "maybe" and runs a translation query; the result of record six in the step 5 example is returned:
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Uh, maybe.
Well, maybe.
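The two query examples above return the stored pair that is identical or closest to the user's input. The sketch below illustrates one way to rank records by closeness, using the character-level similarity ratio from Python's standard `difflib` module. This measure is a stand-in chosen for illustration, not the retrieval method of the invention, and for brevity it scans the records linearly instead of consulting an index. The function name `search` and the parameters `limit` and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# (source URL, Chinese sentence, English sentence)
Record = Tuple[str, str, str]

URL = "http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467"

# Records four and six from the example; the Chinese sentences are shown
# in the English rendering used by this document.
RECORDS: List[Record] = [
    (URL, "Whether all the same is the global time?",
     "Is time the same all over the world?"),
    (URL, "Uh, maybe.", "Well, maybe."),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings, from 0 to 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(query: str, records: List[Record],
           limit: int = 3, min_ratio: float = 0.5) -> List[Record]:
    """Return up to `limit` records whose Chinese or English side is
    close enough to the query, best match first."""
    scored = []
    for record in records:
        _, chinese, english = record
        score = max(similarity(query, chinese), similarity(query, english))
        if score >= min_ratio:
            scored.append((score, record))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [record for _, record in scored[:limit]]
```

Because each record carries its source URL, every hit returned by `search` already includes the address needed for display.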
Said step 8, displaying similar bilingual results and their source URLs 8 on the various clients: the bilingual sentence pairs retrieved on server B that are identical or similar to the user's input are displayed, together with their source URLs, on the mobile communication device client E and the desktop computer client F provided by the system. The displayed results also include the web page title and the hyperlink of the corresponding source; clicking either the bilingual result or the source URL opens the Internet web page that the bilingual result came from.
For example: the user enters "maybe"; after the query, the page displays:
Uh, maybe.
Well, maybe.
http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467
Clicking the hyperlink above opens the Internet web page that this bilingual result came from.
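The display step can be illustrated with a small rendering function. It is a sketch only: the HTML layout and the name `render_result` are assumptions, as the text specifies what is shown (the sentence pair, the web page title, and a clickable source URL) but not how the page is marked up.

```python
from html import escape


def render_result(url: str, chinese: str, english: str, title: str = "") -> str:
    """Render one bilingual result as an HTML fragment in which both the
    sentence pair and the source URL link to the source web page."""
    href = escape(url, quote=True)
    lines = []
    if title:
        lines.append(f"<h3>{escape(title)}</h3>")
    lines.append(f'<p><a href="{href}">{escape(chinese)}</a></p>')
    lines.append(f'<p><a href="{href}">{escape(english)}</a></p>')
    lines.append(f'<p><a href="{href}">{escape(url)}</a></p>')
    return "\n".join(lines)


fragment = render_result(
    "http://www.oxford.com.cn/Article_Show.asp?ArticleID=1467",
    "Uh, maybe.",
    "Well, maybe.",
)
```

Escaping the sentence text before it is placed in the page keeps markup found in crawled web pages from being interpreted by the client.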

Claims (6)

1. An auxiliary translation searching engine system, comprising the Internet (A), a server (B), a wireless network connection (C), an Internet network connection (D), a mobile communication device client (E), a desktop computer client or browser (F), a mobile user (G), and a computer user (H);
On one path, the server (B) connected to the Internet (A), the wireless network connection (C), the mobile communication device client (E), and the mobile user (G) are connected in sequence;
On the other path, the server (B) connected to the Internet (A), the Internet network connection (D), the desktop computer client or browser (F), and the computer user (H) are connected in sequence;
Characterized in that:
Said server (B) comprises a translation search engine server (B1), a database server (B2), and a retrieval server (B3), connected in sequence;
Wherein the translation search engine server (B1) comprises a network robot module (B1.1), a web page index module (B1.2), a web page identification and preprocessing module (B1.3), and a sentence matching module (B1.4);
Wherein the database server (B2) comprises a source information library (B2.1), a web page index library (B2.2), and a bilingual corpus (B2.3);
Wherein the retrieval server (B3) comprises an index module (B3.1) and a retrieval module (B3.2).
Said network robot module (B1.1) is a system module that fetches web page information from the Internet and stores it in the source information library (B2.1);
Said web page index module (B1.2) is a system module that analyzes the web page information stored in the source information library (B2.1), builds indexes that help web page identification, and stores them in the web page index library (B2.2);
Said web page identification and preprocessing module (B1.3) is a system module that searches the web page index library (B2.2) for single web pages containing bilingual information, or for pure first-language web pages that may have a second-language counterpart, in which case it uses the web page index conditions to find the best-matching second-language web page and forms a bilingual web page pair; it then filters the noise from the single web page or bilingual web page pair, removes the irrelevant information from the web pages, and extracts the web page content that may contain a bilingual translation;
Said sentence matching module (B1.4) is a system module that applies the sentence matching algorithm to the web page content extracted by the web page identification and preprocessing module (B1.3), divides it into corresponding bilingual sentence pairs, and stores them, together with the URL and web page title, in the bilingual corpus (B2.3);
Said source information library (B2.1) is a database that stores the web page information fetched from the Internet;
Said web page index library (B2.2) is a database that stores the indexes that help web page identification and processing, together with the web page text;
Said bilingual corpus (B2.3) is a database that stores the bilingual record information that can assist translation;
Said index module (B3.1) is a system module that builds an index over the bilingual sentence pairs after they have been matched and stored;
Said retrieval module (B3.2) is a system module through which the user submits the sentence to be translated to the server (B) from any of the clients, the server (B) processes it, and the similar Chinese-English results and their web page addresses are returned to the client interface.
2. A method for realizing an auxiliary translation searching engine, characterized by the following steps:
1. A network robot automatically fetches web pages and stores them in the source information library (1);
2. The web page index library is built with the web page index module (2);
3. The web page identification and preprocessing module finds single web pages or bilingual web page pairs in the web page index library and preprocesses them (3);
4. Sentence matching is performed (4);
5. The matched sentence pairs are stored in the bilingual corpus (5);
6. An index is built over the bilingual sentence pairs after they have been matched and stored (6);
7. The user's request is answered by quickly retrieving similar bilingual results and their source URLs (7);
8. The similar bilingual results and their source URLs are displayed on the various clients (8).
3. The method for realizing an auxiliary translation searching engine according to claim 2, characterized in that:
Building the web page index library with the web page index module (2) means processing the web page information fetched by the network robot and building an index of the relevant feature information of each web page.
4. The method for realizing an auxiliary translation searching engine according to claim 2, characterized in that:
Finding single web pages or bilingual web page pairs in the web page index library with the web page identification and preprocessing module and preprocessing them (3) means reading web pages from the web page index library (B2.2), classifying and identifying them according to the web page type field recorded in the web page index library (B2.2), and then purifying or preprocessing the single web pages or bilingual web page pairs.
5. The method for realizing an auxiliary translation searching engine according to claim 2, characterized in that:
The sentence matching (4) proceeds as follows:
1. The purified and preprocessed web pages are split into paragraphs and then into multiple sentence units;
2. The Chinese sentences and English sentences segmented above are kept in their original order, and the matching verification algorithm is called to determine which sentence pairs have a satisfactory matching rate;
3. The case with the highest evaluation value V is taken; if this highest evaluation value meets the threshold, the sentences are judged to be a matching sentence pair.
6. The method for realizing an auxiliary translation searching engine according to claim 2, characterized in that:
When the similar bilingual results and their source URLs are displayed on the various clients (8), the displayed results also include the web page title and the hyperlink of the corresponding source; clicking either the bilingual result or the URL opens the Internet web page that the bilingual result came from.
CN 200510018660 2005-05-06 2005-05-06 Auxiliary translation searching engine system and method thereof Pending CN1707476A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200510018660 CN1707476A (en) 2005-05-06 2005-05-06 Auxiliary translation searching engine system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200510018660 CN1707476A (en) 2005-05-06 2005-05-06 Auxiliary translation searching engine system and method thereof

Publications (1)

Publication Number Publication Date
CN1707476A true CN1707476A (en) 2005-12-14

Family

ID=35581399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200510018660 Pending CN1707476A (en) 2005-05-06 2005-05-06 Auxiliary translation searching engine system and method thereof

Country Status (1)

Country Link
CN (1) CN1707476A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication