CN101933017B - Document search device, document search system, and document search method - Google Patents

Info

Publication number
CN101933017B
Authority
CN
China
Prior art keywords
keyword
translation
document
score
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2009800000314A
Other languages
Chinese (zh)
Other versions
CN101933017A (en)
Inventor
小岛荣之
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mitsubishi Electric Information Systems Corp
Mitsubishi Electric Information Technology Corp
Original Assignee
Mitsubishi Electric Information Systems Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mitsubishi Electric Information Systems Corp
Publication of CN101933017A
Application granted
Publication of CN101933017B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/93Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Provided are a document search device, a document search method, a document search system, and a document search program in which, when a document is searched for by using an inputted keyword and a translated keyword, the order of priority of the documents outputted as the result of the search can be adequately determined. A keyword translation means (22) of a document search device (10) translates an inputted keyword to obtain a translated keyword. A keyword score determining means (23) determines a keyword score for each of the inputted keyword and translated keyword. A document search means (24) searches for a document according to the inputted and translated keywords to acquire a plurality of documents. A document score calculating means (25) calculates a document score for each of the search result documents from the keyword scores and the frequency of appearance of each of the keywords. A search result output means (26) places the search result documents in order of document score and performs the output according to the order.

Description

Document search device, document search system, and document search method
Technical field
The present invention relates to a document search device and a document search method that search for documents using keywords, and in particular to a document search device and a document search method that search for documents using translated keywords. The present invention also relates to a document search system that includes such a document search device, and to a document search program that causes a computer to function as such a document search device or document search system.
Background art
In document search systems whose document database contains documents in multiple languages, it is known to translate an input keyword into another language and use the translation as a search keyword. Patent Document 1 describes an example of such a system: a keyword specified in Japanese is translated into English, Japanese documents are searched using the Japanese keyword, and English documents are searched using the English keyword.
Patent Document 1: Japanese Unexamined Patent Application Publication No. H10-232883
Summary of the invention
In the prior art, when a search is performed in multiple languages, the priority of the documents output as search results cannot be determined appropriately.
Because words are generally polysemous, a translation of a keyword entered in the user's native language into another language is not necessarily the best choice. Therefore, when priorities are determined for the documents in a search result list, the priority of a document that contains a translated keyword, for example, sometimes cannot be determined appropriately.
To solve this problem, an object of the present invention is to provide a document search device and a document search method that can appropriately determine the priority of the documents output as search results when documents are searched using both an input keyword and its translated keywords.
A further object of the present invention is to provide a document search system that includes such a document search device.
The present invention provides a document search device that searches for documents using keywords, comprising: keyword receiving means that receives one or more keywords as input keywords; keyword translation means that obtains, for each input keyword, translation keywords produced by translating that input keyword into a plurality of other languages; keyword score determining means that determines a keyword score for each input keyword, wherein each input keyword corresponds to a plurality of ranked translation keywords, the keyword score determining means determines a translation score according to rank for every combination of an input keyword and a translation keyword and determines a keyword score for each translation keyword from all of the translation scores associated with it, and the keyword score of each input keyword is higher than the keyword score of any translation keyword corresponding to that input keyword; document search means that searches for documents according to the input keywords and the translation keywords and obtains a plurality of search result documents; document score calculating means that calculates a document score for each search result document from the keyword scores; and search result output means that outputs each search result document in association with its document score.
Each input keyword corresponds to a plurality of ranked translation keywords, and the keyword score determining means determines the keyword score of each translation keyword according to its rank.
The keyword score determining means determines a translation score according to rank for every combination of an input keyword and a translation keyword.
The document score calculating means also calculates the document score from the number of times each input keyword and each translation keyword occurs in the search result.
The document score calculating means also calculates the document score from the recognition rate of the character recognition processing applied to the search result document.
The document search system provided by the present invention comprises: the document search device described above; a translation server that generates translation keywords from input keywords; and a document database that stores a plurality of documents to be searched.
The present invention further provides a document search method that searches for documents using keywords, comprising: a keyword receiving step of obtaining one or more keywords as input keywords; a keyword translation step of obtaining translation keywords produced by translating each input keyword into a plurality of other languages; a keyword score determining step of determining a keyword score for each input keyword, wherein each input keyword corresponds to a plurality of ranked translation keywords, a translation score is determined according to rank for every combination of an input keyword and a translation keyword, a keyword score is determined for each translation keyword from all of the translation scores associated with it, and the keyword score of each input keyword is higher than the keyword score of any translation keyword corresponding to that input keyword; a document search step of searching for documents according to the input keywords and the translation keywords and obtaining a plurality of search result documents; a document score calculation step of calculating a document score for each search result document from the keyword scores; and a search result output step of outputting each search result document in association with its document score.
The document search device, document search method, and document search system of the present invention determine a keyword score for each input keyword and each translated keyword and calculate document scores from these keyword scores, so the priority of the documents output as search results can be determined appropriately.
Brief description of the drawings
Fig. 1 is a diagram showing the configuration of a document search system according to the present invention.
Fig. 2 is a flowchart explaining the operation of the document search device in the document search system of Fig. 1.
Fig. 3 is a diagram showing an example of the correspondence between input keywords and translation keywords.
Fig. 4 is a diagram showing an example of the correspondence between the rank of a translation keyword and the translation score assigned to that rank.
Fig. 5 is a diagram showing an example of the correspondence between the rank-based translation scores for each keyword and the keyword score finally assigned to each keyword.
Fig. 6 is a diagram showing an example of information indicating the number of occurrences of each keyword in the body text of the search result documents.
Fig. 7 is a diagram showing an example of the document score calculation results for the search result documents.
Embodiments
When a keyword in one language is entered for a search of a document database containing documents written in various languages such as Japanese, English, French, and Chinese, the present invention uses a translation engine to convert the input keyword into other languages and searches with both the input keyword and the converted keywords. Keywords are scored to establish priorities among them, and these priorities are reflected in the priority of the documents in the search results that are output. This realizes a document search scheme that supports multiple languages.
Embodiments of the present invention are described below with reference to the drawings.
Embodiment 1
Fig. 1 shows the configuration of document search system 100 according to the present invention. Document search system 100 performs document searches using keywords.
Document search system 100 includes document search device 10, which searches for documents using keywords.
Document search device 10 is an information processing device with the well-known configuration of a computer.
Document search device 10 has an input device 30 with which the user enters keywords. Input device 30 is, for example, a mouse or a keyboard. Document search device 10 also has a display device 40 that shows the results of search processing to the user. Display device 40 is, for example, a display or a printer. Document search device 10 further has a processing unit 20 that performs computation. Processing unit 20 is, for example, a CPU (central processing unit).
Although not shown in the figure, document search device 10 also includes, as storage means for storing information, a memory and an HDD (hard disk drive). Document search device 10 further has a network interface for sending information to and receiving information from other information processing devices.
A document search program that defines the operation of document search device 10 and processing unit 20 is installed in the storage means of document search device 10. By executing this document search program, processing unit 20 functions as the keyword receiving means 21, keyword translation means 22, keyword score determining means 23, document search means 24, document score calculating means 25, and search result output means 26 shown in Fig. 1. The detailed function of each is described later.
By executing the document search program or other programs, processing unit 20 also causes document search device 10, as a computer, to realize the other functions described in this specification.
Document search system 100 includes a translation server 110 connected so that it can communicate with document search device 10. Translation server 110 translates keywords: when it receives a word expressed in one language, it translates the word into other languages and outputs the result. That is, from an entered keyword (input keyword) it generates keywords in other languages (translation keywords). Here, "translation" can also be understood as conversion from a keyword in one language to a keyword in another language.
Translation server 110 translates into multiple languages. For example, for a Japanese input keyword it generates and outputs English translation keywords and French translation keywords.
In addition, translation server 110 generates a plurality of ranked translation keywords for one input keyword. That is, for a given word, it sorts the corresponding translations according to how frequently each is used, for example starting from the most frequently used translation, and generates a list of translation keywords. The list may express the rank of each translation keyword by the position of the keyword in the list, or by associating each translation keyword with a numerical value that indicates its rank.
Translation server 110 may have a known configuration. For example, translation server 110 holds a dictionary file in which each of a plurality of words is associated with one or more translations, and translates by referring to this dictionary file.
Document search system 100 includes a document database 120 connected so that it can communicate with document search device 10. Document database 120 stores the documents that are the target of the search processing performed by document search device 10.
Document database 120 receives one or more keywords as input, extracts from the stored documents every document that contains the keywords, and outputs the extracted documents or a list of them.
The operation of document search system 100 configured as described above is explained below using the flowchart of Fig. 2 and the example data of Figs. 3 to 7.
Fig. 2 is a flowchart explaining the operation of document search device 10 in document search system 100. First, keyword receiving means 21 receives from the user, via input device 30, one or more input keywords to be used for the search (step S1, keyword receiving step). In this example, it receives two Japanese input keywords, rendered here as "sir" and "teacher".
Next, keyword translation means 22 uses translation server 110 to translate the input keywords into translation keywords (step S2, keyword translation step). In step S2, keyword translation means 22 sends the input keywords to translation server 110, and translation server 110 generates translation keywords for each input keyword it receives and returns them to keyword translation means 22. Keyword translation means 22 thereby obtains the translation keywords.
Fig. 3 shows an example of the correspondence between input keywords and translation keywords. In this example, the translation keywords cover two languages: English, shown in Fig. 3(a), and French, shown in Fig. 3(b). In the table of Fig. 3(a), the input keyword "sir" has three English translation keywords: "teacher" at rank 1, "instructor" at rank 2, and "master" at rank 3. Translation server 110 thus stores each input keyword in association with a plurality of ranked translation keywords.
In the table of Fig. 3(b), the same input keyword "sir" has two French translation keywords: "professeur" at rank 1 and "instructeur" at rank 2. Keyword translation means 22 thus obtains translation keywords in multiple languages other than the language of the input keyword.
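The Fig. 3 correspondence can be pictured as a small lookup table. The following is a minimal sketch, not the patent's implementation: the dictionary layout, the language codes, and the function name `ranked_translations` are assumptions made for illustration, and only the entries that the text spells out for the input keyword "sir" are included.

```python
# Hypothetical in-memory form of the Fig. 3 tables. Position in each list
# stands for the rank (order) of the translation keyword, starting at 1.
TRANSLATIONS = {
    "sir": {
        "en": ["teacher", "instructor", "master"],  # Fig. 3(a)
        "fr": ["professeur", "instructeur"],        # Fig. 3(b)
    },
}


def ranked_translations(table, input_keyword):
    """Return (language, translation keyword, rank) triples for one input keyword."""
    triples = []
    for language, candidates in table.get(input_keyword, {}).items():
        for rank, translation in enumerate(candidates, start=1):
            triples.append((language, translation, rank))
    return triples


triples = ranked_translations(TRANSLATIONS, "sir")
for language, translation, rank in triples:
    print(language, translation, rank)
```

Encoding rank as list position matches one of the two list representations the text allows; an explicit rank number per entry would work equally well.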
Document search device 10 may also store the input keywords, the translation keywords obtained, and the correspondence shown in Fig. 3 in its storage means, in the form of a table or the like.
Next, keyword score determining means 23 determines a keyword score for each input keyword and each translation keyword (step S3, keyword score determining step). Keyword score determining means 23 determines the keyword scores according to the correspondences shown in Fig. 4 and Fig. 5.
Fig. 4 shows an example of the correspondence between the rank of a translation keyword and the translation score assigned to that rank. Keyword score determining means 23 determines the keyword score of each translation keyword from these translation scores. Document search device 10 stores the correspondence shown in Fig. 4 in its storage means in advance, in the form of a table or the like, and the user or administrator of document search device 10 may change the correspondence as appropriate.
An input keyword is normally given a fixed, predetermined score, for example 100. (As described later, this score serves a somewhat different purpose from a translation score, so it is shown in parentheses in Fig. 4.) A translation keyword is given a translation score that depends on its rank. Each time the rank drops by one, the score decreases by a predetermined amount, for example 10: rank 1 is 90, rank 2 is 80, and rank 3 is 70.
The larger the score, the more important a document containing that keyword is in the search results (that is, the nearer the top of the search results the document appears). Keyword score determining means 23 thus determines the keyword score of a translation keyword according to the rank of the translation keyword.
The relation between rank and translation score is not limited to the one shown in Fig. 4. The translation score for rank 1 only needs to be lower than the keyword score of the corresponding input keyword. The translation scores for rank 2 and below may be given by any function that decreases monotonically as the rank falls (that is, in this example, as the number representing the rank increases).
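The rank-to-score mapping of Fig. 4 can be written as a one-line function. This is a minimal sketch under the example values given in the text (input keyword score 100, step of 10); the function name `translation_score` and the restriction to ranks 1 through 9 are assumptions added here so that every translation score stays above the 0 used for non-translations.

```python
INPUT_KEYWORD_SCORE = 100  # fixed score for input keywords in the example
STEP = 10                  # example decrease per rank (order)


def translation_score(rank):
    """Translation score for a translation keyword at the given rank.

    Assumption: ranks run from 1 to 9, so the result is always positive
    and strictly below the input keyword score.
    """
    if not 1 <= rank <= 9:
        raise ValueError("this sketch only covers ranks 1 through 9")
    return INPUT_KEYWORD_SCORE - STEP * rank


print([translation_score(rank) for rank in (1, 2, 3)])
```

Any other monotonically decreasing function of the rank number could be substituted, provided the rank 1 value stays below the input keyword score.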
Translation server 110 normally ranks translation keywords by how frequently they are used as translations. When information such as sentence structure and context is not taken into account, among the several translations that a dictionary lists for a word, the one used more frequently as a translation can be regarded as the more appropriate translation in practice. A document that contains a more appropriate translation is more likely to be the document the user wants than a document that contains only less appropriate translations. In other words, a translation keyword with a higher rank can be regarded as a more reliable keyword. Because keyword score determining means 23 determines the translation score of each translation keyword according to its rank, more reliable translation keywords receive higher translation scores, and more reliable search results can be obtained.
Translation server 110 does not need to rank each keyword by usage frequency through strict statistical analysis. General dictionaries usually take the usage frequency of translations into account to some extent when deciding the order in which they are listed, so using a well-known general dictionary can improve the precision of the search results to some degree.
Fig. 5 shows an example of the correspondence between the rank-based translation scores for each keyword and the keyword score finally assigned to each keyword.
As described above, keyword score determining means 23 normally gives each input keyword a keyword score of 100. For translation keywords, it first determines a translation score according to rank for every combination of an input keyword and a translation keyword. In Fig. 5, rank-based translation scores are given to all combinations (10 in total) of the two input keywords and the five translation keywords.
As shown in Fig. 3(a), the translation keyword "master" is rank 3 for the input keyword "sir", so it is given the translation score 70 that corresponds to rank 3 in Fig. 4. The same translation keyword "master" is rank 2 for the input keyword "teacher", so it is given the translation score 80 that corresponds to rank 2 in Fig. 4. When a translation keyword has no rank for a given input keyword, that is, when it is not a translation of that input keyword, the translation score for that combination is 0. The translation score in this case does not have to be 0, however; any value smaller than the translation score of every translation keyword that is a translation of that input keyword will do.
From the translation scores determined in this way, keyword score determining means 23 then determines the final keyword score of each translation keyword. In the example of Fig. 5, the average of the translation scores given to a translation keyword is used as its keyword score.
Keyword score determining means 23 thus determines the keyword score of each translation keyword from all of the translation scores associated with it.
Document search device 10 may also store the correspondence shown in Fig. 5 in its storage means, in the form of a table or the like.
As described above, the keyword score given to an input keyword is normally 100. Every translation score is 90 or less (that is, no higher than the translation score for rank 1), so the keyword score of a translation keyword, being their average, is also 90 or less. Consequently, the keyword score given to an input keyword in the user's native language is higher than the keyword score given to any translation keyword in another language.
An input keyword in the user's native language carries no risk of mistranslation or inappropriate translation, so a document that contains an input keyword is more likely to be the document the user wants than a document that contains only translation keywords. In other words, input keywords are the more reliable keywords. Emphasizing the more reliable input keywords with a high score and setting the scores of translation keywords relatively low therefore yields more accurate search results.
When a translation keyword corresponds to more than one input keyword, as "master" does in this example, a document that contains that translation keyword is more likely to be the document the user wants than a document that contains only other translation keywords. In other words, such a translation keyword is a more reliable keyword.
Because keyword score determining means 23 determines a keyword score from all of the translation scores associated with a translation keyword, it raises the keyword score of a translation keyword that corresponds to several input keywords at once. For example, the translation keyword "master" in Fig. 5 corresponds to both input keywords "sir" and "teacher" and has a nonzero translation score for each. The translation keyword "instructor" corresponds to "sir" but not to "teacher", so its translation score for "teacher" is 0. As a result, the keyword score of "master" is higher. Emphasizing the more reliable translation keywords with a high score and setting the scores of the other translation keywords relatively low yields more accurate search results.
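The averaging described for Fig. 5 can be checked with a short calculation. This is a sketch only: the names `RANKS`, `RANK_SCORES`, and `keyword_score` are assumptions, and the table holds just the three (input keyword, translation keyword) ranks that the text states explicitly. A missing pair is treated as "not a translation" and scored 0, as in the example.

```python
RANK_SCORES = {1: 90, 2: 80, 3: 70}   # Fig. 4 example values
INPUT_KEYWORDS = ["sir", "teacher"]

# Ranks (orders) stated in the text; pairs that are absent are not translations.
RANKS = {
    ("sir", "instructor"): 2,
    ("sir", "master"): 3,
    ("teacher", "master"): 2,
}


def keyword_score(translation, input_keywords, ranks, rank_scores):
    """Average the translation scores of one translation keyword over all input keywords."""
    scores = []
    for input_keyword in input_keywords:
        rank = ranks.get((input_keyword, translation))
        scores.append(rank_scores[rank] if rank is not None else 0)
    return sum(scores) / len(scores)


master_score = keyword_score("master", INPUT_KEYWORDS, RANKS, RANK_SCORES)
instructor_score = keyword_score("instructor", INPUT_KEYWORDS, RANKS, RANK_SCORES)
print(master_score, instructor_score)
```

"master" averages 70 and 80, while "instructor" averages 80 and 0, which is why the keyword that translates both input keywords ends up with the higher score even though its individual ranks are no better.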
Next, document search means 24 searches for documents according to the input keywords and the translation keywords, using document database 120 of document search system 100, and obtains a plurality of documents as search result documents (step S4, document search step). In step S4, document search means 24 sends the input keywords and translation keywords to document database 120, and document database 120 extracts from the stored documents every document that contains any of the input keywords or translation keywords and returns the extracted documents to document search means 24 as search result documents.
Because document search means 24 searches with both the input keywords in the user's native language and the translation keywords in other languages, a single search yields results even from a document database 120 that contains documents in multiple languages.
A search result document obtained in step S4 contains information that identifies the document (title, date and time, author, and so on) and need not contain the body text itself. When a search result document does not contain the body text, the body text can be output from document database 120 in response to a separate request from the user.
Each search result document may have associated with it information indicating the number of times each keyword occurs in its body text.
Fig. 6 shows an example of this information. In this example, documents A to J are extracted as search result documents. In document A, for example, the translation keyword "teacher" occurs 12 times, "instructor" occurs 10 times, and "master" occurs 6 times, so the occurrences of all keywords in document A total 28. Document database 120 thus counts, for each search result document, the number of occurrences of each input keyword and each translation keyword, attaches the counts to the corresponding search result document, and returns them to document search means 24. In Fig. 6, the search result documents are sorted by the number of keyword occurrences.
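How the per-keyword counts of Fig. 6 might be produced is not specified in the text. The sketch below shows one simple possibility, assuming whole-word, case-insensitive matching with Python's `re` module; the function name `count_occurrences` and the sample sentence are made up for illustration, and word-boundary matching of this kind would not suit languages written without spaces, such as Japanese.

```python
import re


def count_occurrences(body_text, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword in a text."""
    counts = {}
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        counts[keyword] = len(re.findall(pattern, body_text, flags=re.IGNORECASE))
    return counts


# Made-up sample text, not taken from the patent.
sample_text = "The teacher met another teacher and a master."
counts = count_occurrences(sample_text, ["teacher", "instructor", "master"])
total = sum(counts.values())
print(counts, total)
```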
Document search device 10 may also store the correspondence shown in Fig. 6 in its storage means, in the form of a table or the like.
The example of Fig. 6 uses the raw keyword occurrence counts, but counts adjusted by the recognition rate of character recognition may be used instead.
For a document whose character strings are represented by character codes (text data, word processor data, and so on), the number of keyword occurrences can be computed exactly by comparing character codes. For a document whose character strings are represented as image data, character recognition is needed to convert the image into character codes, and the precision of character recognition is not necessarily high. Therefore, when character recognition is performed, the degree to which the document could be recognized may be evaluated against a predetermined criterion as a recognition rate, and this recognition rate may be factored in. For example, the value representing the keyword occurrence count may be reduced according to the recognition rate. Specifically, for a document with a recognition rate of 100% the occurrence count is used as it is, and for a document with a recognition rate of 50% the occurrence count may be halved.
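The recognition-rate adjustment can be sketched as a simple scaling of the occurrence count. The text gives only two data points (100% keeps the count, 50% halves it); treating the adjustment as linear for every other rate is an assumption of this sketch, as is the function name `effective_count`.

```python
def effective_count(occurrences, recognition_rate):
    """Scale a keyword occurrence count by a recognition rate between 0.0 and 1.0.

    Assumption: the reduction is linear in the recognition rate, which matches
    the two cases named in the text (1.0 keeps the count, 0.5 halves it).
    """
    if not 0.0 <= recognition_rate <= 1.0:
        raise ValueError("recognition_rate must be between 0.0 and 1.0")
    return occurrences * recognition_rate


print(effective_count(12, 1.0), effective_count(12, 0.5))
```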
Any known character recognition method may be used to compute the recognition rate.
Next, document score calculating means 25 calculates a document score for each search result document from the keyword scores determined by keyword score determining means 23 (see Fig. 5) and the number of occurrences of each input keyword and translation keyword (see Fig. 6) (step S5, document score calculation step).
In step S5, for example, the keyword score of each keyword is multiplied by the number of times that keyword occurs in the search result document, and the products are summed over all keywords to give the document score. The document score indicates how likely (how accurately) the search result document is the document the user wants.
Fig. 7 shows an example of results calculated in this way. The translation keyword "teacher", whose keyword score is 90, occurs 12 times in document A, so the product is 90 × 12 = 1080. Likewise, the product for the translation keyword "instructor" is 400 and the product for "master" is 450. The other input keywords and translation keywords do not occur in document A, so their products are 0. The document score of document A is the sum of these values, 1930.
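The document A calculation can be reproduced in a few lines. This is a minimal sketch with assumed names (`document_score`, `KEYWORD_SCORES`, `DOCUMENT_A_COUNTS`). The score of 90 for "teacher" is stated in the text; the scores 40 for "instructor" and 75 for "master" are the values implied by the stated products 400 and 450 together with the stated counts 10 and 6.

```python
# Keyword scores of the three English translation keywords in the example.
KEYWORD_SCORES = {"teacher": 90, "instructor": 40, "master": 75}

# Occurrence counts for document A as given for Fig. 6.
DOCUMENT_A_COUNTS = {"teacher": 12, "instructor": 10, "master": 6}


def document_score(keyword_scores, occurrence_counts):
    """Sum keyword score times occurrence count over all keywords."""
    return sum(
        score * occurrence_counts.get(keyword, 0)
        for keyword, score in keyword_scores.items()
    )


score_a = document_score(KEYWORD_SCORES, DOCUMENT_A_COUNTS)
print(score_a)
```

Keywords that do not occur in a document contribute 0 because their count defaults to 0, mirroring the zero products in the example.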
The document search device 10 may also store the correspondence shown in Fig. 7 in its storage unit in the form of a table or the like.
For a file in which character strings are represented by image data, the document score calculating unit 25 may calculate the document score by taking into account the recognition rate of the character recognition processing of the search result file, in addition to the keyword scores and occurrence counts.
Because the keyword score differs from keyword to keyword, a file with many keyword occurrences does not necessarily have a high document score. For example, among the search result files, file A has the most keyword occurrences (28, see Fig. 6), but file C has the highest document score (2500, see Fig. 7), so their order is reversed. The reason is that the keywords occurring in file C are all input keywords, whose keyword scores are comparatively high, whereas the keywords occurring in file A are all translation keywords, whose keyword scores are comparatively low. In addition, keyword scores also differ among translation keywords, so more reliable translation keywords are given greater weight.
In this way, the document score calculating unit 25 takes into account differences in the quality of individual keywords when calculating the document score of each search result file, so it can evaluate files more accurately than a method that calculates the document score from keyword occurrence counts alone.
Next, the search result output unit 26 outputs each search result file (that is, files A through J) in association with the document score calculated for it by the document score calculating unit 25 (step S6, search result output step). The results are shown to the user on the display device 40, so the user can see the search results. At this time, the search result output unit 26 sorts the search result files in descending order of document score before outputting them.
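The ordering performed in step S6 can be sketched as follows. The scores for file A (1930) and file C (2500) come from the text; the entry for file B is a hypothetical value added only to make the ordering visible, and `rank_results` is an illustrative name.

```python
# File A and file C scores are from the text; file B is a hypothetical value.
document_scores = {"file A": 1930, "file B": 800, "file C": 2500}


def rank_results(scores: dict) -> list:
    """Return (file, document score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


for name, score in rank_results(document_scores):
    print(name, score)
```

With these values, file C is listed before file A even though file A has more keyword occurrences, which matches the reversal described above.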
As described above, the document search device 10 of Embodiment 1 of the present invention, the document retrieval method executed by the document search device 10, and the document retrieval system 100 determine a keyword score for each input keyword and each translation keyword, and calculate document scores from these keyword scores, so the priority of the files output as search results can be determined appropriately.
In Embodiment 1 described above, the input keywords are in Japanese and the translation keywords are in English and French, but other languages may be used; for example, Chinese may be included. The language of the input keywords can be set to match the language the user uses, and the languages of the translation keywords can be set to match the languages of the files contained in the document data bank 120.
The translation keywords may also be in a single language (for example, English only). The translation service device 110 may output a single translation keyword for an input keyword, or may output a plurality of unordered translation keywords. Even with such a configuration, as long as the keyword scores of the input keywords and the translation keywords differ, results more accurate than those of conventional retrieval can be obtained.
In the example of Embodiment 1, an OR search (logical-sum search) is performed: any file in which at least one of the input keywords or translation keywords occurs is obtained as a search result file. Alternatively, an AND search (logical-product search) may be performed.
In this case, in step S4 of Fig. 2, the document retrieval unit 24 sends the input keywords and translation keywords to the document data bank 120 and instructs it to perform an AND search. The document data bank 120 extracts from the stored files all files that satisfy both of the following conditions i and ii, and returns the extracted files to the document retrieval unit 24 as search result files.
Condition i: for the input keyword "sir", at least one of the input keyword itself and its corresponding translation keywords "teacher", "instructor", "master", "professeur", and "instructeur" occurs.
Condition ii: for the input keyword "teacher", at least one of the input keyword itself and its corresponding translation keywords "teacher", "master", and "professeur" occurs.
In other words, for each input keyword, the document retrieval unit 24 and the document data bank 120 form a keyword group by joining the input keyword and its corresponding translation keywords with OR conditions, and then join all the keyword groups with AND conditions to form the final search condition.
When a search is performed with this condition, among the files shown in Fig. 6 as search result files of Embodiment 1, file H, for example, contains neither the input keyword "teacher" nor any of its corresponding translation keywords "teacher", "master", and "professeur"; it therefore does not satisfy condition ii and is not extracted. File J likewise does not satisfy condition ii and is not extracted.
In this example, the translation keywords "teacher", "master", and "professeur" correspond to both input keywords, "sir" and "teacher", so any file in which one of these translation keywords occurs is extracted. For example, file E contains the translation keyword "teacher"; this one translation keyword satisfies both condition i and condition ii, so file E is extracted.
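The AND-of-OR search condition can be sketched as below. The tokens `"IN:sir"` and `"IN:teacher"` are hypothetical stand-ins for the two Japanese input keywords, used so that the input keyword rendered as "teacher" is not confused with the English translation keyword "teacher". That file E contains "teacher" is from the text; the keyword set shown for file H is an assumption that is merely consistent with the text (it contains none of the keywords in the second group).

```python
# One keyword group per input keyword: the input keyword OR any of its translations.
keyword_groups = [
    {"IN:sir", "teacher", "instructor", "master", "professeur", "instructeur"},
    {"IN:teacher", "teacher", "master", "professeur"},
]


def matches_and_of_or(file_keywords: set, groups: list) -> bool:
    """True if the file contains at least one keyword from every group."""
    return all(file_keywords & group for group in groups)


# Keywords occurring in each file; file H's contents are assumed for illustration.
files = {
    "file E": {"teacher"},
    "file H": {"instructor"},
}

extracted = [name for name, keywords in files.items()
             if matches_and_of_or(keywords, keyword_groups)]
print(extracted)  # ['file E']
```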
Even with such an AND search, the processing from step S5 onward can be performed in the same way as for the OR search. That is, as in Embodiment 1, document scores are calculated and the search results are output. However, since files H and J are not extracted in step S4 in this example, the processing from step S5 onward is not performed on files H and J.
In Embodiment 1, the document retrieval unit 24 always uses the translation keywords when searching, but instead the user may, for example, be allowed to specify that the search use only the input keywords without the translation keywords. This makes it possible, when needed, to perform the same processing as a conventional document search that uses only the input keywords.
The document data bank 120 may store, in association with each file to be searched, language information indicating the language in which the file is written. Likewise, the translation service device 110 may store, in association with each translation keyword, language information indicating the language of that translation keyword. In this case, the input keywords normally use a prescribed language corresponding to the user's native language.
For example, a Japanese keyword, when translated into Chinese, sometimes has the identical representation (a character string expressed by the same character codes). For such a keyword, the keyword score of the input keyword can be applied to Japanese files, and the keyword score of the translation keyword can be applied to Chinese files. That is, for keywords among the input keywords and translation keywords that differ in language but are identical in representation, the keyword score whose language matches the language information of the search result file may be adopted when calculating that file's document score.
Thus, even when keywords with identical representations in multiple languages are included, the accuracy of each keyword can be evaluated appropriately.
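One way to picture this language-dependent selection is to key the keyword scores by both the keyword string and its language. The sketch below is illustrative only: the keyword "大学" is an example chosen here because it is written with the same characters in Japanese and Chinese, and the scores 100 and 60, the language codes, and the function name `score_for` are all assumptions.

```python
# (keyword, language) -> keyword score. All values are hypothetical.
keyword_scores = {
    ("大学", "ja"): 100,  # score as an input keyword (Japanese)
    ("大学", "zh"): 60,   # score as a translation keyword (Chinese)
}


def score_for(keyword: str, file_language: str, scores: dict) -> int:
    """Pick the keyword score whose language matches the file's language information.

    Returns 0 when no score is registered for that keyword and language.
    """
    return scores.get((keyword, file_language), 0)


print(score_for("大学", "ja", keyword_scores))
print(score_for("大学", "zh", keyword_scores))
```

Under these assumed values, an occurrence in a Japanese file would be weighted with the input-keyword score and an occurrence in a Chinese file with the translation-keyword score.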
In Embodiment 1, the document data bank 120 counts the number of times each keyword occurs in the search result files, but another component may do the counting. For example, the text data of the search result files may be sent from the document data bank 120 to the document search device 10 and counted by the document retrieval unit 24 or the document score calculating unit 25 of the document search device 10.
The translation service device 110 and the document data bank 120 may be any kind of device, as long as they can exchange the appropriate information on keyword translation and file retrieval with the document search device 10. For example, each may be constituted by a computer and realize its function as the translation service device 110 or the document data bank 120 by executing a program installed in its storage unit. In this case, the program of the document search device 10, the program of the translation service device 110, and the program of the document data bank 120 together serve as a document retrieval program that causes these computers to function as the document retrieval system 100.
In the hardware configuration of Embodiment 1, the document search device 10, as a single computer, comprises the keyword receiving unit 21, the keyword translation unit 22, the keyword score determining unit 23, the document retrieval unit 24, the document score calculating unit 25, and the search result output unit 26, and the translation service device 110 and the document data bank 120 are each provided as a separate computer. The hardware configuration may differ from this, however. For example, the computer constituting the document search device 10 may also have the functions of the translation service device 110 and of the document data bank 120.

Claims (6)

1. A document search device that retrieves files using keywords, comprising:
a keyword receiving unit that receives one or more keywords as input keywords;
a keyword translation unit that obtains, for each of the input keywords, translation keywords in a plurality of other languages into which the input keyword has been translated;
a keyword score determining unit that determines a keyword score for each of the input keywords, wherein
each of the input keywords corresponds to a plurality of translation keywords having an order,
the keyword score determining unit determines, according to the order, a translation score for every combination of each input keyword and each translation keyword,
the keyword score determining unit determines the keyword score for each translation keyword according to all of the associated translation scores, and
the keyword score of each input keyword is higher than the keyword score of any translation keyword corresponding to that input keyword;
a document retrieval unit that retrieves files according to the input keywords and the translation keywords and obtains a plurality of search result files;
a document score calculating unit that calculates a document score for each of the search result files according to the keyword scores; and
a search result output unit that outputs each of the search result files in association with its corresponding document score.
2. The document search device according to claim 1, characterized in that
the keyword score determining unit determines the keyword scores of the translation keywords according to the order.
3. The document search device according to claim 1, characterized in that the document score calculating unit calculates the document score also according to the number of times each of the input keywords and each of the translation keywords occurs in the search result file.
4. The document search device according to claim 3, characterized in that the document score calculating unit calculates the document score also according to a recognition rate of character recognition processing performed on the search result file.
5. A document retrieval system, characterized by comprising:
the document search device according to claim 1;
a translation service device that generates the translation keywords according to the input keywords; and
a document data bank that stores a plurality of the files to be searched.
6. A document retrieval method for retrieving files using keywords, comprising:
a keyword receiving step of obtaining one or more keywords as input keywords;
a keyword translation step of obtaining translation keywords in a plurality of other languages into which the input keywords have been translated;
a keyword score determining step of determining a keyword score for each of the input keywords, wherein
each of the input keywords corresponds to a plurality of translation keywords having an order,
in the keyword score determining step, a translation score is determined, according to the order, for every combination of each input keyword and each translation keyword,
in the keyword score determining step, the keyword score is determined for each translation keyword according to all of the associated translation scores, and
the keyword score of each input keyword is higher than the keyword score of any translation keyword corresponding to that input keyword;
a document retrieval step of retrieving files according to the input keywords and the translation keywords and obtaining a plurality of search result files;
a document score calculation step of calculating a document score for each of the search result files according to the keyword scores; and
a search result output step of outputting each of the search result files in association with its corresponding document score.
CN2009800000314A 2009-03-24 2009-03-24 Document search device, document search system, and document search method Expired - Fee Related CN101933017B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2009/055819 WO2010109594A1 (en) 2009-03-24 2009-03-24 Document search device, document search system, document search program, and document search method

Publications (2)

Publication Number Publication Date
CN101933017A CN101933017A (en) 2010-12-29
CN101933017B true CN101933017B (en) 2013-07-03

Family

ID=42780303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2009800000314A Expired - Fee Related CN101933017B (en) 2009-03-24 2009-03-24 Document search device, document search system, and document search method

Country Status (3)

Country Link
JP (1) JPWO2010109594A1 (en)
CN (1) CN101933017B (en)
WO (1) WO2010109594A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012174741A1 (en) * 2011-06-24 2012-12-27 Google Inc. Determining cross-language query suggestion based on query translations
CN102364469B (en) * 2011-10-09 2016-08-03 北京百度网讯科技有限公司 A kind of method and device that illustrative sentence retrieval result is ranked up
JP5697256B2 (en) * 2011-11-24 2015-04-08 楽天株式会社 SEARCH DEVICE, SEARCH METHOD, SEARCH PROGRAM, AND RECORDING MEDIUM
CN104572642A (en) * 2013-10-10 2015-04-29 腾讯科技(深圳)有限公司 Key word search method and device
CN105389344A (en) * 2015-10-21 2016-03-09 南方电网科学研究院有限责任公司 Self-service novelty retrieval method and system
CN106708808B (en) * 2016-12-14 2020-01-14 东软集团股份有限公司 Information mining method and device
CN111737550B (en) * 2019-03-25 2024-01-23 阿里巴巴集团控股有限公司 Search result processing method and device, storage medium and processor

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH05151253A (en) * 1991-11-29 1993-06-18 Canon Inc Document retrieving device
JPH08212229A (en) * 1995-02-01 1996-08-20 Fuji Xerox Co Ltd Information retrieval device
US5956740A (en) * 1996-10-23 1999-09-21 Iti, Inc. Document searching system for multilingual documents
JPH11154164A (en) * 1997-11-21 1999-06-08 Hitachi Ltd Adaptability calculating method in whole sentence search processing and storage medium storing program related to the same
JP3917349B2 (en) * 2000-05-30 2007-05-23 富士通株式会社 Retrieval device and method for retrieving information using character recognition result
JP3328913B1 (en) * 2001-08-03 2002-09-30 学校法人 慶應義塾 Multilingual document retrieval system
JP2005011260A (en) * 2003-06-20 2005-01-13 Canon Sales Co Inc Document management device, document management system and program for document management
JP4640591B2 (en) * 2005-06-09 2011-03-02 富士ゼロックス株式会社 Document search device

Also Published As

Publication number Publication date
WO2010109594A1 (en) 2010-09-30
CN101933017A (en) 2010-12-29
JPWO2010109594A1 (en) 2012-09-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130703

Termination date: 20210324