WO2012079257A1 - Machine translation apparatus and method - Google Patents

Machine translation apparatus and method

Info

Publication number
WO2012079257A1
Authority
WO
WIPO (PCT)
Prior art keywords
source language
arbitrary
translation
unit
phrase
Prior art date
Application number
PCT/CN2010/079963
Other languages
English (en)
French (fr)
Inventor
徐金安
孟凡东
陈恰
潘栩
达珍
孟庆辰
Original Assignee
北京交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京交通大学
Priority to PCT/CN2010/079963 (WO2012079257A1)
Priority to CN201080070253.6A (CN103314369B)
Publication of WO2012079257A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/55: Rule-based translation

Definitions

  • The present invention relates to the field of machine translation, and in particular to a machine translation apparatus and method.
  • Background
  • Machine translation draws on many disciplines and technologies, including artificial intelligence, mathematics, linguistics, computational linguistics, speech recognition, and speech synthesis, and is therefore comprehensive and interdisciplinary in character.
  • Machine translation systems can be divided into two categories: rule-based and corpus-based. Direct translation methods, transfer methods, and interlingua methods are rule-based translation methods; corpus-based methods can be further classified into memory-based translation methods, instance-based translation methods, neural-network-based translation methods, and statistics-based translation methods.
  • An existing machine translation method proceeds as follows. The system analyzes the source language statement, divides it into words and phrases, and builds a parse tree. Because the words and phrases can be composed in different ways, different parse trees arise; together they form a parse forest for the source language sentence. The machine translation system analyzes the parse trees in the forest one by one and selects the most credible translation among the analysis results as the final translation result.
  • The present invention provides a machine translation apparatus and method.
  • The specific technical solutions are as follows:
  • A machine translation apparatus comprising:
  • a source language input unit for inputting a source language statement;
  • a source language analysis unit configured to perform lexical analysis and syntax analysis on the source language statement to obtain a syntax structure of the source language statement, and to assign attribute features to the nodes in the syntax structure;
  • an arbitrary lattice determination model storage unit configured to store an arbitrary lattice determination model, which provides a model basis for determining whether the source language statement contains an arbitrary lattice;
  • an arbitrary lattice determination unit configured to match the attribute features against the arbitrary lattice determination model and, if they match, to determine that the source language statement contains an arbitrary lattice, and otherwise to determine that it does not;
  • an arbitrary lattice phrase extracting unit configured to obtain the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice obtained by matching;
  • an arbitrary lattice phrase translation unit for performing machine translation on the arbitrary lattice phrase;
  • a first extracting unit configured to acquire the source language remaining statement after the arbitrary lattice phrase is removed;
  • a machine translation unit configured to perform machine translation on the source language remaining statement;
  • a translation result integration unit configured to arrange and combine the translation results of the arbitrary lattice phrase translation unit and the machine translation unit, and to use a combination with a high probability of occurrence as the target language;
  • a target language output unit for outputting the target language.
  • A machine translation method comprising:
  • the arbitrary lattice determination model provides a model basis for determining whether the source language statement contains an arbitrary lattice;
  • the target language is output.
  • FIG. 1 is a block diagram of a machine translation apparatus according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram showing an example of a lexical analysis result according to Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram showing an example of words and their associated grammatical categories according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram showing an exemplary data structure of grammar rules according to Embodiment 1 of the present invention.
  • FIG. 5 is a schematic diagram showing an example of an arbitrary lattice determination model library according to Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram showing an example of a syntax structure analysis result according to Embodiment 1 of the present invention.
  • FIG. 7 is a flowchart of a machine translation method according to Embodiment 2 of the present invention.
  • FIG. 8 is a schematic diagram showing an example of a syntax structure obtained after an arbitrary lattice is extracted, according to Embodiment 2 of the present invention.
  • FIG. 9 is a schematic diagram of a statistical method for parallel corpus segmentation for machine translation according to Embodiment 2 of the present invention.
  • FIG. 10 is a schematic diagram of a training method for a statistics-based machine translation device according to Embodiment 2 of the present invention.
  • This embodiment provides a machine translation device. The device includes: a source language input unit for inputting a source language statement; a source language analysis unit configured to perform lexical analysis and syntax analysis on the source language statement to obtain the syntax structure of the source language statement, and to assign attribute features to the nodes in the syntax structure; an arbitrary lattice determination model storage unit configured to store an arbitrary lattice determination model, which provides a model basis for determining whether the source language statement contains an arbitrary lattice; an arbitrary lattice determination unit configured to match the attribute features against the arbitrary lattice determination model and, if they match, to determine that the source language statement contains an arbitrary lattice, and otherwise to determine that it does not; an arbitrary lattice phrase extracting unit configured to obtain the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice obtained by matching; the arbitrary lattice phrase translation unit is configured to use
  • In this way, an arbitrary lattice in the source language sentence is found, and the source language statement is split into two parts according to the arbitrary lattice; that is, a more complicated sentence is split into two simple statements. The two simple statements are translated separately, the translation results are integrated, and the integrated result with a large combination probability is selected as the translation result. This reduces the complexity of the syntactic structure of the source language, improves the efficiency of generating the sentence structure and grammar of the target language, improves translation accuracy, and appropriately reduces the amount of computation for machine translation decoding, providing an effective device and method for machine translation research.
  • FIG. 1 shows a machine translation apparatus 100 according to Embodiment 1 of the present invention.
  • The apparatus includes: a source language input unit 101, a source language analysis unit 102, an arbitrary lattice determination model storage unit 103, and an arbitrary lattice determination unit 104.
  • The source language input unit 101 may be any general-purpose input module or input device, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a voice recognition device, and an input device in the form of a text file or a database.
  • The input source language statement is stored in the computer memory or a buffer.
  • The source language analysis unit 102 is configured to perform lexical analysis on the source language sentence input by the source language input unit 101 to obtain the word sequence of the source language sentence, to perform syntactic analysis on the word sequence to obtain the syntactic structure of the source language sentence, to assign attribute features to the nodes in the syntactic structure, and to output the result to the arbitrary lattice determination unit 104.
  • Any general lexical analysis technique can be used for the lexical analysis of source language sentences. One example is to maximize the division probability by dynamic programming using a word division model: the source language statement is divided into words according to the word division model using dynamic programming, and the most probable division is selected as the final output word sequence.
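  • The dynamic-programming division described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a unigram word division model, and the table `WORD_PROBS`, its probabilities, and the function name `segment` are all hypothetical. An unspaced English string stands in for the source language text.

```python
import math

# Hypothetical unigram word-division model: word -> probability.
# The entries and values are invented purely for illustration.
WORD_PROBS = {
    "he": 0.2, "goes": 0.1, "go": 0.1, "es": 0.01,
    "to": 0.2, "school": 0.1, "sch": 0.005, "ool": 0.005,
}


def segment(text, word_probs, max_len=8):
    """Return the most probable division of `text` into known words, or None."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no division into known words exists
    # Follow the back-pointers from the end of the text to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("hegoestoschool", WORD_PROBS))
```

  • Because the score of a prefix depends only on the best score of a shorter prefix plus one word, every candidate division is covered without enumerating them all; a real word division model would typically use richer statistics than the unigram table assumed here.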
  • Existing lexical analysis tools can also be used to perform lexical analysis on the input source language statements, including the Stanford Parser, the ICTCLAS analysis system of the Institute of Computing Technology, ChaSen, etc.
  • Any syntactic analysis method, such as chart parsing or generalized LR parsing, can be used for the syntactic analysis of source language statements.
  • Existing syntactic analysis tools can also be used for syntactic analysis, including CaboCha and KNP for Japanese.
  • In the word sequence 202, the symbol "." marks the breakpoint between one word and the next.
  • The breakpoint marker is not fixed; a space can be used instead.
  • The lexical dictionary and the preset grammar rules are used to assign attribute features to the nodes in the syntactic structure; the syntactic structure includes the grammatical categories of the corresponding words and the nodes associated with each of them. FIG. 3 shows an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
  • The vocabulary dictionary contains words and the grammatical categories associated with them. For example, the Japanese word 301 "Peace" is associated with the grammatical category Pron. (pronoun). Besides Pron., the grammatical categories of the vocabulary include V (verb), P (auxiliary word), N (noun), etc.
  • A predetermined grammar rule is given, in which the grammatical category on the left of the arrow is composed of the grammatical categories 1 and 2 on the right of the arrow.
  • For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP and VP). The source language analysis unit 102 refers to these grammar rules when performing syntactic analysis on the source language sentence.
  • For example, when the source language analysis unit 102 analyzes the syntactic structure of the Chinese sentence meaning "I am Chinese", it can determine that "I" is the subject of the sentence, "am" is the predicate, and "Chinese" is the object.
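  • As a rough illustration of how rules of the form "category → category 1, category 2" drive syntactic analysis, the sketch below runs a small CKY-style chart recognizer. It is only a sketch under stated assumptions: the contents of FIG. 3 and FIG. 4 are not reproduced, so `LEXICON`, `RULES`, and the function name `cky_categories` are hypothetical, and English glosses stand in for the Chinese words.

```python
from itertools import product

# Hypothetical toy dictionary (word -> grammatical categories) and binary
# grammar rules ((category 1, category 2) -> categories on the arrow's left).
LEXICON = {"I": {"NP"}, "am": {"V"}, "Chinese": {"NP"}}
RULES = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}


def cky_categories(words, lexicon, rules):
    """Return the set of categories that can span the whole word sequence."""
    n = len(words)
    # chart[i][j] holds every category that can cover words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point between the two parts
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((left, right), set())
    return chart[0][n]


print(cky_categories(["I", "am", "Chinese"], LEXICON, RULES))
```

  • With these toy rules the verb and the object first combine into VP, and the subject NP then combines with VP into S, which mirrors the subject, predicate, and object analysis described above.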
  • The source language analysis unit 102 can also assign attribute features such as part of speech, semantics, and concept to the words in the word sequence by referring to a semantic class dictionary, such as Japanese WordNet, the Japanese word series, or the EDR electronic dictionary.
  • For example, the component "he/pronoun" in the above input sentence can be given the attribute feature "person"; "library" can be given the attribute feature "place" or "building"; and "bicycle" can be given the attribute feature "traffic agency (vehicle)", and so on.
  • The semantic dictionary, the vocabulary dictionary, and the grammar rules are all stored in the source language analysis unit 102 in advance.
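  • The attribute assignment described above amounts to dictionary lookups per word. The following is a minimal sketch, assuming both dictionaries are plain mappings; `LEXICAL_DICT`, `SEMANTIC_DICT`, their entries, and `assign_attributes` are hypothetical names, and English glosses stand in for the Japanese words.

```python
# Hypothetical stand-ins for the vocabulary dictionary (word -> grammatical
# category) and the semantic class dictionary (word -> semantic class).
LEXICAL_DICT = {"he": "Pron", "library": "N", "bicycle": "N", "go": "V"}
SEMANTIC_DICT = {"he": "person", "library": "place", "bicycle": "vehicle"}


def assign_attributes(words, lexical_dict, semantic_dict):
    """Attach surface form, part of speech, and semantic class to each word."""
    return [
        {
            "surface": word,
            "pos": lexical_dict.get(word),             # None if unknown
            "semantic_class": semantic_dict.get(word),  # None if unknown
        }
        for word in words
    ]


for node in assign_attributes(["he", "library", "bicycle", "go"],
                              LEXICAL_DICT, SEMANTIC_DICT):
    print(node)
```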
  • The arbitrary lattice determination model storage unit 103 is configured to store an arbitrary lattice determination model. Each model consists of a number, the surface form of a word (the word itself), its part of speech, the semantic classification of the word, and a lattice auxiliary word. The arbitrary lattice determination model is a knowledge base whose main function is to provide a basis for determining whether the input source language statement contains an arbitrary lattice.
  • The arbitrary lattice determination model may consist of manually written rules, or may be extracted from learning data by statistical methods following machine learning principles. Many machine learning methods are available and may be selected as needed, for example SVM (support vector machine) or decision tree. The invention does not limit the specific implementation of the arbitrary lattice determination model.
  • The arbitrary lattice determination unit 104 is configured to extract the node attribute features in the data structure from the source language analysis unit 102 and to match the extracted attribute features against the arbitrary lattice determination model stored in the arbitrary lattice determination model storage unit 103. If they match, it is determined that the source language statement contains an arbitrary lattice; if not, it is determined that it does not.
  • FIG. 5 is a schematic diagram of an example of the arbitrary lattice determination model library provided by this embodiment of the present invention.
  • Each arbitrary lattice determination model in the library consists of a number, the surface form of the word (the word itself), the part of speech, the semantic classification of the word, and a lattice auxiliary word.
  • When the arbitrary lattice determination unit 104 matches the node attribute features extracted from the data structure of the source language analysis unit 102 against the arbitrary lattice determination models in the library shown in FIG. 5, it can use any of the patterns [surface form + lattice auxiliary word], [semantic classification + lattice auxiliary word], [surface form + part of speech + lattice auxiliary word], or [surface form + part of speech + semantic classification + lattice auxiliary word] to determine whether the source language statement contains an arbitrary lattice.
  • For example, for the source language statement meaning "He goes to the library by bicycle", the features of [bicycle] and of the lattice auxiliary word that follows it can first be extracted and then matched against the arbitrary lattice determination models in the library shown in FIG. 5. The matching can take various forms. When the attributes of [bicycle] contain only its part of speech (noun), the surface form [bicycle] and the lattice auxiliary word are used as the feature vector for pattern matching against the library. When the attributes of [bicycle] contain both the part of speech (noun) and the semantic attribute [traffic agency (vehicle)], the feature composed of [traffic agency (vehicle)] and the lattice auxiliary word alone can be used for pattern matching. Both methods match the model numbered 2 in FIG. 5, so the lattice auxiliary word in the [bicycle] phrase is judged to be an arbitrary lattice.
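  • One way to picture the pattern matching against the model library is a lookup in which every field a model leaves unspecified acts as a wildcard. The sketch below follows that assumption; it is not the patent's implementation. The classes `LatticeModel` and `Node`, the library entries, and the English stand-ins for lattice auxiliary words ("by", "with") are all hypothetical, since the contents of FIG. 5 are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LatticeModel:
    number: int
    particle: str                          # lattice auxiliary word (always required)
    surface: Optional[str] = None          # None means "not constrained"
    pos: Optional[str] = None
    semantic_class: Optional[str] = None


@dataclass(frozen=True)
class Node:
    surface: str
    pos: str
    semantic_class: str
    particle: str


# Hypothetical library: model 1 is a [surface + lattice auxiliary word]
# pattern, model 2 a [semantic classification + lattice auxiliary word] pattern.
MODEL_LIBRARY = [
    LatticeModel(1, particle="with", surface="friend"),
    LatticeModel(2, particle="by", semantic_class="vehicle"),
]


def match_model(node, library):
    """Return the number of the first matching model, or None if none match."""
    for model in library:
        if model.particle != node.particle:
            continue
        constraints = [
            (model.surface, node.surface),
            (model.pos, node.pos),
            (model.semantic_class, node.semantic_class),
        ]
        if all(want is None or want == have for want, have in constraints):
            return model.number
    return None


print(match_model(Node("bicycle", "noun", "vehicle", "by"), MODEL_LIBRARY))
```

  • Treating unspecified fields as wildcards lets a single library hold all four pattern shapes listed above; a learned classifier such as an SVM or a decision tree could replace the lookup without changing the surrounding flow.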
  • The arbitrary lattice determination unit 104 includes an extracting module 1041, a reading module 1042, and a matching module 1043.
  • The extracting module 1041 is configured to extract the attribute features from the source language analysis unit 102; the attribute features include part of speech, word meaning, concept, and the like.
  • Specifically, the attribute features of the nouns, lattice auxiliary words, and predicates such as verbs in the sentence are extracted as the attribute features for arbitrary lattice determination of the source language sentence.
  • For example, the input source language statement is the sentence meaning "He goes to the library by bicycle".
  • The matching determination module 1042 matches the attribute features of the extracted syntax structure nodes against the arbitrary lattice determination model stored in the arbitrary lattice determination model storage unit 103. If they match, it determines that the source language sentence contains an arbitrary lattice; if not, it determines that the source language sentence contains no arbitrary lattice.
  • The arbitrary lattice phrase extracting unit 105 is configured, when the arbitrary lattice determination unit 104 determines that the source language sentence contains an arbitrary lattice, to extract the node string associated with the arbitrary lattice from the syntactic structure as an arbitrary lattice phrase, and to output the extracted arbitrary lattice phrase to the arbitrary lattice phrase translation unit 106.
  • FIG. 6 depicts the syntactic analysis result of the input sentence.
  • The arbitrary lattice phrase translation unit 106 is configured to perform machine translation on the extracted arbitrary lattice phrase and to output the translation result to the translation result integration unit 109.
  • The translation method for this part is flexible and can take various forms, such as using a translation dictionary dedicated to arbitrary lattice phrases, or translating arbitrary lattice phrases with a rule-based translation method; instance-based or statistics-based machine translation methods can of course also be used.
  • The first extracting unit 107 is configured to extract from the syntax structure the source language remaining statement after the arbitrary lattice phrase is removed, and to output it to the machine translation unit 108.
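  • The split of a sentence into an arbitrary lattice phrase and a remaining statement can be sketched as below. The representation is an assumption made for illustration only: the syntax structure is modeled as a list of chunks plus, for each chunk, the index of the chunk it depends on (`heads`), and the node string associated with the arbitrary lattice is taken to be the marked chunk together with everything that depends on it. The function name and the English chunks are hypothetical.

```python
def split_by_arbitrary_lattice(chunks, heads, target):
    """Split chunks into (arbitrary lattice phrase, remaining statement).

    chunks: chunk strings in sentence order
    heads:  heads[i] is the index of the chunk that chunk i depends on
            (-1 for the root chunk)
    target: index of the chunk judged to contain the arbitrary lattice
    """
    selected = {target}
    changed = True
    while changed:  # collect all chunks that depend, directly or not, on target
        changed = False
        for i, head in enumerate(heads):
            if head in selected and i not in selected:
                selected.add(i)
                changed = True
    phrase = [c for i, c in enumerate(chunks) if i in selected]
    remaining = [c for i, c in enumerate(chunks) if i not in selected]
    return phrase, remaining


# English glosses in Japanese-like chunk order; "new" modifies "by bicycle".
chunks = ["he", "to the library", "new", "by bicycle", "goes"]
heads = [4, 4, 3, 4, -1]
print(split_by_arbitrary_lattice(chunks, heads, target=3))
```

  • Each of the two parts is a shorter, simpler statement than the original, which is what allows them to be translated separately and recombined later.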
  • The machine translation unit 108 is configured to perform machine translation on the statement transmitted by the first extracting unit 107 and to output the translation result to the translation result integration unit 109.
  • The machine translation unit 108 is further configured, when the arbitrary lattice determination unit 104 determines that the analysis result of the source language analysis unit 102 contains no arbitrary lattice phrase, to perform machine translation directly on the input source language statement and to output the translation result.
  • The machine translation unit 108 may translate incoming statements with a rule-based machine translation system, an instance-based machine translation system, or a statistics-based machine translation system.
  • The translation result integration unit 109 is configured to receive the translation result of the arbitrary lattice phrase translation unit 106 and the translation result of the machine translation unit 108, to integrate the two results into a complete target language sentence, and to output the generated target language sentence to the target language output unit 110.
  • The translation result integration unit 109 includes a translation result integration module 1091 and an integration comparison module 1092. The translation result integration module 1091 is configured to arrange and combine the translation result of the arbitrary lattice phrase translation unit 106 and the translation result of the machine translation unit 108; specifically, it may order the two parts using a language model of the target language.
  • The integration comparison module 1092 is configured to compare the probabilities of occurrence of the integration results produced by the translation result integration module 1091, and to output the integration result with a high probability of occurrence to the target language output unit 110.
  • The target language output unit 110 is configured to receive and output the target language sentence generated by the translation result integration unit 109.
  • The target language sentence can be output in several ways, such as file output or display output.
  • For example, the output is displayed on a display device in the form of an image, printed by a printer, or synthesized into speech by a speech synthesizer. These output systems can be switched between or used simultaneously as needed.
  • This embodiment provides a machine translation method, the method comprising: inputting a source language statement; performing lexical analysis and syntax analysis on the source language statement to obtain a syntax structure of the source language statement, and assigning attribute features to the nodes in the syntax structure; matching the attribute features against the stored arbitrary lattice determination model and, if they match, determining that the source language statement contains an arbitrary lattice, and otherwise determining that it does not, wherein the arbitrary lattice determination model provides a model basis for determining whether the source language statement contains an arbitrary lattice; obtaining the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice obtained by matching, and performing machine translation on the arbitrary lattice phrase; obtaining the source language remaining statement after the arbitrary lattice phrase is removed, and performing machine translation on the source language remaining statement; arranging and combining the translation result of the arbitrary lattice phrase and the translation result of the source language remaining statement, A combination with a high probability
  • Step S01: input the source language statement and store it in the computer's memory or a buffer. As needed, various input devices can be used to input the source language statement, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a voice recognition device, and an input device in the form of a text file or a database.
  • In this embodiment, the input source language sentence is a Japanese sentence meaning "He goes to the library by bicycle", and the target language is Chinese, as an example.
  • The translation method of the present invention is not limited to Japanese-to-Chinese translation.
  • Step S02: perform lexical analysis on the source language statement to obtain the word sequence of the source language sentence; perform syntax analysis on the word sequence to obtain the syntax structure of the source language sentence; assign attribute features to the nodes in the syntax structure; and output the attribute features and the syntax structure as the analysis result.
  • Any general lexical analysis technique can be used for the lexical analysis of source language sentences. One example is to maximize the division probability by dynamic programming using a word division model: the source language statement is divided into words according to the word division model using dynamic programming, and the most probable division is selected as the final output word sequence.
  • Existing lexical analysis tools can also be used to perform lexical analysis on the input source language sentences, including the Stanford Parser, the ICTCLAS analysis system of the Chinese Academy of Sciences, ChaSen, etc.
  • Any syntactic analysis method, such as chart parsing or generalized LR parsing, can be used for the syntactic analysis of source language statements.
  • Existing syntactic analysis tools can also be used for syntactic analysis, including CaboCha and KNP for Japanese.
  • In the word sequence 202, the symbol "." marks the breakpoint between one word and the next.
  • The breakpoint marker is not fixed; a space can be used instead.
  • The lexical dictionary and the preset grammar rules are used to assign attribute features to the nodes in the syntax structure; the syntax structure includes the grammatical categories of the corresponding words and the nodes associated with each of them. FIG. 3 shows an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
  • The vocabulary dictionary contains words and the grammatical categories associated with them. For example, the Japanese word 301 "Peace" is associated with the grammatical category Pron. (pronoun). Besides Pron., the grammatical categories of the vocabulary include V (verb), P (auxiliary word), N (noun), etc.
  • A predetermined grammar rule is given, in which the grammatical category on the left of the arrow is composed of the grammatical categories 1 and 2 on the right of the arrow.
  • For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP and VP). The source language analysis unit 102 refers to these grammar rules when performing syntactic analysis on the source language sentence.
  • For example, when the source language analysis unit 102 analyzes the syntactic structure of the Chinese sentence meaning "I am Chinese", it can determine that "I" is the subject of the sentence, "am" is the predicate, and "Chinese" is the object.
  • The semantic dictionary, the vocabulary dictionary, and the grammar rules are all stored in the source language analysis unit 102 in advance.
  • Step S03: extract attribute features, such as words, parts of speech, semantic classifications, and concepts, from the analysis result. Specifically, the attribute features of the nouns, lattice auxiliary words, and predicates such as verbs in the sentence are extracted as the attribute features of the source language statement.
  • For example, for the input source language statement meaning "He goes to the library by bicycle", the nouns with their lattice auxiliary words, the predicate, and the other components of the sentence, together with information such as surface form, part of speech, and semantic classification of the words, are used as the attribute features for arbitrary lattice determination.
  • Step S04: match the attribute features of the extracted syntax structure nodes against the stored arbitrary lattice determination model. If they match, determine that the source language statement contains an arbitrary lattice and execute S05; if they do not match, determine that the source language statement contains no arbitrary lattice and execute S08.
  • The arbitrary lattice determination model consists of a number, the surface form of the word (the word itself), the part of speech, the semantic classification of the word, and a lattice auxiliary word. It is a kind of knowledge base whose main function is to provide a basis for determining whether the input source language statement contains an arbitrary lattice.
  • The arbitrary lattice determination model may consist of manually written rules, or may be extracted from learning data by statistical methods following machine learning principles. Many machine learning methods are available and may be selected as needed, for example SVM (support vector machine) or decision tree. The present invention does not limit the specific implementation of the arbitrary lattice determination model.
  • Matching the attribute features of the extracted syntax structure nodes against the stored arbitrary lattice determination model includes matching the extracted attribute features against the arbitrary lattice determination models in the library shown in FIG. 5. Any of the patterns [surface form + lattice auxiliary word], [semantic classification + lattice auxiliary word], [surface form + part of speech + lattice auxiliary word], or [surface form + part of speech + semantic classification + lattice auxiliary word] can be used for pattern matching against the node attribute features extracted from the data structure of the source language analysis unit 102, to determine whether the source language statement contains an arbitrary lattice.
  • For example, for the source language statement meaning "He goes to the library by bicycle", the features of [bicycle] and of the lattice auxiliary word that follows it can first be extracted and then matched against the arbitrary lattice determination models in the library shown in FIG. 5. The matching can take various forms. When the attributes of [bicycle] contain only its part of speech (noun), the surface form [bicycle] and the lattice auxiliary word are used as the feature vector for pattern matching against the library. When the attributes of [bicycle] contain both the part of speech (noun) and the semantic attribute [traffic agency (vehicle)], the feature composed of [traffic agency (vehicle)] and the lattice auxiliary word alone can be used for pattern matching. Both methods match the model numbered 2 in FIG. 5, so the lattice auxiliary word in the [bicycle] phrase is judged to be an arbitrary lattice.
  • Step S05: extract the node string associated with the arbitrary lattice from the syntax structure; perform step S06 on the extracted arbitrary lattice phrase, and perform step S07 on the remaining portion after the arbitrary lattice phrase is removed.
  • FIG. 6 depicts the syntactic analysis result of the input sentence.
  • Step S06: perform machine translation on the extracted arbitrary lattice phrase, then execute step S08.
  • The translation method for this part is flexible and can take various forms: corresponding phrase pairs can be extracted from a large-scale corpus to build a dedicated translation dictionary, or a rule-based translation method can be used to translate arbitrary lattice phrases; instance-based or statistics-based machine translation methods can of course also be used.
  • Step S07: perform machine translation on the remaining part of the source language sentence after the arbitrary lattice phrase is removed. Specifically, this includes arranging and combining the remaining sentence components of the source language after the arbitrary lattice phrase is removed, and performing machine translation on the combination with the highest probability of occurrence.
  • The machine translation method in this step is not specifically limited; it may be a rule-based machine translation system, an instance-based machine translation system, or a statistics-based machine translation system.
  • For an instance-based translation system, the translation of a string is based on examples, and the similarity between the string and the example is used as the translation score. For a statistics-based translation system, the translation of the string is based on the language model, and the translation probability given by the translation model is used as the translation score. For a rule-based translation system, the translation of the string is based on the syntax and the rules adopted, and the translation score is obtained from the credibility of the syntax and the preference of the rules.
  • Step S08: integrate the translation results of steps S06 and S07;
  • the two translation results are permuted and combined, and the combination with a high probability of occurrence is selected as the integration result and output.
  • the function of the integration step S08 is to integrate the translation results of steps S06 and S07. If the Japanese-to-Chinese translation yields the two parts "他 去 图书馆" ("he goes to the library") and "骑自行车" ("by bicycle"), a language model of the target language can be used to rank the arrangements of these two parts. Provided that the quality and size of the Chinese corpus used to build the language model are adequate, "他骑自行车去图书馆" ("he goes to the library by bicycle") can be computed to have the highest probability. The result of step S08 is then passed to the target language output step S09.
  • Step S09: output the integration result obtained in step S08 as the final target-language sentence;
  • the output can take various forms, such as a display, a text file, or speech; for example, the result may be shown as an image on a display device, printed by a printer, or synthesized by a speech synthesizer. These systems may be switched between, or used together, as needed.
  • FIG. 9 is a schematic diagram of a parallel corpus splitting method for statistical machine translation in an embodiment of the present invention. As shown in FIG. 9, the splitting is performed mainly by the parallel corpus splitting unit 210, which can use the optional-case decision model to judge the sentences in the corpus. This readily yields two parts, sentences without an optional case and sentences with one, completing the split of the original parallel corpus.
  • the purpose of this processing is that, when the translation model and the language model for statistical machine translation are built, the two parts of the corpus can be used flexibly as needed.
  • FIG. 10 is a schematic diagram of a training method of a statistical machine translation apparatus according to an embodiment of the present invention.
  • the function of the language model / translation model construction unit 310 in this training method is to build the translation model and the language model, for which conventional tools such as GIZA++ and SRILM can be used.
  • FIG. 11 is a schematic diagram of a training method of a statistical machine translation apparatus according to an embodiment of the present invention.
  • it differs from the training method shown in FIG. 10 in that the training corpus is a source-target parallel corpus from which optional-case phrases have been removed.
  • through lexical and syntactic analysis of the source-language sentence, the optional case in the sentence is found and the sentence is split into two parts according to it, that is, a relatively complex sentence is split into two simple sentences; the two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and reduces the amount of computation in machine-translation decoding, providing an effective apparatus and method for machine translation research.
  • All or part of the technical solutions provided by the above embodiments may be implemented in software, with the software program stored in a readable storage medium such as a computer hard disk, an optical disc, or a floppy disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Description

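Machine Translation Apparatus and Method
Technical Field
The present invention relates to the field of machine translation, and in particular to a machine translation apparatus and method.
Background Art
As an applied technology of natural language processing, machine translation draws on many disciplines and technologies, including artificial intelligence, mathematics, linguistics, computational linguistics, speech recognition, and speech synthesis, and is strongly comprehensive and interdisciplinary.
At present, machine translation systems fall into two broad categories: rule-based and corpus-based. Direct translation, transfer-based, and interlingua methods are rule-based translation methods; corpus-based methods can be further divided into memory-based, example-based, neural-network-based, and statistical translation methods, among others.
An existing machine translation method proceeds as follows. The source-language sentence is parsed, that is, divided into words and phrases, and a parse tree is built. Different groupings of words and phrases yield different parse trees, which together form a parse forest for the sentence. The machine translation system analyzes the parse trees in the forest one by one and selects the translation with the highest confidence from the analysis results as the final translation.
However, building the parse trees is complicated and many trees may exist, so machine-translation decoding requires a large amount of computation, translation takes a long time, many candidate translations are produced, and translation accuracy is hard to guarantee.
Summary of the Invention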
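In view of the above technical problems, and in order to improve the efficiency and accuracy of machine translation, the present invention provides a machine translation apparatus and method. The technical solution is as follows:
A machine translation apparatus, comprising:
a source language input unit, configured to input a source-language sentence;
a source language analysis unit, configured to perform lexical analysis and syntactic analysis on the source-language sentence to obtain its syntactic structure, and to assign attribute features to the nodes of the syntactic structure;
an optional-case decision model storage unit, configured to store an optional-case (任意格) decision model, the model providing the basis for deciding whether the source-language sentence contains an optional case;
an optional-case decision unit, configured to match the attribute features against the optional-case decision model, and to decide that the source-language sentence contains an optional case if they match and that it does not if they do not match;
an optional-case phrase extraction unit, configured to obtain the optional-case phrase in the syntactic structure according to the optional case found by the matching;
an optional-case phrase translation unit, configured to machine-translate the optional-case phrase;
a first extraction unit, configured to obtain the remaining source-language sentence after the optional-case phrase has been removed;
a machine translation unit, configured to machine-translate the remaining source-language sentence;
a translation result integration unit, configured to permute and combine the translation results of the optional-case phrase translation unit and the machine translation unit, and to take a combination with a high probability of occurrence as the target language;
a target language output unit, configured to output the target language.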
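A machine translation method, comprising:
inputting a source-language sentence;
performing lexical analysis and syntactic analysis on the source-language sentence to obtain its syntactic structure, and assigning attribute features to the nodes of the syntactic structure;
matching the attribute features against a stored optional-case decision model, and deciding that the source-language sentence contains an optional case if they match and that it does not if they do not match, wherein the optional-case decision model provides the basis for deciding whether the source-language sentence contains an optional case;
obtaining the optional-case phrase in the syntactic structure according to the optional case found by the matching, and machine-translating the optional-case phrase;
obtaining the remaining source-language sentence after the optional-case phrase has been removed, and machine-translating the remaining source-language sentence;
permuting and combining the translation results of the optional-case phrase and of the remaining source-language sentence, and taking a combination with a high probability of occurrence as the target language;
outputting the target language.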
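The technical solution provided by the embodiments of the present invention has the following beneficial effects:
By analyzing the special grammar in the source-language sentence, the optional case in the sentence is found and the sentence is split into two parts according to it; that is, a relatively complex sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and reduces the amount of computation in machine-translation decoding.
Brief Description of the Drawings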
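FIG. 1 is a block diagram of a machine translation apparatus provided in Embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of an example lexical analysis result provided in Embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of an example of mutually associated words and grammatical categories provided in Embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of an example data structure of the grammar rules provided in Embodiment 1 of the present invention;
FIG. 5 is a schematic example of the optional-case decision model library provided in Embodiment 1 of the present invention;
FIG. 6 is a schematic example of a syntactic structure analysis result provided in Embodiment 1 of the present invention;
FIG. 7 is a flowchart of a machine translation method provided in Embodiment 2 of the present invention;
FIG. 8 is a schematic example of the syntactic structure obtained after the optional case is extracted, provided in Embodiment 2 of the present invention;
FIG. 9 is a schematic diagram of a parallel corpus splitting method for statistical machine translation provided in Embodiment 2 of the present invention;
FIG. 10 is a schematic diagram of a training method of a statistical machine translation apparatus provided in Embodiment 2 of the present invention;
FIG. 11 is a schematic diagram of a training method of a statistical machine translation apparatus provided in Embodiment 2 of the present invention.
Detailed Description of the Embodiments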
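To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Embodiment 1
This embodiment provides a machine translation apparatus comprising: a source language input unit, configured to input a source-language sentence; a source language analysis unit, configured to perform lexical analysis and syntactic analysis on the source-language sentence to obtain its syntactic structure, and to assign attribute features to the nodes of the syntactic structure; an optional-case decision model storage unit, configured to store an optional-case decision model that provides the basis for deciding whether the source-language sentence contains an optional case; an optional-case decision unit, configured to match the attribute features against the optional-case decision model, and to decide that the sentence contains an optional case if they match and that it does not if they do not match; an optional-case phrase extraction unit, configured to obtain the optional-case phrase in the syntactic structure according to the optional case found by the matching; an optional-case phrase translation unit, configured to machine-translate the optional-case phrase; a first extraction unit, configured to obtain the remaining source-language sentence after the optional-case phrase has been removed; a machine translation unit, configured to machine-translate the remaining source-language sentence; a translation result integration unit, configured to permute and combine the translation results of the optional-case phrase translation unit and the machine translation unit, and to take a combination with a high probability of occurrence as the target language; and a target language output unit, configured to output the target language.
By performing lexical and syntactic analysis on the source-language sentence, this embodiment finds the optional case in the sentence and splits the sentence into two parts according to it; that is, a relatively complex sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and appropriately reduces the amount of computation in machine-translation decoding, providing an effective apparatus and method for machine translation research.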
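Referring to FIG. 1, FIG. 1 shows a machine translation apparatus 100 provided in Embodiment 1 of the present invention. The apparatus comprises: a source language input unit 101, a source language analysis unit 102, an optional-case decision model storage unit 103, an optional-case decision unit 104, an optional-case phrase extraction unit 105, an optional-case phrase translation unit 106, a first extraction unit 107, a machine translation unit 108, a translation result integration unit 109, and a target language output unit 110. The function of each unit is described in detail below.
The source language input unit 101 is configured to input a source-language sentence.
Specifically, this unit may be any general-purpose input module or input device, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a speech recognition device, and input devices in the form of text files or databases.
It should be noted that the input source-language sentence is stored in computer memory or a buffer.
The source language analysis unit 102 is configured to perform lexical analysis on the source-language sentence input through the source language input unit 101 to obtain its word sequence, to perform syntactic analysis on that word sequence to obtain the syntactic structure of the sentence, to assign attribute features to the nodes of the syntactic structure, and to output them to the optional-case decision unit 104.
Specifically, any general lexical analysis technique may be used for the lexical analysis, for example a method that maximizes the segmentation probability by dynamic programming under a word segmentation model: the source-language sentence is segmented by dynamic programming according to the model, and the segmentation with the highest probability is selected as the output word sequence.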
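The dynamic-programming segmentation mentioned above can be sketched as follows. This is only a minimal illustration that assumes a unigram word model; the function name `segment`, the table `TOY_PROBS`, and all probabilities in it are hypothetical and are not taken from the patent.

```python
import math


def segment(sentence, word_probs, max_len=4):
    """Return the most probable segmentation of `sentence` under a unigram model."""
    n = len(sentence)
    # best[i] = (best log-probability of sentence[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word in word_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_probs[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("sentence cannot be segmented with this vocabulary")
    # Follow the back-pointers from the end of the sentence to recover the words.
    words, pos = [], n
    while pos > 0:
        start = best[pos][1]
        words.append(sentence[start:pos])
        pos = start
    return words[::-1]


# Hypothetical unigram probabilities, for illustration only.
TOY_PROBS = {
    "彼": 0.1, "は": 0.2, "図書館": 0.05, "図書": 0.02, "館": 0.01,
    "へ": 0.1, "自転車": 0.05, "自": 0.01, "転": 0.005, "車": 0.02,
    "で": 0.1, "行く": 0.05, "行": 0.01, "く": 0.01,
}

print(segment("彼は図書館へ自転車で行く", TOY_PROBS))
```

With this toy table the longer dictionary words win over their character-by-character splits, because each extra word multiplies in another probability below one.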
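In a concrete implementation, a lexical analysis tool may be used to perform lexical analysis on the input source-language sentence, such as the Stanford Parser, the ICTCLAS analysis system of the Institute of Computing Technology, Chinese Academy of Sciences, or ChaSen.
Specifically, any conventional parsing method may be used for the syntactic analysis of the source-language sentence, such as chart parsing or generalized LR parsing.
In a concrete implementation, a syntactic analysis tool may be used for the syntactic analysis, such as CaboCha or KNP for Japanese.
In the example of FIG. 2, the source-language sentence input through the source language input unit 101 is the Japanese sentence "彼は図書館へ自転車で行く" ("He goes to the library by bicycle"), and the word sequence 202 gives the result of analyzing it. The symbol "." marks the boundaries between words in 202; this boundary marker is of course not the only choice, and a space, for example, may be used instead.
Specifically, during lexical and syntactic analysis of the source-language sentence, a lexical dictionary and preset grammar rules are consulted to assign attribute features to the nodes of the syntactic structure. The syntactic structure consists of nodes, each associated with the grammatical category of the corresponding word. FIG. 3 gives an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
As shown in FIG. 3, the lexical dictionary contains words and their associated grammatical categories. For example, the Japanese word 301 "彼" is associated with the grammatical category Pron. (pronoun). Besides Pron. (pronoun), the grammatical categories of words include V (verb), P (particle), N (noun), and so on.
For example, lexical analysis of the Japanese input sentence "彼は図書館へ自転車で行く" yields the result: 彼/pronoun は/particle 図書館/noun へ/particle 自転車/noun で/particle 行く/verb.
The example of FIG. 4 gives the predefined grammar rules. In this rule list, the grammatical category to the left of the arrow is composed of grammatical categories 1 and 2 to the right of the arrow. For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP VP). The source language analysis unit 102 consults these grammar rules during lexical and syntactic analysis of the source-language sentence.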
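One way to picture binary grammar rules of the kind described for FIG. 4 is a small CYK-style chart recognizer. The sketch below is illustrative only: apart from S → NP VP and the NP "自転車/N で/P", which the text mentions, the rules in `RULES` and the name `recognize` are assumptions, not the contents of FIG. 4.

```python
from itertools import product

# (right category 1, right category 2) -> set of left categories.
# Hypothetical rule table; only "S -> NP VP" is stated in the text.
RULES = {
    ("Pron", "P"): {"NP"},
    ("N", "P"): {"NP"},
    ("NP", "V"): {"VP"},
    ("NP", "VP"): {"VP", "S"},
}


def recognize(categories, rules, start="S"):
    """Return True if the category sequence can be reduced to `start`."""
    n = len(categories)
    # chart[i][j] holds every category that spans positions i..j (j exclusive).
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, category in enumerate(categories):
        chart[i][i + 1].add(category)
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((left, right), set())
    return start in chart[0][n]


# 彼/Pron は/P 図書館/N へ/P 自転車/N で/P 行く/V
print(recognize(["Pron", "P", "N", "P", "N", "P", "V"], RULES))
```

A full parser would also keep back-pointers in each chart cell so that the tree, and not just a yes/no answer, can be recovered.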
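For example, the source-language syntactic structure obtained by parsing the Japanese input sentence "彼は図書館へ自転車で行く" is shown in FIG. 6.
As another example, when the input source-language sentence is the Chinese sentence "我是中国人" ("I am Chinese"), the source language analysis unit 102 analyzes its syntactic structure and determines that "我" is the subject of the sentence, "是" is the predicate, and "中国人" is the object.
During lexical analysis of the source-language sentence, the source language analysis unit 102 may also consult a thesaurus (semantic-class dictionary) to assign attribute features such as part of speech, semantics, and concept to the words in the word sequence.
Specifically, resources such as the Japanese WordNet, Goi-Taikei (日本語語彙大系), and the EDR electronic dictionary can all be used to assign these attributes.
For example, the constituent "彼/pronoun" of the above input sentence can be given the attribute feature "人" (person); "図書館" can be given the attribute feature "場所" (place) or "建物" (building); "自転車" can be given the attribute feature "交通機関" (means of transport); and so on.
It should be noted here that the thesaurus, the lexical dictionary, and the grammar rules are all stored in advance in the source language analysis unit.
The optional-case decision model storage unit 103 is configured to store optional-case decision models, each consisting of a number, the surface form of a word (the word itself), its part of speech, its semantic class, and a case particle. The optional-case decision model is a kind of knowledge base whose main function is to provide the basis for deciding whether an optional case is present in the input source-language sentence.
Specifically, the optional-case decision model may be a set of manually written rules, or it may be extracted from training data by statistical methods based on machine learning principles. Many machine learning methods are available and may be chosen as needed, for example support vector machines (SVM) or decision trees. The present invention therefore does not limit the specific implementation of the optional-case decision model.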
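The optional-case decision unit 104 is configured to extract the attribute features of the nodes of the syntactic structure from the source language analysis unit 102 and to match them against the optional-case decision models stored in the optional-case decision model storage unit 103. If they match, it decides that an optional case is present in the source-language sentence; if they do not match, it decides that none is present.
Specifically, refer to FIG. 5, which is a schematic example of the optional-case decision model library provided by an embodiment of the present invention. Each optional-case decision model in the library consists of a number, the surface form of a word (the word itself), its part of speech, its semantic class, and a case particle. When the optional-case decision unit 104 matches the extracted attribute features against the models in the library shown in FIG. 5, it may use several model forms, such as [surface + case particle], [semantic class + case particle], [surface + part of speech + case particle], or [surface + part of speech + semantic class + case particle], and pattern-match them against the node attribute features extracted from the source language analysis unit 102 to decide whether the source-language sentence contains an optional case.
For example, for the source-language sentence "彼は図書館へ自転車で行く", features such as [自転車] and [で] can first be extracted and then matched against the optional-case decision models in the library shown in FIG. 5. The matching can take several forms. When the attributes of [自転車] include only the noun tag [n], the feature vector formed by [自転車], [n], and [で] is pattern-matched against the models in the library shown in FIG. 5; when the attributes of [自転車] include the noun tag [n] and the semantic attribute [交通機関], the attribute features formed by [交通機関] and [で] alone can be pattern-matched against the models in the library. Both approaches clearly match the model numbered 2 in FIG. 5, so [で] in [自転車で] is judged to be an optional case.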
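Specifically, the optional-case decision unit 104 comprises an extraction module 1041, a reading module 1042, and a matching module 1043.
The extraction module 1041 is configured to extract attribute features, including part of speech, word sense, and concept, from the source language analysis unit 102.
Specifically, the attribute features of the nouns, case particles, and predicates such as verbs in the sentence are extracted as the attribute features used for the optional-case decision on the source-language sentence.
For example, in the input source-language sentence "彼は図書館へ自転車で行く", the segments [彼は], [図書館へ], [自転車で], and the predicate [行く], together with the surface form, part of speech, and semantic class of each word, serve as the attribute features used for the optional-case decision.
The match determination module 1042 matches the extracted attribute features of the syntactic structure nodes against the optional-case decision models stored in the optional-case decision model storage unit 103. If they match, it decides that an optional case is present in the source-language sentence; if they do not match, it decides that none is present.
For example, matching the attribute features extracted from the input source-language sentence "彼は図書館へ自転車で行く" against the models shown in FIG. 5 determines that [で] in [自転車で] is an optional case.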
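The optional-case phrase extraction unit 105 is configured, when the optional-case decision unit 104 decides that an optional case is present in the source-language sentence, to extract from the syntactic structure the node string associated with the optional case as the optional-case phrase, and to output the extracted phrase to the optional-case phrase translation unit 106.
For example, FIG. 6 shows the syntactic analysis result of the input sentence "彼は図書館へ自転車で行く"; when "で" in "自転車で" is judged to be an optional case, only the NP phrase "自転車/N で/P" needs to be extracted.
The optional-case phrase translation unit 106 is configured to machine-translate the optional-case phrase extracted by the optional-case phrase extraction unit 105 and to output the translation result to the translation result integration unit 109.
It should be noted that, because the extracted optional-case phrase is generally a short fragment, the translation technique for this part is flexible and can take many forms: a dedicated translation dictionary for optional-case phrases, a rule-based translation method, or of course an example-based or statistical machine translation method.
The first extraction unit 107 is configured to obtain from the syntactic structure the remaining source-language sentence after the optional-case phrase has been removed, and to output it to the machine translation unit 108.
Specifically, after the optional-case phrase "自転車/N で/P" is extracted from the input sentence "彼は図書館へ自転車で行く", the remainder "彼は図書館へ行く" is obtained; its sentence structure is shown in FIG. 8.
The machine translation unit 108 is configured to machine-translate the sentence passed down from the first extraction unit 107 and to output the translation result to the translation result integration unit 109.
The machine translation unit 108 is further configured, when the optional-case decision unit 104 decides that the analysis result of the source language analysis unit 102 contains no optional-case phrase, to machine-translate the input source-language sentence directly and to output the translation result to the translation result integration unit 109.
Specifically, the machine translation unit 108 may translate the incoming sentence with a rule-based machine translation system, an example-based machine translation system, or a statistical machine translation system.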
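The translation result integration unit 109 is configured to receive the translation result of the optional-case phrase translation unit 106 and the translation result of the machine translation unit 108, to integrate the two results into a complete target-language sentence, and to output that sentence to the target language output unit 110.
Specifically, the translation result integration unit 109 comprises a translation result integration module 1091 and an integration comparison module 1092.
The translation result integration module 1091 is configured to permute and combine the translation result of the optional-case phrase translation unit 106 and the translation result of the machine translation unit 108. Specifically, the module 1091 may use a language model of the target language to rank these two parts.
The integration comparison module 1092 is configured to compare the probabilities of occurrence of the integration results produced by the translation result integration module 1091 and to output the integration result with the high probability of occurrence to the target language output unit 110.
The target language output unit 110 is configured to receive and output the target-language sentence produced by the translation result integration unit 109. Specifically, the target-language sentence can be output in many ways, such as file output or display output. For example, it may be shown as an image on a display device, printed by a printer, or synthesized by a speech synthesizer. These systems may be switched between, or used together, as needed.
By performing lexical and syntactic analysis on the source-language sentence, this embodiment finds the optional case in the sentence and splits the sentence into two parts according to it; that is, a relatively complex sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and appropriately reduces the amount of computation in machine-translation decoding, providing an effective apparatus and method for machine translation research.
Embodiment 2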
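This embodiment provides a machine translation method comprising: inputting a source-language sentence; performing lexical analysis and syntactic analysis on the source-language sentence to obtain its syntactic structure, and assigning attribute features to the nodes of the syntactic structure; matching the attribute features against a stored optional-case decision model, and deciding that the source-language sentence contains an optional case if they match and that it does not if they do not match, wherein the optional-case decision model provides the basis for deciding whether the source-language sentence contains an optional case; obtaining the optional-case phrase in the syntactic structure according to the optional case found by the matching, and machine-translating the optional-case phrase; obtaining the remaining source-language sentence after the optional-case phrase has been removed, and machine-translating it; permuting and combining the translation results of the optional-case phrase and of the remaining source-language sentence, and taking a combination with a high probability of occurrence as the target language; and outputting the target language.
By performing lexical and syntactic analysis on the source-language sentence, this embodiment finds the optional case in the sentence and splits the sentence into two parts according to it; that is, a relatively complex sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and appropriately reduces the amount of computation in machine-translation decoding, providing an effective apparatus and method for machine translation research.
Referring to FIG. 7, FIG. 7 is a flowchart of a machine translation method provided in Embodiment 2 of the present invention. The specific procedure is as follows:
Step S01: input a source-language sentence and store it in a storage unit such as computer memory, or in a buffer.
It should be noted that various input devices may be used to input the source-language sentence, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a speech recognition device, and input devices in the form of text files or databases.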
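Here, the description takes as an example the case where the input source-language sentence is the Japanese sentence "彼は図書館へ自転車で行く" and the target language is Chinese; the translation method of the present invention is of course not limited to Japanese-to-Chinese translation.
Step S02: perform lexical analysis on the source-language sentence to obtain its word sequence, perform syntactic analysis on that word sequence to obtain the syntactic structure of the sentence, assign attribute features to the nodes of the syntactic structure, and output the attribute features and the syntactic structure as the analysis result.
Specifically, any general lexical analysis technique may be used for the lexical analysis, for example a method that maximizes the segmentation probability by dynamic programming under a word segmentation model: the source-language sentence is segmented by dynamic programming according to the model, and the segmentation with the highest probability is selected as the output word sequence.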
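It should be noted that, in a concrete implementation, a lexical analysis tool may be used to perform lexical analysis on the input source-language sentence, such as the Stanford Parser, the ICTCLAS analysis system of the Institute of Computing Technology, Chinese Academy of Sciences, or ChaSen.
Specifically, any conventional parsing method may be used for the syntactic analysis of the source-language sentence, such as chart parsing or generalized LR parsing.
It should be noted that, in a concrete implementation, a syntactic analysis tool may be used for the syntactic analysis, such as CaboCha or KNP for Japanese.
In the example of FIG. 2, the source-language sentence input through the source language input unit 101 is the Japanese sentence "彼は図書館へ自転車で行く", and the word sequence 202 gives the result of analyzing it. The symbol "." marks the boundaries between words in 202; this boundary marker is of course not the only choice, and a space, for example, may be used instead.
Specifically, during lexical and syntactic analysis of the source-language sentence, a lexical dictionary and preset grammar rules are consulted to assign attribute features to the nodes of the syntactic structure. The syntactic structure consists of nodes, each associated with the grammatical category of the corresponding word. FIG. 3 gives an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
As shown in FIG. 3, the lexical dictionary contains words and their associated grammatical categories. For example, the Japanese word 301 "彼" is associated with the grammatical category Pron. (pronoun). Besides Pron. (pronoun), the grammatical categories of words include V (verb), P (particle), N (noun), and so on.
For example, lexical analysis of the Japanese input sentence "彼は図書館へ自転車で行く" yields the result: 彼/pronoun は/particle 図書館/noun へ/particle 自転車/noun で/particle 行く/verb.
The example of FIG. 4 gives the predefined grammar rules. In this rule list, the grammatical category to the left of the arrow is composed of grammatical categories 1 and 2 to the right of the arrow. For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP VP). The source language analysis unit 102 consults these grammar rules during lexical and syntactic analysis of the source-language sentence.
For example, the source-language syntactic structure obtained by parsing the Japanese input sentence "彼は図書館へ自転車で行く" is shown in FIG. 6.
As another example, when the input source-language sentence is the Chinese sentence "我是中国人" ("I am Chinese"), the source language analysis unit 102 analyzes its syntactic structure and determines that "我" is the subject of the sentence, "是" is the predicate, and "中国人" is the object.
It should be noted here that the thesaurus, the lexical dictionary, and the grammar rules are all stored in advance in the source language analysis unit.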
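Step S03: extract attribute features, such as the word, part of speech, semantic class, and concept, from the analysis result. Specifically, the attribute features of the nouns, case particles, and predicates such as verbs in the sentence are extracted as the attribute features used for the optional-case decision on the source-language sentence.
For example, in the input source-language sentence "彼は図書館へ自転車で行く", the segments [彼は], [図書館へ], [自転車で], and the predicate [行く], together with the surface form, part of speech, and semantic class of each word, serve as the attribute features used for the optional-case decision.
Step S04: match the extracted attribute features of the syntactic structure nodes against the stored optional-case decision models. If they match, decide that an optional case is present in the source-language sentence and perform S05; if they do not match, decide that no optional case is present and perform S08.
Here, each optional-case decision model consists of a number, the surface form of a word (the word itself), its part of speech, its semantic class, and a case particle. The model is a kind of knowledge base whose main function is to provide the basis for deciding whether an optional case is present in the input source-language sentence.
Specifically, the optional-case decision model may be a set of manually written rules, or it may be extracted from training data by statistical methods based on machine learning principles. Many machine learning methods are available and may be chosen as needed, for example support vector machines (SVM) or decision trees. The present invention therefore does not limit the specific implementation of the optional-case decision model.
Specifically, referring to FIG. 5, matching the extracted attribute features of the syntactic structure nodes against the stored optional-case decision models includes the following: when the extracted attribute features are matched against the models in the optional-case decision model library shown in FIG. 5, several model forms may be used, such as [surface + case particle], [semantic class + case particle], [surface + part of speech + case particle], or [surface + part of speech + semantic class + case particle], and these are pattern-matched against the node attribute features extracted from the source language analysis unit 102 to decide whether the source-language sentence contains an optional case.
For example, for the source-language sentence "彼は図書館へ自転車で行く", features such as [自転車] and [で] can first be extracted and then matched against the optional-case decision models in the library shown in FIG. 5. The matching can take several forms. When the attributes of [自転車] include only the noun tag [n], the feature vector formed by [自転車], [n], and [で] is pattern-matched against the models in the library shown in FIG. 5; when the attributes of [自転車] include the noun tag [n] and the semantic attribute [交通機関], the attribute features formed by [交通機関] and [で] alone can be pattern-matched against the models in the library. Both approaches clearly match the model numbered 2 in FIG. 5, so [で] in [自転車で] is judged to be an optional case.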
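The template matching of step S04 can be sketched as a lookup in which a model matches when it agrees with every feature that was actually extracted. The model entries below are hypothetical stand-ins for the contents of FIG. 5, which are not reproduced in the text; `MODELS` and `match_model` are illustrative names only.

```python
# Hypothetical model library: number, surface form, part of speech,
# semantic class, and case particle, as the text lists for each model.
MODELS = [
    {"number": 1, "surface": "図書館", "pos": "n", "semantic": "場所", "particle": "で"},
    {"number": 2, "surface": "自転車", "pos": "n", "semantic": "交通機関", "particle": "で"},
]


def match_model(features, models=MODELS):
    """Return the number of the first model that agrees with every supplied feature.

    `features` may hold any subset of surface/pos/semantic, but it must
    include the case particle, since every model form ends in one.
    """
    if "particle" not in features:
        return None
    for model in models:
        if all(model.get(key) == value for key, value in features.items()):
            return model["number"]
    return None


# [surface + part of speech + case particle]
print(match_model({"surface": "自転車", "pos": "n", "particle": "で"}))
# [semantic class + case particle]
print(match_model({"semantic": "交通機関", "particle": "で"}))
```

Under these assumed entries both feature sets select model 2, which mirrors the two matching routes described for [自転車で]; a node followed by a different particle, such as [図書館へ], matches nothing.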
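Step S05: extract from the syntactic structure the node string associated with the optional case as the optional-case phrase; perform step S06 on the extracted optional-case phrase, and perform S07 on the remainder after the optional-case phrase is removed.
Specifically, FIG. 6 shows the syntactic analysis result of the input sentence "彼は図書館へ自転車で行く"; when "で" in "自転車で" is judged to be an optional case, only the NP phrase "自転車/N で/P" needs to be extracted.
Specifically, after the optional-case phrase "自転車/N で/P" is extracted from the input sentence "彼は図書館へ自転車で行く", the remainder "彼は図書館へ行く" is obtained; its sentence structure is shown in FIG. 8.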
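The extraction in step S05 can be illustrated on a flat token list. This is a simplification: the patent extracts a node string from the syntactic structure, whereas the hypothetical helper `split_optional_case` below simply assumes the phrase is the noun immediately before the particle plus the particle itself.

```python
def split_optional_case(tokens, particle_index):
    """Split tokens into (optional-case phrase, remaining sentence).

    `tokens` is a list of (surface, category) pairs and `particle_index`
    is the position of the case particle judged to mark an optional case.
    """
    if particle_index < 1 or particle_index >= len(tokens):
        raise ValueError("particle must be preceded by a noun inside the sentence")
    phrase = tokens[particle_index - 1:particle_index + 1]
    remainder = tokens[:particle_index - 1] + tokens[particle_index + 1:]
    return phrase, remainder


TOKENS = [("彼", "Pron"), ("は", "P"), ("図書館", "N"), ("へ", "P"),
          ("自転車", "N"), ("で", "P"), ("行く", "V")]

phrase, remainder = split_optional_case(TOKENS, 5)
print("".join(surface for surface, _ in phrase))
print("".join(surface for surface, _ in remainder))
```

The two parts can then be handed to steps S06 and S07 independently.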
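Step S06: machine-translate the extracted optional-case phrase, then perform step S08.
It should be noted that, because the extracted optional-case phrase is generally a short fragment, the translation technique for this part is flexible and can take many forms: corresponding phrase pairs can be extracted from a large-scale corpus and built into a dedicated translation dictionary, a rule-based translation method can be used to translate the optional-case phrase, or of course an example-based or statistical machine translation method can be used.
For example, translating the extracted optional-case phrase "自転車で" yields the translation result "骑自行车" ("by bicycle").
Step S07: perform machine translation on the remaining part of the sentence.
It should also be noted here that translating the remainder of the source-language sentence after the optional-case phrase is removed specifically includes: permuting and combining the remaining sentence constituents, and machine-translating the combination with the highest probability of occurrence.
Specifically, the machine translation method in this step is not limited; it may be a rule-based machine translation system, an example-based machine translation system, a statistical machine translation system, or the like.
For example, machine-translating the remaining source-language sentence "彼は図書館へ行く" after the optional-case phrase is extracted yields the translation result "他去 图书馆" ("he goes to the library").
For example, in an example-based translation system, a string is translated on the basis of examples, and the similarity between the string and an example is used as the translation score; in a statistical translation system, a string is translated on the basis of the language model, and the translation probability given by the translation model is used as the translation score; in a rule-based translation system, a string is translated on the basis of the syntax and the rules applied, and the translation score is obtained from the confidence of the syntactic analysis and the preference of the rules applied.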
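For the example-based case, a score of the kind described, the similarity between the input string and a stored example, might look like the sketch below. The similarity measure (`difflib.SequenceMatcher` from the Python standard library), the example store `EXAMPLES`, and the function name are assumptions chosen for illustration; the patent does not specify a similarity measure.

```python
from difflib import SequenceMatcher


def translate_by_example(source, examples):
    """Return (target, score) for the stored example most similar to `source`."""
    best_target, best_score = None, -1.0
    for example_source, example_target in examples:
        # ratio() is 1.0 for identical strings and falls toward 0.0 otherwise.
        score = SequenceMatcher(None, source, example_source).ratio()
        if score > best_score:
            best_target, best_score = example_target, score
    return best_target, best_score


# Hypothetical example pairs (source phrase, target phrase).
EXAMPLES = [("自転車で", "骑自行车"), ("バスで", "坐公交车")]

print(translate_by_example("自転車で", EXAMPLES))
```

Short optional-case phrases suit this kind of lookup because an exact or near-exact example is often available.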
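Step S08: integrate the translation results of steps S06 and S07.
Specifically, the two translation results are permuted and combined, and the combination with the high probability of occurrence is selected as the integration result and output.
The function of the integration step S08 is to integrate the translation results of steps S06 and S07. When, as above, the Japanese-to-Chinese translation yields the two parts "他 去 图书馆" ("he goes to the library") and "骑自行车" ("by bicycle"), a language model of the target language can be used to rank the arrangements of these two parts. Provided that the quality and size of the Chinese corpus used to build the language model are adequate, "他骑自行车去 图书馆" ("he goes to the library by bicycle") can be computed to have the highest probability. The result of step S08 is then passed to the target language output step S09.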
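The ranking in step S08 can be sketched with a toy bigram language model. Everything here is assumed for illustration: the three-sentence training corpus, add-one smoothing, and the simplification that the translated optional-case phrase is inserted as one unit at each position of the translated remainder rather than fully permuted.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count bigrams over whitespace-tokenized sentences with boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, len(vocab)


def log_prob(tokens, model):
    """Add-one smoothed bigram log-probability of a token sequence."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def integrate(remainder, phrase, model):
    """Try the phrase at every position in the remainder and keep the likeliest."""
    candidates = [remainder[:i] + phrase + remainder[i:]
                  for i in range(len(remainder) + 1)]
    return max(candidates, key=lambda tokens: log_prob(tokens, model))


# Hypothetical target-language corpus, for illustration only.
MODEL = train_bigram(["他 骑自行车 去 学校", "我 骑自行车 去 图书馆", "他 去 图书馆"])

print(" ".join(integrate(["他", "去", "图书馆"], ["骑自行车"], MODEL)))
```

With this toy corpus the arrangement "他 骑自行车 去 图书馆" scores highest, in line with the example in the text; a real system would train the model on a large corpus, for instance with a toolkit such as SRILM.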
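Step S09: output the integration result obtained in step S08 as the final target-language sentence.
Specifically, the output can take various forms, such as a display, a text file, or speech. For example, the result may be shown as an image on a display device, printed by a printer, or synthesized by a speech synthesizer. These systems may be switched between, or used together, as needed.
In addition, because the translation in steps S06 and S07 of the method can take many forms, the training corpus can be processed appropriately when a statistical machine translation method is used. FIG. 9 is a schematic diagram of a parallel corpus splitting method for statistical machine translation in an embodiment of the present invention. As shown in FIG. 9, the splitting is performed mainly by the parallel corpus splitting unit 210, which can use the optional-case decision model to judge the sentences in the corpus. This readily yields two parts, sentences without an optional case and sentences with one, completing the split of the original parallel corpus. The purpose of this processing is that, when the translation model and the language model for statistical machine translation are built, the two parts of the corpus can be used flexibly as needed.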
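The corpus split described for FIG. 9 amounts to partitioning sentence pairs by a predicate. In the sketch below the predicate is a deliberately crude placeholder (it only checks for the particle で); in the patent that role is played by the optional-case decision model. The names `split_corpus`, `has_optional_case`, and the two sample pairs are hypothetical.

```python
def split_corpus(pairs, has_optional_case):
    """Partition (source, target) pairs by whether the source has an optional case."""
    with_case, without_case = [], []
    for source, target in pairs:
        bucket = with_case if has_optional_case(source) else without_case
        bucket.append((source, target))
    return with_case, without_case


def has_optional_case(sentence):
    # Placeholder predicate only: a real system would run the decision model.
    return "で" in sentence


PAIRS = [
    ("彼は図書館へ自転車で行く", "他骑自行车去图书馆"),
    ("彼は図書館へ行く", "他去图书馆"),
]

with_case, without_case = split_corpus(PAIRS, has_optional_case)
print(len(with_case), len(without_case))
```

Each partition can then feed its own translation-model or language-model training run, or the two can be recombined, as needed.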
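Of course, the training corpus need not be split; translation training can be performed on it directly. FIG. 10 is a schematic diagram of a training method of a statistical machine translation apparatus provided by an embodiment of the present invention. The function of the language model / translation model construction unit 310 in this training method is to build the translation model and the language model, for which conventional tools such as GIZA++ and SRILM can be used.
FIG. 11 is a schematic diagram of a training method of a statistical machine translation apparatus in an embodiment of the present invention. It differs from the training method shown in FIG. 10 in that the training corpus is a source-target parallel corpus from which optional-case phrases have been removed.
Through lexical and syntactic analysis of the source-language sentence, the optional case in the sentence is found and the sentence is split into two parts according to it; that is, a relatively complex sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integration result with the high combination probability is selected as the translation result. This reduces the complexity of the source-language syntactic structure, improves the efficiency of generating the target-language sentence structure and grammar, improves translation accuracy, and appropriately reduces the amount of computation in machine-translation decoding, providing an effective apparatus and method for machine translation research.
All or part of the technical solutions provided by the above embodiments may be implemented in software, with the software program stored in a readable storage medium such as a computer hard disk, an optical disc, or a floppy disk.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.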

Claims

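1. A machine translation apparatus, characterized in that the apparatus comprises:
a source language input unit, configured to input a source-language sentence;
a source language analysis unit, configured to perform lexical analysis and syntactic analysis on the source-language sentence to obtain the syntactic structure of the source-language sentence, and to assign attribute features to the nodes of the syntactic structure;
an optional-case decision model storage unit, configured to store an optional-case (任意格) decision model, the optional-case decision model providing the basis for deciding whether the source-language sentence contains an optional case;
an optional-case decision unit, configured to match the attribute features against the optional-case decision model, and to decide that the source-language sentence contains an optional case if they match and that it does not contain an optional case if they do not match;
an optional-case phrase extraction unit, configured to obtain the optional-case phrase in the syntactic structure according to the optional case found by the matching;
an optional-case phrase translation unit, configured to machine-translate the optional-case phrase;
a first extraction unit, configured to obtain the remaining source-language sentence after the optional-case phrase has been removed;
a machine translation unit, configured to machine-translate the remaining source-language sentence;
a translation result integration unit, configured to permute and combine the translation results of the optional-case phrase translation unit and the machine translation unit, and to take a combination with a high probability of occurrence as the target language;
a target language output unit, configured to output the target language.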
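2. The apparatus according to claim 1, characterized in that the source language analysis unit is specifically configured to:
perform lexical analysis on the source-language sentence according to a lexical dictionary to obtain the word sequence of the source-language sentence;
perform syntactic analysis on the word sequence of the source-language sentence according to preset grammar rules to obtain the syntactic structure of the source-language sentence, the syntactic structure comprising nodes each associated with the grammatical category of the corresponding word in the word sequence;
assign attribute features to the nodes of the syntactic structure according to a thesaurus, the attribute features comprising the word itself, part of speech, word sense, or concept attributes.
3. The apparatus according to claim 1, characterized in that
the optional-case phrase extraction unit is specifically configured to obtain the node string in the syntactic structure that is associated with the optional case as the optional-case phrase.
4. The apparatus according to claim 1, characterized in that the optional-case phrase translation unit is specifically configured to translate the optional-case phrase according to an optional-case translation dictionary.
5. The apparatus according to claim 1, characterized in that the first extraction unit is further configured to permute and combine the node phrases in the syntactic structure of the remaining source-language sentence, and to output a combination with a high probability of occurrence to the machine translation unit.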
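6. A machine translation method, characterized in that the method comprises:
inputting a source-language sentence;
performing lexical analysis and syntactic analysis on the source-language sentence to obtain the syntactic structure of the source-language sentence, and assigning attribute features to the nodes of the syntactic structure;
matching the attribute features against a stored optional-case (任意格) decision model, and deciding that the source-language sentence contains an optional case if they match and that it does not contain an optional case if they do not match, wherein the optional-case decision model provides the basis for deciding whether the source-language sentence contains an optional case;
obtaining the optional-case phrase in the syntactic structure according to the optional case found by the matching, and machine-translating the optional-case phrase;
obtaining the remaining source-language sentence after the optional-case phrase has been removed, and machine-translating the remaining source-language sentence;
permuting and combining the translation results of the optional-case phrase and of the remaining source-language sentence, and taking a combination with a high probability of occurrence as the target language;
outputting the target language.
7. The method according to claim 6, characterized in that performing lexical analysis and syntactic analysis on the source-language sentence to obtain the syntactic structure of the source-language sentence, and assigning attribute features to the nodes of the syntactic structure, comprises:
performing lexical analysis on the source-language sentence according to a lexical dictionary to obtain the word sequence of the source-language sentence;
performing syntactic analysis on the word sequence of the source-language sentence according to preset grammar rules to obtain the syntactic structure of the source-language sentence, the syntactic structure comprising nodes each associated with the grammatical category of the corresponding word in the word sequence;
assigning attribute features to the nodes of the syntactic structure according to a thesaurus, the attribute features comprising part of speech, word sense, or concept attributes.
8. The method according to claim 6, characterized in that obtaining the optional-case phrase in the syntactic structure according to the optional case comprises: obtaining the node string in the syntactic structure that is associated with the optional case as the optional-case phrase.
9. The method according to claim 6, characterized in that the method further comprises: translating the optional-case phrase according to an optional-case translation dictionary.
10. The method according to claim 6, characterized in that the method further comprises: permuting and combining the node phrases in the syntactic structure of the remaining source-language sentence, and machine-translating a combination with a high probability of occurrence.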
PCT/CN2010/079963 2010-12-17 2010-12-17 机器翻译装置和方法 WO2012079257A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2010/079963 WO2012079257A1 (zh) 2010-12-17 2010-12-17 机器翻译装置和方法
CN201080070253.6A CN103314369B (zh) 2010-12-17 2010-12-17 机器翻译装置和方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/079963 WO2012079257A1 (zh) 2010-12-17 2010-12-17 机器翻译装置和方法

Publications (1)

Publication Number Publication Date
WO2012079257A1 true WO2012079257A1 (zh) 2012-06-21

Family

ID=46243999

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/079963 WO2012079257A1 (zh) 2010-12-17 2010-12-17 机器翻译装置和方法

Country Status (2)

Country Link
CN (1) CN103314369B (zh)
WO (1) WO2012079257A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320650A (zh) * 2014-07-31 2016-02-10 崔晓光 一种机器翻译方法及其***
CN111241245A (zh) * 2020-01-14 2020-06-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268132B (zh) * 2014-09-11 2017-04-26 北京交通大学 机器翻译方法及***
CN104268133B (zh) * 2014-09-11 2018-02-13 北京交通大学 机器翻译方法及***
JP6553180B2 (ja) * 2014-10-17 2019-07-31 エム・ゼット・アイ・ピィ・ホールディングス・リミテッド・ライアビリティ・カンパニーMz Ip Holdings, Llc 言語検出を行うためのシステムおよび方法
CN104391842A (zh) * 2014-12-18 2015-03-04 苏州大学 一种翻译模型构建方法和***
CN110175338B (zh) * 2019-05-31 2023-09-26 北京金山数字娱乐科技有限公司 一种数据处理方法及装置
CN111104796B (zh) * 2019-12-18 2023-05-05 北京百度网讯科技有限公司 用于翻译的方法和装置
CN112613326B (zh) * 2020-12-18 2022-11-08 北京理工大学 一种融合句法结构的藏汉语言神经机器翻译方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1156287A (zh) * 1995-09-11 1997-08-06 松下电器产业株式会社 机器翻译用中文生成装置
JP2827321B2 (ja) * 1989-09-18 1998-11-25 日本電気株式会社 日本語から中国語への機械翻訳方式
CN1308748A (zh) * 1998-05-04 2001-08-15 特雷道斯股份有限公司 机器辅助翻译工具
CN1407483A (zh) * 2001-09-04 2003-04-02 优网通国际资讯股份有限公司 文本表达方法及***以及文本翻译方法及***
CN1595398A (zh) * 2003-09-09 2005-03-16 株式会社国际电气通信基础技术研究所 选择改良多个候补译文所生成的最优译文的机器翻译***
CN101593174A (zh) * 2009-03-11 2009-12-02 林勋准 一种机器翻译方法及***

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320650A (zh) * 2014-07-31 2016-02-10 崔晓光 一种机器翻译方法及其***
CN111241245A (zh) * 2020-01-14 2020-06-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备
CN111241245B (zh) * 2020-01-14 2021-02-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备

Also Published As

Publication number Publication date
CN103314369B (zh) 2015-08-12
CN103314369A (zh) 2013-09-18

Similar Documents

Publication Publication Date Title
WO2012079257A1 (zh) 机器翻译装置和方法
KR101130444B1 (ko) 기계번역기법을 이용한 유사문장 식별 시스템
US9697477B2 (en) Non-factoid question-answering system and computer program
CN103970798B (zh) 数据的搜索和匹配
US20100121630A1 (en) Language processing systems and methods
WO2010046782A2 (en) Hybrid machine translation
Karim Technical challenges and design issues in bangla language processing
Abdurakhmonova et al. Linguistic functionality of Uzbek Electron Corpus: uzbekcorpus. uz
Zakharov Corpora of the Russian language
Peng et al. Research on tree kernel-based personal relation extraction
JP5722375B2 (ja) 文末表現変換装置、方法、及びプログラム
Soumya et al. Development of a POS tagger for Malayalam-an experience
Monga et al. Speech to Indian Sign Language Translator
Nguyen et al. A tree-to-string phrase-based model for statistical machine translation
CN105045784A (zh) 英语词句的存取装置方法和装置
JP4478042B2 (ja) 頻度情報付き単語集合生成方法、プログラムおよびプログラム記憶媒体、ならびに、頻度情報付き単語集合生成装置、テキスト索引語作成装置、全文検索装置およびテキスト分類装置
JP6145011B2 (ja) 文正規化システム、文正規化方法及び文正規化プログラム
JP3050743B2 (ja) 言語データベースの形態素列変換装置
JP2019087058A (ja) 文章中の省略を特定する人工知能装置
España-Bonet et al. Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017
Tsai et al. Applying an NVEF Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem
Samir et al. Training and evaluation of TreeTagger on Amazigh corpus
JP2014134871A (ja) 質問応答用検索キーワード生成方法、装置、及びプログラム
JP4940251B2 (ja) 文書処理プログラム及び文書処理装置
JPH11338863A (ja) 未知名詞および表記ゆれカタカナ語自動収集・認定装置、ならびにそのための処理手順を記録した記録媒体

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10860630

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/10/2013)

122 Ep: pct application non-entry in european phase

Ref document number: 10860630

Country of ref document: EP

Kind code of ref document: A1