WO2017012327A1 - Method and apparatus for syntactic analysis - Google Patents

Method and apparatus for syntactic analysis

Info

Publication number
WO2017012327A1
Authority
WO
WIPO (PCT)
Prior art keywords
language sentence
target language
state transition
syntax tree
instance
Prior art date
Application number
PCT/CN2016/072422
Other languages
English (en)
French (fr)
Inventor
涂兆鹏
陈晓
姜文斌
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2017012327A1
Priority to US15/872,993 (US10909315B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Definitions

  • the present invention relates to the field of information technology and, more particularly, to a method and apparatus for syntax analysis.
  • the existing syntactic analysis methods can be roughly divided into two categories: supervised syntactic analysis and unsupervised syntactic analysis.
  • Supervised syntactic analysis extracts features from a manually annotated treebank and uses a machine learning model to learn the relationship between the features and the manually annotated syntactic structures; for a sentence to be annotated, it searches, according to the learned model, for the combination of syntactic structures that matches the features in the sentence, so as to generate a syntax tree for the given sentence.
  • Supervised syntactic analysis requires the machine learning model to learn operation decisions that relate features to annotated syntactic structures, and obtaining these decision examples requires a large amount of manually annotated data. Without training data, syntactic analysis of the language in question cannot be carried out at all.
  • Unsupervised syntactic analysis automatically generates an annotated treebank from sentences that carry no annotation.
  • the biggest flaw of unsupervised syntactic analysis is that it relies solely on unsupervised learning over raw text, and therefore cannot yield a parser that is usable in practice.
  • the embodiment of the invention provides a method and a device for syntactic analysis, which can automatically generate a syntax tree consistent with syntactic knowledge, thereby improving the efficiency of syntactic analysis.
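  • a method of syntactic analysis, comprising:
  • obtaining a source language sentence that is a translation of a target language sentence;
  • determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence;
  • generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence.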
  • determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between a word of the target language sentence and a word of the source language sentence including:
  • for any adjacent segments $x_l$ and $x_r$ in the target language sentence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$ are determined according to the correspondence;
  • a state transition instance of the target language sentence is determined according to a state transition instance corresponding to all adjacent segments in the target language sentence.
  • obtaining the state transition instance corresponding to $x_l$ and $x_r$ according to the constituent relationship between $y_l$ and $y_r$ in the syntax tree of the source language sentence includes:
  • the method further includes:
  • a state transition instance of the target language sentence is determined according to a score of the state transition instance corresponding to all adjacent segments in the target language sentence.
  • determining the state transition instance of the target language sentence according to the scores of the state transition instances corresponding to all adjacent segments in the target language sentence includes:
  • the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence are determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.
  • scoring the state transition instance corresponding to $x_l$ and $x_r$ includes:
  • the state transition instance is scored according to $p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
  • where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained from $x_l$ and $x_r$, and $y_l$ and $y_r$.
  • obtaining the source language sentence that is a translation of the target language sentence includes:
  • the source language sentence that is a translation of the target language sentence is obtained from a parallel corpus of the target language and the source language.
  • generating a syntax tree of the target language sentence including:
  • T represents the state transition operation and D represents the derivation of the syntax tree.
  • the method further includes:
  • the target language analyzer is trained according to the syntax tree of the target language sentence.
  • an apparatus for syntactic analysis comprising:
  • An obtaining module configured to obtain a source language sentence that is a translation of the target language sentence
  • a determining module configured to determine a state transition instance of the target language sentence according to the source language sentence and a correspondence between a word of the target language sentence and a word of the source language sentence;
  • a generating module configured to generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
  • the determining module is specifically configured to:
  • for any adjacent segments $x_l$ and $x_r$ in the target language sentence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$ are determined according to the correspondence;
  • a state transition instance of the target language sentence is determined according to a state transition instance corresponding to all adjacent segments in the target language sentence.
  • the determining module is specifically configured to:
  • the determining module is specifically configured to:
  • a state transition instance of the target language sentence is determined according to a score of the state transition instance corresponding to all adjacent segments in the target language sentence.
  • the determining module is specifically configured to:
  • the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence are determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.
  • the determining module is specifically configured to:
  • score the state transition instance corresponding to $x_l$ and $x_r$ according to $p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
  • where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained from $x_l$ and $x_r$, and $y_l$ and $y_r$.
  • the acquiring module is specifically configured to:
  • the source language sentence that is a translation of the target language sentence is obtained from a parallel corpus of the target language and the source language.
  • the generating module is specifically configured to:
  • T represents the state transition operation and D represents the derivation of the syntax tree.
  • the apparatus further includes:
  • a training module is configured to train the target language analyzer according to a syntax tree of the target language sentence.
  • the embodiment of the present invention generates a syntax tree of a target language sentence according to a source language sentence that is a translation of the target language sentence, and can obtain, without manual labeling, a syntax tree of the target language sentence that conforms to syntactic knowledge, thereby improving the efficiency of syntactic analysis.
  • FIG. 1 is a schematic flow chart of a method of syntax analysis according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a syntax tree of a source language sentence according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a fragment corresponding to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of an extraction example of an embodiment of the present invention.
  • FIG. 5 is a schematic flow chart of a method of syntax analysis according to another embodiment of the present invention.
  • Figure 6 is a schematic block diagram of an apparatus for syntax analysis of one embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of an apparatus for syntax analysis according to another embodiment of the present invention.
  • the target language is the language to be analyzed.
  • the target language can be a resource-scarce language. Because syntactic resources are scarce, such a language has either no syntactic parser or no parser that performs well.
  • the source language is a language that can be parsed by an existing syntactic parser or syntactic parsing method.
  • the source language can be a resource rich language for which a syntactic parser is available or can be trained by an existing syntactic tree library to obtain a parser.
  • FIG. 1 shows a schematic flow diagram of a method 100 of syntax analysis in accordance with an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:
  • a syntax tree of a target language sentence is generated by using a source language sentence that is a translation of the target language sentence. For a target language sentence, the state transition instance of the target language sentence is first determined according to the source language sentence and the correspondence between the words of the target language sentence and the words of the source language sentence, and the syntax tree of the target language sentence is then generated according to that state transition instance.
  • the target language syntax tree library can be obtained from a plurality of target language sentences. Therefore, the embodiment of the present invention can obtain the target language syntax tree library without manual labeling, and the target language syntax tree library is more in line with the syntactic knowledge than the automatically generated syntax tree library in the unsupervised learning.
  • the syntax analysis method of the embodiment of the present invention generates a syntax tree of a target language sentence according to a source language sentence that is a translation of the target language sentence, and can obtain, without manual labeling, a syntax tree of the target language sentence that conforms to syntactic knowledge, thereby improving the efficiency of syntactic analysis.
  • obtaining a source language sentence that is a translation of the target language sentence including:
  • the source language sentence that is a translation of the target language sentence is obtained from a parallel corpus of the target language and the source language.
  • A parallel corpus is a bilingual corpus in which the source language and the target language are translations of each other at the sentence level. That is, each target language sentence in the parallel corpus has a source language sentence as its translation.
  • the parallel corpus may be a bilingual parallel corpus, a bilingual dictionary, or a bilingual correspondence rule.
  • the target language sentence is selected from the parallel corpus, and the syntax tree of the target language sentence is generated according to the translation of the target language sentence (source language sentence).
  • the corresponding sentences in the parallel corpus may be pre-processed in the corresponding language, for example, Chinese needs to perform word segmentation, and English needs to perform tokenization to reduce data sparsity and increase data consistency.
  • determining a state transition instance of the target language sentence according to the source language sentence and the correspondence between the words of the target language sentence and the words of the source language sentence including:
  • for any adjacent segments $x_l$ and $x_r$ in the target language sentence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$ are determined according to the correspondence;
  • a state transition instance of the target language sentence is determined according to a state transition instance corresponding to all adjacent segments in the target language sentence.
  • the source language sentence is analyzed, and the syntax tree of the source language sentence is obtained.
  • the syntax tree of the source language sentence can be obtained by an existing source language parser, for example, the Stanford parser, or by a parser trained on an existing treebank of the source language.
  • the segments $y_l$ and $y_r$ of the source language sentence corresponding to $\langle x_l, x_r \rangle$ are obtained and are expressed as $\langle y_l, y_r \rangle$.
  • $y_l$ and $y_r$ are not necessarily adjacent.
  • the correspondence can be obtained by using an existing alignment tool, such as GIZA++, or other automatic alignment tools, which are not limited by the present invention.
  • the word alignment takes the form 1:1 2:3 ..., indicating, for example, that the first word of the source language sentence corresponds to the first word of the target language sentence, and that the second word of the source language sentence corresponds to the third word of the target language sentence.
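The alignment format and the segment correspondence described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the pairs are whitespace-separated `source:target` indices counted from 1, and it assumes that a target segment maps to the smallest source span covering all of its aligned words. The names `parse_alignment` and `project_to_source` are hypothetical.

```python
from typing import Optional, Set, Tuple

Link = Tuple[int, int]   # (source_index, target_index), converted to 0-based
Span = Tuple[int, int]   # half-open [start, end) word offsets


def parse_alignment(text: str) -> Set[Link]:
    """Parse pairs such as "1:1 2:3" into 0-based (source, target) links."""
    links = set()
    for pair in text.split():
        source, target = pair.split(":")
        links.add((int(source) - 1, int(target) - 1))
    return links


def project_to_source(links: Set[Link], target_span: Span) -> Optional[Span]:
    """Return the smallest source span covering every word aligned to target_span.

    Assumption: this covering-span rule is one plausible reading of
    "the segment of the source sentence corresponding to x".
    """
    start, end = target_span
    hits = [s for s, t in links if start <= t < end]
    if not hits:
        return None  # no word of the target segment is aligned
    return (min(hits), max(hits) + 1)


links = parse_alignment("1:1 2:3")
```

Here `project_to_source(links, (2, 3))` maps the third target word back to the second source word, while an unaligned target segment yields `None`.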
  • the state transition instance is extracted according to the relationship between $y_l$ and $y_r$.
  • if $y_l$ and $y_r$ form a constituent in the syntax tree of the source language sentence, a merge operation instance is extracted, that is, a positive derivation example; if $y_l$ and $y_r$ cannot form a constituent in the syntax tree of the source language sentence, a separate operation instance is extracted, that is, a negative derivation example.
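The merge/separate rule can be sketched as below, under stated assumptions: source constituents are represented as a set of half-open word spans, and "forming a constituent" is read as "the span covering both segments is itself a constituent". The function name `extract_instance` and the example bracketing are hypothetical; only the fact that V and NP combine into VP comes from the text.

```python
from typing import Optional, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) word offsets in the source sentence


def extract_instance(y_l: Span, y_r: Span, constituents: Set[Span]) -> Optional[str]:
    """Return "reduce" (merge), "separate", or None when no instance is extracted."""
    # Instances are only extracted when both source segments are constituents.
    if y_l not in constituents or y_r not in constituents:
        return None
    combined = (min(y_l[0], y_r[0]), max(y_l[1], y_r[1]))
    # Positive example if the two segments together form one constituent.
    return "reduce" if combined in constituents else "separate"


# Assumed bracketing of "Railway workers learn English grammar":
# S( NP(Railway workers) VP( V(learn) NP(English grammar) ) )
source_constituents = {(0, 5), (0, 2), (2, 5), (2, 3), (3, 5)}

verb_object = extract_instance((2, 3), (3, 5), source_constituents)
subject_verb = extract_instance((0, 2), (2, 3), source_constituents)
```

With this bracketing, the pair (V, NP) yields a merge instance because the span of VP is a constituent, while (NP, V) yields a separate instance.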
  • the state transition instances of the target language sentence are then selected from all of the extracted state transition instances.
  • the method 100 further includes:
  • determining a state transition instance of the target language sentence according to the state transition instance corresponding to all adjacent segments in the target language sentence including:
  • a state transition instance of the target language sentence is determined according to a score of the state transition instance corresponding to all adjacent segments in the target language sentence.
  • the alignment strength of the source language segment and the target language segment may be high or low.
  • the selection may be based on the best one or more alignment results.
  • the N-1 state transition instances with the highest scores among the state transition instances corresponding to all adjacent segments in the target language sentence may be determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.
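Selecting the N-1 highest-scoring instances is a plain top-k selection. The sketch below assumes the instances arrive as `(instance, score)` pairs; the function name `select_top_instances` and the sample data are hypothetical. A plausible reason for the N-1 budget, not stated in the text, is that a binary tree over N words contains exactly N-1 merges.

```python
import heapq
from typing import Any, List, Tuple


def select_top_instances(scored: List[Tuple[Any, float]], n_words: int) -> List[Tuple[Any, float]]:
    """Keep the n_words - 1 highest-scoring (instance, score) pairs, best first."""
    return heapq.nlargest(n_words - 1, scored, key=lambda item: item[1])


candidates = [("merge(0,1)", 0.9), ("separate(1,2)", 0.2), ("merge(1,2)", 0.5)]
selected = select_top_instances(candidates, n_words=3)
```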
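  • the state transition instance corresponding to $x_l$ and $x_r$ may be scored according to the following equation:
  • $p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
  • where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained according to $x_l$ and $x_r$, and $y_l$ and $y_r$;
  • $i$ is a word in the segment $x$, and $j$ is a word in the segment $y$.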
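The score of an instance is the product of two per-segment factors, p(x_l, y_l | A) × p(x_r, y_r | A). The sketch below implements only that product. The definition of the per-segment factor is given in an equation image that did not survive in this text, so `block_consistency` is a hypothetical stand-in and not the patent's formula: it returns the share of alignment weight touching x or y that falls inside the block x × y. Rows of the matrix are assumed to index target words and columns source words.

```python
from typing import Callable, Sequence, Tuple

Matrix = Sequence[Sequence[float]]  # a[i][j]: weight linking target word i to source word j
Span = Tuple[int, int]              # half-open [start, end)


def block_consistency(x: Span, y: Span, a: Matrix) -> float:
    """HYPOTHETICAL stand-in for p(x, y | A); the original definition is not available."""
    inside = touching = 0.0
    for i, row in enumerate(a):
        for j, weight in enumerate(row):
            in_x = x[0] <= i < x[1]
            in_y = y[0] <= j < y[1]
            if in_x or in_y:
                touching += weight
                if in_x and in_y:
                    inside += weight
    return inside / touching if touching else 0.0


def instance_score(x_l: Span, x_r: Span, y_l: Span, y_r: Span, a: Matrix,
                   pair_score: Callable[[Span, Span, Matrix], float] = block_consistency) -> float:
    """p(x_l, x_r, y_l, y_r | A) = p(x_l, y_l | A) * p(x_r, y_r | A)."""
    return pair_score(x_l, y_l, a) * pair_score(x_r, y_r, a)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
clean = instance_score((0, 1), (1, 3), (0, 1), (1, 3), identity)
crossed = instance_score((0, 1), (1, 2), (0, 1), (1, 2), [[1.0, 0.0], [1.0, 0.0]])
```

Passing the factor in as `pair_score` keeps the product rule separate from the assumed factor, so a faithful definition can be substituted without touching `instance_score`.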
  • the given target language sentence is: "railway workers learn English grammar", whose translation is the source language sentence "Railway Workers Learn English Grammar”.
  • the corresponding segments $\langle y_l, y_r \rangle$ of the above two adjacent segments in the source language sentence are obtained.
  • the corresponding segment pair is <learn, English grammar>.
  • a state transition instance is extracted based on the relationship between $y_l$ and $y_r$.
  • <V, NP> constitutes a larger constituent VP, that is, <V, NP> can be merged. It is therefore inferred that <learn, English grammar> can also be merged, so a merge operation instance is extracted and scored.
  • N is the length of the target language sentence.
  • the syntax tree of the target language sentence may be generated according to the state transition instance of the target language sentence.
  • the syntax tree Y(X) of the target language sentence X may be generated according to the following equation.
  • T represents a state transition operation and D represents a derivation of a syntax tree.
  • T is a triple of three elements: the first takes a value in {reduce, separate}, indicating whether the two constituents are merged or kept separate; the second belongs to NT, the set of nonterminals, and gives the target nonterminal after the merge; the third takes a value in {left, right}, indicating which constituent is the head after the merge.
  • the state transition operation triple can be broken down into two parts:
  • the feature template can be used to extract the corresponding features of each instance, and the correlation probability (ie, the score) is obtained by training the classifier.
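The state transition triple and a score-driven derivation can be sketched as follows. This is a greedy illustration under stated assumptions, not the search over derivations described by the patent's equation: `propose` is a hypothetical stand-in for the trained classifier and returns a transition with its score for two adjacent nodes; the field names `action`, `label`, and `head` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    action: str                  # "reduce" (merge) or "separate"
    label: Optional[str] = None  # target nonterminal after a merge
    head: Optional[str] = None   # "left" or "right": which child is the head


def greedy_parse(words: List[str],
                 propose: Callable[[Any, Any], Tuple[Transition, float]]) -> List[Any]:
    """Repeatedly apply the best-scoring reduce on adjacent nodes."""
    nodes: List[Any] = list(words)
    while len(nodes) > 1:
        best = None
        for k in range(len(nodes) - 1):
            transition, score = propose(nodes[k], nodes[k + 1])
            if transition.action == "reduce" and (best is None or score > best[0]):
                best = (score, k, transition)
        if best is None:
            break  # every adjacent pair is kept separate; return the forest
        _, k, chosen = best
        # Replace the two children with one node: (label, head, left, right).
        nodes[k:k + 2] = [(chosen.label, chosen.head, nodes[k], nodes[k + 1])]
    return nodes


def always_merge(left: Any, right: Any) -> Tuple[Transition, float]:
    """Toy scorer used only to exercise the loop."""
    return Transition("reduce", "X", "left"), 1.0


tree = greedy_parse(["a", "b", "c"], always_merge)
```

A greedy loop is chosen here only because it is short; a beam or chart search over derivations would follow the same interface.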
  • equation (3) is only one way of generating a syntax tree, and the present invention may also generate a syntax tree by using the variant of the equation (3) or other scoring-based manner, which is not limited in the present invention.
  • the method 100 further includes:
  • the syntax tree of the generated target language sentence can be used to train the target language analyzer. That is to say, the syntax trees of multiple target language sentences can form a target language syntax tree library for training the target language analyzer.
  • existing techniques can be used to train an analyzer from the syntax tree library, and they are not described here.
  • the syntax analysis method of the embodiment of the present invention generates a syntax tree of the target language sentence according to the source language sentence that is the translation of the target language sentence, and can obtain the syntax tree of the target language sentence that is consistent with the syntax knowledge without manual labeling. This can improve the efficiency of syntactic analysis.
  • the sequence numbers of the above processes do not imply an order of execution; the order of execution of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
  • FIG. 6 shows a schematic block diagram of an apparatus 600 for syntax analysis in accordance with an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes:
  • the obtaining module 610 is configured to obtain a source language sentence that is a translation of the target language sentence;
  • a determining module 620 configured to determine, according to the source language sentence, a correspondence between a word of the target language sentence and a word of the source language sentence, a state transition instance of the target language sentence;
  • the generating module 630 is configured to generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
  • a syntax tree of a target language sentence is generated by using a source language sentence that is a translation of the target language sentence.
  • the state transition instance of the target language sentence is determined according to the source language sentence and the correspondence between the words of the target language sentence and the words of the source language sentence, and the syntax tree of the target language sentence is then generated according to the state transition instance of the target language sentence.
  • the target language syntax tree library can be obtained from a plurality of target language sentences. Therefore, the embodiment of the present invention can obtain the target language syntax tree library without manual labeling, and the target language syntax tree library is more in line with the syntactic knowledge than the automatically generated syntax tree library in the unsupervised learning.
  • the apparatus for syntactic analysis generates a syntax tree of a target language sentence according to a source language sentence that is a translation of the target language sentence, and can obtain, without manual labeling, a syntax tree of the target language sentence that conforms to syntactic knowledge, thereby improving the efficiency of syntactic analysis.
  • the determining module 620 is specifically configured to:
  • for any adjacent segments $x_l$ and $x_r$ in the target language sentence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$ are determined according to the correspondence;
  • a state transition instance of the target language sentence is determined according to a state transition instance corresponding to all adjacent segments in the target language sentence.
  • the determining module 620 is specifically configured to:
  • the determining module 620 is specifically configured to:
  • a state transition instance of the target language sentence is determined according to a score of the state transition instance corresponding to all adjacent segments in the target language sentence.
  • the determining module 620 is specifically configured to:
  • the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence are determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.
  • the determining module 620 is specifically configured to:
  • score the state transition instance corresponding to $x_l$ and $x_r$ according to $p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
  • where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained from $x_l$ and $x_r$, and $y_l$ and $y_r$.
  • the acquiring module 610 is specifically configured to:
  • the source language sentence that is a translation of the target language sentence is obtained from a parallel corpus of the target language and the source language.
  • the generating module 630 is specifically configured to:
  • T represents the state transition operation and D represents the derivation of the syntax tree.
  • the apparatus 600 further includes:
  • a training module is configured to train the target language analyzer according to a syntax tree of the target language sentence.
  • the apparatus 600 for syntax analysis according to an embodiment of the present invention may correspond to the execution body of the method of syntax analysis according to an embodiment of the present invention, and the above and other operations and/or functions of the modules in the apparatus 600 respectively implement the corresponding processes of the foregoing method; for brevity, they are not described again here.
  • the apparatus for syntactic analysis generates a syntax tree of a target language sentence according to a source language sentence that is a translation of the target language sentence, and can obtain, without manual labeling, a syntax tree of the target language sentence that conforms to syntactic knowledge. This can improve the efficiency of syntactic analysis.
  • FIG. 7 shows the structure of an apparatus for syntax analysis provided by still another embodiment of the present invention, comprising at least one processor 702 (for example, a CPU), at least one network interface 705 or other communication interface, a memory 706, and at least one communication bus 703 used to implement connection and communication between these components.
  • the processor 702 is configured to execute executable modules, such as computer programs, stored in the memory 706.
  • the memory 706 may include a high speed random access memory (RAM), and may also include a non-volatile memory such as at least one disk memory.
  • a communication connection with at least one other network element is achieved by at least one network interface 705 (which may be wired or wireless).
  • the memory 706 stores a program 7061, and the processor 702 executes the program 7061 for performing the following operations:
  • a syntax tree of the target language sentence is generated according to the state transition instance of the target language sentence.
  • processor 702 is specifically configured to:
  • for any adjacent segments $x_l$ and $x_r$ in the target language sentence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$ are determined according to the correspondence;
  • a state transition instance of the target language sentence is determined according to a state transition instance corresponding to all adjacent segments in the target language sentence.
  • processor 702 is specifically configured to:
  • processor 702 is specifically configured to:
  • a state transition instance of the target language sentence is determined according to a score of the state transition instance corresponding to all adjacent segments in the target language sentence.
  • processor 702 is specifically configured to:
  • the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence are determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.
  • processor 702 is specifically configured to:
  • score the state transition instance corresponding to $x_l$ and $x_r$ according to $p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
  • where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained from $x_l$ and $x_r$, and $y_l$ and $y_r$.
  • processor 702 is specifically configured to:
  • the source language sentence that is a translation of the target language sentence is obtained from a parallel corpus of the target language and the source language.
  • processor 702 is specifically configured to:
  • T represents the state transition operation and D represents the derivation of the syntax tree.
  • the processor 702 is further configured to train the target language analyzer according to the syntax tree of the target language sentence.
  • the syntax tree of the target language sentence is generated according to the source language sentence that is a translation of the target language sentence, and a syntax tree of the target language sentence that conforms to syntactic knowledge can be obtained without manual labeling, which improves the efficiency of syntactic analysis.
  • the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist.
  • for example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone.
  • the character "/" in this document generally indicates an "or" relationship between the objects before and after it.
  • the disclosed systems, devices, and methods may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, or an electrical, mechanical or other form of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present invention.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it can be stored on a computer-readable storage medium.
  • the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium,
  • which includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the foregoing storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

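A method and apparatus for syntactic analysis. The method includes: obtaining a source language sentence that is a translation of a target language sentence (S110); determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence (S120); and generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence (S130). The method and apparatus can improve the efficiency of syntactic analysis.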

Description

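Method and apparatus for syntactic analysis
This application claims priority to Chinese Patent Application No. 201510435938.0, filed with the Chinese Patent Office on July 22, 2015 and entitled "Method and apparatus for syntactic analysis", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of information technology and, more particularly, to a method and apparatus for syntactic analysis.
Background
With the explosive growth of online text data brought about by the rapid development of the Internet, and with the development of economic globalization, the exchange of information between countries has become increasingly frequent. At the same time, the thriving Internet makes it much easier to obtain information in many languages, such as English, Chinese, French, German, and Japanese. The related language services include information retrieval, text summarization, machine translation, and automatic question answering. Syntactic analysis can bring large performance gains to many of these services. Syntactic analysis analyzes the structure of language; the sentence structure it produces helps the applications that consume it capture the structural information of a sentence and, on that basis, further understand its semantics. It has several uses in machine translation: it can help statistical machine translation perform long-distance reordering when translating from the source language to the target language, and it can guide the generation of the target translation so that the translation better conforms to grammatical structure, thereby improving translation quality.
However, apart from a few widely used languages such as English, Japanese, French, and German, syntactic analysis is far from well developed for many less widely used languages, for example Southeast Asian languages including Thai, Burmese, Vietnamese, and Khmer. The bottleneck is the severe scarcity of syntactic resources for these languages. Building syntactic resources requires a great deal of human effort, and only after the resources reach a certain scale can automatic syntactic analysis perform well enough to be applied. In addition, manually building syntactic resources in practice raises the problem of establishing standards for the syntactic structures of different languages, and the annotation standards should be unified as far as possible. These difficulties make it hard to build automatic parsers for resource-scarce languages in the short term.
Existing syntactic analysis methods can be roughly divided into two categories: supervised syntactic analysis and unsupervised syntactic analysis. Supervised syntactic analysis extracts features from a manually annotated treebank and uses a machine learning model to learn the relationship between the features and the manually annotated syntactic structures; for a sentence to be annotated, it searches, according to the learned model, for the combination of syntactic structures that matches the features in the sentence, so as to generate a syntax tree for the given sentence. Supervised syntactic analysis requires the machine learning model to learn operation decisions that relate features to annotated syntactic structures, and obtaining these decision examples requires a large amount of manually annotated data. Without training data, syntactic analysis of the language in question cannot be carried out at all. Manually annotating a treebank consumes a great deal of labor and time, and keeping the annotation standard consistent is also difficult. Even when training data exists, if the data set is too small, learning is highly prone to overfitting, which leads to poor performance in actual use.
Unsupervised syntactic analysis automatically generates an annotated treebank from sentences that carry no annotation. Its biggest flaw is that it relies solely on unsupervised learning over raw text, and therefore cannot yield a parser that is usable in practice.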
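Summary
Embodiments of the present invention provide a method and apparatus for syntactic analysis that can automatically generate a syntax tree conforming to syntactic knowledge, thereby improving the efficiency of syntactic analysis.
According to a first aspect, a method of syntactic analysis is provided, including:
obtaining a source language sentence that is a translation of a target language sentence;
determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence;
generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
With reference to the first aspect, in a first possible implementation, determining the state transition instance of the target language sentence according to the source language sentence and the correspondence between the words of the target language sentence and the words of the source language sentence includes:
obtaining a syntax tree of the source language sentence according to the source language sentence;
for any adjacent segments $x_l$ and $x_r$ in the target language sentence, determining, according to the correspondence, the segments $y_l$ and $y_r$ of the source language sentence corresponding to $x_l$ and $x_r$;
if $y_l$ and $y_r$ are constituents in the syntax tree of the source language sentence, obtaining the state transition instance corresponding to $x_l$ and $x_r$ according to the relationship between $y_l$ and $y_r$ in the syntax tree of the source language sentence;
determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
With reference to the first possible implementation of the first aspect, in a second possible implementation, obtaining the state transition instance corresponding to $x_l$ and $x_r$ according to the constituent relationship between $y_l$ and $y_r$ in the syntax tree of the source language sentence includes:
if $y_l$ and $y_r$ form a constituent in the syntax tree of the source language sentence, obtaining a merge operation instance;
if $y_l$ and $y_r$ cannot form a constituent in the syntax tree of the source language sentence, obtaining a separate operation instance.
With reference to the first or second possible implementation of the first aspect, in a third possible implementation, the method further includes:
scoring the state transition instance corresponding to $x_l$ and $x_r$;
determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence includes:
determining the state transition instance of the target language sentence according to the scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, determining the state transition instance of the target language sentence according to the scores of the state transition instances corresponding to all adjacent segments in the target language sentence includes:
determining the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence as the state transition instances of the target language sentence, where N is the length of the target language sentence.
With reference to the third or fourth possible implementation of the first aspect, in a fifth possible implementation, scoring the state transition instance corresponding to $x_l$ and $x_r$ includes:
scoring the state transition instance corresponding to $x_l$ and $x_r$ according to the following equations,
$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A)$,
Figure PCTCN2016072422-appb-000001
where $A$ is the alignment matrix and $p(x_l, x_r, y_l, y_r \mid A)$ represents the score of the state transition instance obtained from $x_l$ and $x_r$, and $y_l$ and $y_r$.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a sixth possible implementation, obtaining the source language sentence that is a translation of the target language sentence includes:
obtaining, from a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in a seventh possible implementation, generating the syntax tree of the target language sentence according to the state transition instance of the target language sentence includes:
generating the syntax tree Y(X) of the target language sentence X according to the following equation,
Figure PCTCN2016072422-appb-000002
where T represents a state transition operation and D represents a derivation of the syntax tree.
With reference to the first aspect or any of the foregoing possible implementations of the first aspect, in an eighth possible implementation, the method further includes:
training a target language analyzer according to the syntax tree of the target language sentence.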
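According to a second aspect, a syntax analysis apparatus is provided, including:
an obtaining module, configured to obtain a source language sentence, where the source language sentence and a target language sentence are translations of each other;
a determining module, configured to determine a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
a generating module, configured to generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
With reference to the second aspect, in a first possible implementation, the determining module is specifically configured to:
obtain a syntax tree of the source language sentence according to the source language sentence;
for any adjacent segments x_l and x_r in the target language sentence, determine, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
if y_l and y_r are constituents in the syntax tree of the source language sentence, obtain a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
determine the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the determining module is specifically configured to:
if y_l and y_r form one constituent in the syntax tree of the source language sentence, obtain a reduce operation instance; or
if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, obtain a separate operation instance.
With reference to the first or second possible implementation of the second aspect, in a third possible implementation, the determining module is specifically configured to:
score the state transition instance corresponding to x_l and x_r; and
determine the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, the determining module is specifically configured to:
determine, as the state transition instances of the target language sentence, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence, where N is the length of the target language sentence.
With reference to the third or fourth possible implementation of the second aspect, in a fifth possible implementation, the determining module is specifically configured to:
score the state transition instance corresponding to x_l and x_r according to the following equations:
$$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A),$$
Figure PCTCN2016072422-appb-000003
where A is an alignment matrix, and p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in a sixth possible implementation, the obtaining module is specifically configured to:
obtain, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in a seventh possible implementation, the generating module is specifically configured to:
generate a syntax tree Y(X) of the target language sentence X according to the following equation:
Figure PCTCN2016072422-appb-000004
where T denotes a state transition operation, and D denotes a derivation of the syntax tree.
With reference to the second aspect or any one of the foregoing possible implementations of the second aspect, in an eighth possible implementation, the apparatus further includes:
a training module, configured to train a target language parser according to the syntax tree of the target language sentence.
Based on the foregoing technical solutions, in the embodiments of the present invention a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.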
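Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly describes the accompanying drawings needed for the embodiments. Apparently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a syntax analysis method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a syntax tree of a source language sentence according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of segment correspondence according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of instance extraction according to an embodiment of the present invention.
FIG. 5 is a schematic flowchart of a syntax analysis method according to another embodiment of the present invention.
FIG. 6 is a schematic block diagram of a syntax analysis apparatus according to an embodiment of the present invention.
FIG. 7 is a schematic structural diagram of a syntax analysis apparatus according to another embodiment of the present invention.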
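Description of Embodiments
The following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
In the embodiments of the present invention, the target language is the language to be analyzed. For example, the target language may be a resource-scarce language for which, because syntactic resources are scarce, no parser exists or no reasonably efficient parser exists.
In the embodiments of the present invention, the source language is a language that can be syntactically analyzed with an existing parser or an existing syntax analysis method. For example, the source language may be a resource-rich language for which a parser already exists or can be trained from an existing syntax treebank.
FIG. 1 shows a schematic flowchart of a syntax analysis method 100 according to an embodiment of the present invention. As shown in FIG. 1, the method 100 includes:
S110: obtain a source language sentence, where the source language sentence and a target language sentence are translations of each other;
S120: determine a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
S130: generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
In this embodiment of the present invention, the syntax tree of the target language sentence is generated by using a source language sentence that is a translation of the target language sentence. For one target language sentence, the state transition instances of the target language sentence are first determined according to the source language sentence and the correspondence between words of the target language sentence and words of the source language sentence, and the syntax tree of the target language sentence is then generated according to those state transition instances. In this way, a target language syntax treebank can be obtained from multiple target language sentences. Therefore, this embodiment obtains a target language syntax treebank without manual annotation, and that treebank conforms to syntactic knowledge better than a treebank generated automatically by unsupervised learning.
Therefore, in the syntax analysis method of this embodiment of the present invention, a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.
In an embodiment of the present invention, optionally, the obtaining a source language sentence that is a translation of a target language sentence includes:
obtaining, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
A parallel corpus is a kind of bilingual corpus in which the source language and the target language are translations of each other at the sentence level. That is, each target language sentence in the parallel corpus has a source language sentence as its translation. For example, the parallel corpus may be a bilingual parallel corpus, a bilingual dictionary, or bilingual correspondence rules. In this embodiment of the present invention, a target language sentence is selected from the parallel corpus, and the syntax tree of the target language sentence is then generated according to its translation (the source language sentence).
Optionally, the corresponding sentences in the parallel corpus may be preprocessed for the respective language; for example, Chinese requires word segmentation and English requires tokenization, in order to reduce data sparsity and increase data consistency.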
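In an embodiment of the present invention, optionally, the determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence includes:
obtaining a syntax tree of the source language sentence according to the source language sentence;
for any adjacent segments x_l and x_r in the target language sentence, determining, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
if y_l and y_r are constituents in the syntax tree of the source language sentence, obtaining a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
Specifically, after the source language sentence corresponding to the target language sentence is obtained, the source language sentence is analyzed to obtain its syntax tree. The syntax tree of the source language sentence may be obtained with an existing source language parser, for example the Stanford parser, or with a parser trained on an existing source language syntax treebank.
Any adjacent segments x_l and x_r in the target language sentence X are enumerated and denoted <x_l, x_r>.
According to the correspondence between words of the target language sentence and words of the source language sentence, the source language segments y_l and y_r corresponding to <x_l, x_r> are obtained and denoted <y_l, y_r>. y_l and y_r are not necessarily adjacent. The correspondence may be obtained with an existing alignment tool such as GIZA++, or with another automatic alignment tool; the present invention imposes no limitation on this. For example, a word alignment may take the form "1:1 2:3 …", which indicates that the 1st word of the source language sentence corresponds to the 1st word of the target language sentence, the 2nd word of the source language sentence corresponds to the 3rd word of the target language sentence, and so on.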
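The projection from a target segment to its source words can be sketched as follows. This is a minimal illustration, assuming the alignment is given as whitespace-separated `source:target` pairs with 1-based word positions, as in the "1:1 2:3 …" form above; the function names `parse_alignment` and `project_segment`, and the extra pair `3:2` in the usage line, are hypothetical and not part of the original disclosure.

```python
from typing import Dict, List, Set, Tuple


def parse_alignment(text: str) -> Dict[int, Set[int]]:
    """Parse 'source:target' pairs (1-based) into a map: target index -> source indices."""
    tgt_to_src: Dict[int, Set[int]] = {}
    for pair in text.split():
        src, tgt = pair.split(":")
        tgt_to_src.setdefault(int(tgt), set()).add(int(src))
    return tgt_to_src


def project_segment(tgt_span: Tuple[int, int],
                    tgt_to_src: Dict[int, Set[int]]) -> List[int]:
    """Return the sorted source word positions aligned to the target words
    in the inclusive, 1-based span (start, end)."""
    start, end = tgt_span
    src: Set[int] = set()
    for t in range(start, end + 1):
        src |= tgt_to_src.get(t, set())
    return sorted(src)


# Hypothetical alignment: source 1 <-> target 1, source 2 <-> target 3, source 3 <-> target 2.
alignment = parse_alignment("1:1 2:3 3:2")
print(project_segment((2, 3), alignment))  # [2, 3]
```

Unaligned target words simply contribute no source positions, so a segment made only of unaligned words projects to an empty list.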
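If the resulting source language segments <y_l, y_r> are not constituents in the syntax tree of the source language sentence, other adjacent segments in the target language sentence X are selected. If the resulting source language segments <y_l, y_r> are constituents in the syntax tree of the source language sentence, a state transition instance is extracted according to the relationship between y_l and y_r. Specifically, if y_l and y_r form one constituent (that is, a larger constituent) in the syntax tree of the source language sentence, a reduce operation instance, that is, a positive example for derivation, is extracted; if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, a separate operation instance, that is, a negative example for derivation, is extracted.
The foregoing steps are repeated until the enumeration is complete. The state transition instances of the target language sentence are then selected from all the extracted state transition instances.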
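The enumeration of adjacent segment pairs can be sketched as below. The sketch assumes that a segment is a contiguous run of words and represents it as a half-open, 0-based span `(start, end)`; both the representation and the name `adjacent_segment_pairs` are illustrative choices rather than anything fixed by the text.

```python
from typing import Iterator, Tuple

Span = Tuple[int, int]  # half-open word span [start, end), 0-based


def adjacent_segment_pairs(n: int) -> Iterator[Tuple[Span, Span]]:
    """Yield every pair of adjacent contiguous segments <x_l, x_r>
    in a sentence of n words: x_l = [start, mid), x_r = [mid, end)."""
    for start in range(n):
        for mid in range(start + 1, n):
            for end in range(mid + 1, n + 1):
                yield (start, mid), (mid, end)


pairs = list(adjacent_segment_pairs(5))
print(len(pairs))  # 20
```

Each pair is fixed by three boundaries start < mid < end, so a sentence of n words has C(n+1, 3) candidate pairs, which keeps exhaustive enumeration cheap for ordinary sentence lengths.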
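In an embodiment of the present invention, optionally, the method 100 further includes:
scoring the state transition instance corresponding to x_l and x_r.
In this case, the determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence includes:
determining the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
Specifically, word correspondence (also called alignment) contains a certain amount of error, especially between structurally different languages, so the strength of the alignment between a source language segment and a target language segment may be high or low. Optionally, when the source language segments <y_l, y_r> corresponding to <x_l, x_r> are determined, they may be selected according to the best one or more alignment results. When state transition instances are selected, they may be selected according to their scores. Optionally, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence may be determined as the state transition instances of the target language sentence, where N is the length of the target language sentence.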
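The top N-1 selection can be sketched with a standard partial sort. The `Instance` record and the function name `select_instances` are hypothetical; only the rule "keep the N-1 highest-scoring instances, where N is the sentence length" comes from the text.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Instance(NamedTuple):
    left: Tuple[int, int]    # target segment x_l as a half-open span
    right: Tuple[int, int]   # target segment x_r as a half-open span
    action: str              # "reduce" or "separate"
    score: float             # alignment-based score of the instance


def select_instances(instances: List[Instance], sentence_length: int) -> List[Instance]:
    """Keep the N-1 highest-scoring instances, where N is the sentence length."""
    keep = max(sentence_length - 1, 0)
    return heapq.nlargest(keep, instances, key=lambda inst: inst.score)


candidates = [
    Instance((0, 1), (1, 2), "reduce", 0.9),
    Instance((1, 2), (2, 3), "separate", 0.4),
    Instance((0, 2), (2, 3), "reduce", 0.7),
]
for inst in select_instances(candidates, sentence_length=3):
    print(inst.action, inst.score)
```

One plausible reading of the N-1 limit is that a binary tree over N words contains exactly N-1 internal nodes, but the text does not state this rationale.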
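In an embodiment of the present invention, optionally, the state transition instance corresponding to x_l and x_r may be scored according to the following equations:
$$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A) \qquad (1)$$
Figure PCTCN2016072422-appb-000005
where A is an alignment matrix, p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r, i is a word in segment x, and j is a word in segment y.
It should be understood that equations (1) and (2) are only one way of scoring an instance. The present invention may also score instances in other ways, for example by using another alignment method or another alignment matrix; the present invention imposes no limitation on this.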
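Equation (1) composes the instance score from two segment-level scores. The sketch below implements only that product and takes the segment-level scorer as a parameter, because equation (2) is an image that is not reproduced in the text. The default scorer `mean_alignment_score` is a stand-in invented for illustration (an average of alignment weights) and should not be read as the patent's equation (2).

```python
from typing import Callable, Sequence

Matrix = Sequence[Sequence[float]]  # A[i][j]: weight between target word i and source word j
SpanScorer = Callable[[Sequence[int], Sequence[int], Matrix], float]


def mean_alignment_score(x: Sequence[int], y: Sequence[int], A: Matrix) -> float:
    """ASSUMPTION: placeholder for p(x, y | A); not the patent's equation (2)."""
    if not x or not y:
        return 0.0
    total = sum(A[i][j] for i in x for j in y)
    return total / (len(x) * len(y))


def instance_score(xl: Sequence[int], xr: Sequence[int],
                   yl: Sequence[int], yr: Sequence[int],
                   A: Matrix,
                   span_score: SpanScorer = mean_alignment_score) -> float:
    """Equation (1): p(x_l, x_r, y_l, y_r | A) = p(x_l, y_l | A) * p(x_r, y_r | A)."""
    return span_score(xl, yl, A) * span_score(xr, yr, A)


A = [
    [1.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.5, 0.5],
]
print(instance_score([0], [1, 2], [0], [1, 2], A))  # 0.5
```

Passing the scorer in as a function keeps the product in equation (1) fixed while leaving room for whichever segment-level definition or alignment matrix is actually used.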
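The following describes the embodiments of the present invention in detail with reference to a specific example. It should be noted that the example is only intended to help a person skilled in the art better understand the embodiments of the present invention, and does not limit their scope.
The given target language sentence is "railway workers learn English grammar", and its translation is the source language sentence "铁路工人学习英语语法".
For the source language sentence, its syntax tree can be obtained with an existing parser, as shown in FIG. 2.
Two adjacent segments of the target language sentence are enumerated; for example, the two adjacent segments <x_l, x_r> are <learn, English grammar>.
According to the correspondence, the segments <y_l, y_r> in the source language sentence that correspond to the two adjacent segments are obtained. As shown in FIG. 3, the corresponding segments are <学习, 英语语法>.
It is determined whether <y_l, y_r> are constituents in the syntax tree of the source language sentence. As can be seen from FIG. 2, <学习, 英语语法> are constituents in the syntax tree of the source language sentence, namely <V, NP>.
A state transition instance is extracted according to the relationship between y_l and y_r. As shown in FIG. 4, <V, NP> form the larger constituent VP, that is, <V, NP> can be reduced. It is therefore inferred that <learn, English grammar> can also be reduced, so a reduce operation instance is extracted and scored.
The foregoing steps are repeated until all adjacent segments have been enumerated, and finally the N-1 highest-scoring instances are selected (N is the length of the target language sentence).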
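The worked example can be traced in code. FIG. 2 and FIG. 3 are not reproduced in the text, so the sketch assumes a source tree S → NP VP, VP → V NP over the segmentation 铁路 / 工人 / 学习 / 英语 / 语法, the word-level label N, and a one-to-one monotone alignment; only the facts that <学习, 英语语法> is <V, NP> and reduces to VP come from the description. All names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple

Span = Tuple[int, int]  # half-open word span [start, end), 0-based

# ASSUMED reconstruction of the source syntax tree as labeled spans.
SOURCE_CONSTITUENTS: Dict[Span, str] = {
    (0, 5): "S", (0, 2): "NP", (2, 5): "VP", (2, 3): "V", (3, 5): "NP",
    (0, 1): "N", (1, 2): "N", (3, 4): "N", (4, 5): "N",
}
# ASSUMED alignment: target word i <-> source word i.
TGT_TO_SRC: Dict[int, Set[int]] = {i: {i} for i in range(5)}


def project(span: Span, tgt_to_src: Dict[int, Set[int]]) -> Optional[Span]:
    """Map a target span to a contiguous source span, or None if impossible."""
    src: Set[int] = set()
    for t in range(span[0], span[1]):
        src |= tgt_to_src.get(t, set())
    if not src:
        return None
    lo, hi = min(src), max(src) + 1
    return (lo, hi) if len(src) == hi - lo else None


def extract_instance(xl: Span, xr: Span,
                     constituents: Dict[Span, str],
                     tgt_to_src: Dict[int, Set[int]]
                     ) -> Optional[Tuple[str, Optional[str]]]:
    """Return ("reduce", label), ("separate", None), or None when y_l or y_r
    is not a constituent of the source tree (no instance is extracted)."""
    yl, yr = project(xl, tgt_to_src), project(xr, tgt_to_src)
    if yl not in constituents or yr not in constituents:
        return None
    merged: Optional[Span] = None
    if yl[1] == yr[0]:
        merged = (yl[0], yr[1])
    elif yr[1] == yl[0]:
        merged = (yr[0], yl[1])
    if merged in constituents:
        return ("reduce", constituents[merged])
    return ("separate", None)


# <learn, English grammar> -> <学习, 英语语法> = <V, NP>, which forms VP.
print(extract_instance((2, 3), (3, 5), SOURCE_CONSTITUENTS, TGT_TO_SRC))  # ('reduce', 'VP')
```

Under the same assumptions, <workers, learn> projects to <工人, 学习>, two constituents that do not combine into one, so the call returns a separate instance, while <workers learn, English grammar> yields no instance because 工人学习 is not a constituent.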
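After the state transition instances of the target language sentence are obtained, the syntax tree of the target language sentence can be generated according to those state transition instances.
In an embodiment of the present invention, optionally, the syntax tree Y(X) of the target language sentence X may be generated according to the following equation:
Figure PCTCN2016072422-appb-000006
where T denotes a state transition operation, and D denotes a derivation of the syntax tree.
In equation (3), the state transition operation T corresponding to an instance may be written as T = (λ, α, β), where λ ∈ {reduce, separate} indicates whether two constituents are to be reduced or separated, α ∈ NT denotes the target non-terminal after reduction, and β ∈ {left, right} indicates which constituent is the head after reduction.
The state transition operation (λ, α, β) can be decomposed into two parts:
(λ, α), a constituency parsing operation;
(λ, β), a dependency parsing operation.
The score p(T) of the state transition operation T = (λ, α, β) is the product of the scores of the two parts:
$$p(T \mid S, C_c, C_d) = p(\lambda, \alpha, \beta \mid S, C_c, C_d) = p(\lambda, \alpha \mid S, C_c) \times p(\lambda, \beta \mid S, C_d) \qquad (4)$$
where S denotes the state, and C_c and C_d denote the constituent classifier and the dependency classifier, respectively.
For p(λ, α | S, C_c) and p(λ, β | S, C_d), feature templates may be used to extract the features of each instance, and the probabilities (that is, the scores) are obtained by training classifiers.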
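The factorization in equation (4) can be sketched with a small record type and two pluggable classifiers. The `Transition` class, the classifier call signature, and the toy classifiers in the usage lines are assumptions made for illustration; the text only says that the two probabilities come from trained classifiers over template features.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

# Each classifier maps (state, action, label-or-head) to a probability.
Classifier = Callable[[Any, str, Optional[str]], float]


@dataclass(frozen=True)
class Transition:
    action: str            # lambda: "reduce" or "separate"
    label: Optional[str]   # alpha: non-terminal produced by a reduce
    head: Optional[str]    # beta: "left" or "right" head after a reduce


def transition_score(t: Transition, state: Any,
                     constituent_clf: Classifier,
                     dependency_clf: Classifier) -> float:
    """Equation (4): p(T | S, C_c, C_d) = p(lambda, alpha | S, C_c) * p(lambda, beta | S, C_d)."""
    return (constituent_clf(state, t.action, t.label)
            * dependency_clf(state, t.action, t.head))


# Toy classifiers with fixed outputs, for demonstration only.
def toy_cc(state: Any, action: str, label: Optional[str]) -> float:
    return 0.8 if (action, label) == ("reduce", "VP") else 0.1


def toy_cd(state: Any, action: str, head: Optional[str]) -> float:
    return 0.5 if (action, head) == ("reduce", "left") else 0.1


print(transition_score(Transition("reduce", "VP", "left"), None, toy_cc, toy_cd))
```

Keeping the constituent part (λ, α) and the dependency part (λ, β) in separate classifiers mirrors the decomposition described above and lets each be trained on its own feature templates.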
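It should be understood that equation (3) is only one way of generating a syntax tree. The present invention may also generate a syntax tree by using a variant of equation (3) or another score-based method; the present invention imposes no limitation on this.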
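One score-based way to generate a tree is a chart search over binary bracketings. The sketch below assumes that the best derivation is the one that maximizes the product of the scores of its reduce operations; equation (3) is not reproduced in the text, so this objective, the function `best_tree`, and the callback `reduce_prob` are assumptions, and labels and heads are omitted for brevity.

```python
from functools import lru_cache
from typing import Callable, Tuple, Union

Tree = Union[int, tuple]  # a leaf is a word index; an internal node is (left, right)


def best_tree(n: int,
              reduce_prob: Callable[[int, int, int], float]) -> Tuple[float, Tree]:
    """Return (score, tree) for the best binary bracketing of words 0..n-1.

    reduce_prob(i, k, j) is the score of reducing the adjacent segments
    [i, k) and [k, j) into one constituent [i, j).
    """
    if n < 1:
        raise ValueError("sentence must contain at least one word")

    @lru_cache(maxsize=None)
    def best(i: int, j: int) -> Tuple[float, Tree]:
        if j - i == 1:
            return 1.0, i
        candidates = []
        for k in range(i + 1, j):
            left_score, left_tree = best(i, k)
            right_score, right_tree = best(k, j)
            score = left_score * right_score * reduce_prob(i, k, j)
            candidates.append((score, (left_tree, right_tree)))
        return max(candidates, key=lambda c: c[0])

    return best(0, n)


# Toy scores for a three-word sentence; unspecified reductions get 0.5.
toy = {(0, 1, 2): 0.9, (0, 2, 3): 0.8, (1, 2, 3): 0.1, (0, 1, 3): 0.5}
score, tree = best_tree(3, lambda i, k, j: toy.get((i, k, j), 0.5))
print(tree)  # ((0, 1), 2)
```

The memoized recursion visits each span once, so the search is cubic in sentence length; a greedy or beam-based transition parser would be an equally valid alternative under the "other score-based methods" wording above.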
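In an embodiment of the present invention, as shown in FIG. 5, optionally, the method 100 further includes:
S140: train a target language parser according to the syntax tree of the target language sentence.
Specifically, the syntax tree of the target language sentence generated as described above can be used to train a target language parser. That is, the syntax trees of multiple target language sentences can form a target language syntax treebank that is used to train the target language parser. Existing techniques can be used to train a parser from a syntax treebank, and details are not described here.
In the syntax analysis method of this embodiment of the present invention, a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a relatively good syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.
It should be understood that, in the various embodiments of the present invention, the sequence numbers of the foregoing processes do not imply an execution order. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
The syntax analysis method according to the embodiments of the present invention is described in detail above. The following describes the syntax analysis apparatus according to the embodiments of the present invention.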
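FIG. 6 shows a schematic block diagram of a syntax analysis apparatus 600 according to an embodiment of the present invention. As shown in FIG. 6, the apparatus 600 includes:
an obtaining module 610, configured to obtain a source language sentence, where the source language sentence and a target language sentence are translations of each other;
a determining module 620, configured to determine a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
a generating module 630, configured to generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
In this embodiment of the present invention, the syntax tree of the target language sentence is generated by using a source language sentence that is a translation of the target language sentence. For one target language sentence, the state transition instances of the target language sentence are first determined according to the source language sentence and the correspondence between words of the target language sentence and words of the source language sentence, and the syntax tree of the target language sentence is then generated according to those state transition instances. In this way, a target language syntax treebank can be obtained from multiple target language sentences. Therefore, this embodiment obtains a target language syntax treebank without manual annotation, and that treebank conforms to syntactic knowledge better than a treebank generated automatically by unsupervised learning.
Therefore, with the syntax analysis apparatus of this embodiment of the present invention, a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.
In an embodiment of the present invention, optionally, the determining module 620 is specifically configured to:
obtain a syntax tree of the source language sentence according to the source language sentence;
for any adjacent segments x_l and x_r in the target language sentence, determine, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
if y_l and y_r are constituents in the syntax tree of the source language sentence, obtain a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
determine the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
In an embodiment of the present invention, optionally, the determining module 620 is specifically configured to:
if y_l and y_r form one constituent in the syntax tree of the source language sentence, obtain a reduce operation instance; or
if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, obtain a separate operation instance.
In an embodiment of the present invention, optionally, the determining module 620 is specifically configured to:
score the state transition instance corresponding to x_l and x_r; and
determine the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
In an embodiment of the present invention, optionally, the determining module 620 is specifically configured to:
determine, as the state transition instances of the target language sentence, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence, where N is the length of the target language sentence.
In an embodiment of the present invention, optionally, the determining module 620 is specifically configured to:
score the state transition instance corresponding to x_l and x_r according to the following equations:
$$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A),$$
Figure PCTCN2016072422-appb-000007
where A is an alignment matrix, and p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r.
In an embodiment of the present invention, optionally, the obtaining module 610 is specifically configured to:
obtain, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
In an embodiment of the present invention, optionally, the generating module 630 is specifically configured to:
generate a syntax tree Y(X) of the target language sentence X according to the following equation:
Figure PCTCN2016072422-appb-000008
where T denotes a state transition operation, and D denotes a derivation of the syntax tree.
In an embodiment of the present invention, optionally, the apparatus 600 further includes:
a training module, configured to train a target language parser according to the syntax tree of the target language sentence.
The syntax analysis apparatus 600 according to this embodiment of the present invention may correspond to the entity that performs the syntax analysis method according to the embodiments of the present invention, and the foregoing and other operations and/or functions of the modules in the apparatus 600 are respectively intended to implement the corresponding procedures of the foregoing method. For brevity, details are not described here again.
With the syntax analysis apparatus of this embodiment of the present invention, a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a relatively good syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.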
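FIG. 7 shows the structure of a syntax analysis apparatus according to still another embodiment of the present invention. The apparatus includes at least one processor 702 (for example, a CPU), at least one network interface 705 or other communications interface, a memory 706, and at least one communications bus 703 configured to implement connection and communication between these components. The processor 702 is configured to execute an executable module, for example a computer program, stored in the memory 706. The memory 706 may include a high-speed random access memory (RAM: Random Access Memory), and may further include a non-volatile memory, for example at least one disk memory. A communication connection to at least one other network element is implemented through the at least one network interface 705 (which may be wired or wireless).
In some implementations, the memory 706 stores a program 7061, and the processor 702 executes the program 7061 to perform the following operations:
obtaining a source language sentence, where the source language sentence and a target language sentence are translations of each other;
determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
Optionally, the processor 702 is specifically configured to:
obtain a syntax tree of the source language sentence according to the source language sentence;
for any adjacent segments x_l and x_r in the target language sentence, determine, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
if y_l and y_r are constituents in the syntax tree of the source language sentence, obtain a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
determine the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
Optionally, the processor 702 is specifically configured to:
if y_l and y_r form one constituent in the syntax tree of the source language sentence, obtain a reduce operation instance; or
if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, obtain a separate operation instance.
Optionally, the processor 702 is specifically configured to:
score the state transition instance corresponding to x_l and x_r; and
determine the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
Optionally, the processor 702 is specifically configured to:
determine, as the state transition instances of the target language sentence, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence, where N is the length of the target language sentence.
Optionally, the processor 702 is specifically configured to:
score the state transition instance corresponding to x_l and x_r according to the following equations:
$$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A),$$
Figure PCTCN2016072422-appb-000009
where A is an alignment matrix, and p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r.
Optionally, the processor 702 is specifically configured to:
obtain, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
Optionally, the processor 702 is specifically configured to:
generate a syntax tree Y(X) of the target language sentence X according to the following equation:
Figure PCTCN2016072422-appb-000010
where T denotes a state transition operation, and D denotes a derivation of the syntax tree.
Optionally, the processor 702 is further configured to train a target language parser according to the syntax tree of the target language sentence.
As can be seen from the foregoing technical solutions provided in the embodiments of the present invention, a syntax tree of a target language sentence is generated according to a source language sentence that is a translation of the target language sentence, so that a syntax tree conforming to syntactic knowledge can be obtained for the target language sentence without manual annotation, which improves the efficiency of syntax analysis.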
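It should be understood that, in the embodiments of the present invention, the term "and/or" merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may represent three cases: only A exists, both A and B exist, and only B exists. In addition, the character "/" in this document generally indicates an "or" relationship between the associated objects before and after it.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly describe the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether the functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such an implementation should not be considered to go beyond the scope of the present invention.
A person skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the system, apparatus, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present invention.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
When the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent modification or replacement readily figured out by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.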

Claims (18)

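  1. A syntax analysis method, comprising:
    obtaining a source language sentence, wherein the source language sentence and a target language sentence are translations of each other;
    determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
    generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
  2. The method according to claim 1, wherein the determining a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence comprises:
    obtaining a syntax tree of the source language sentence according to the source language sentence;
    for any adjacent segments x_l and x_r in the target language sentence, determining, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
    if y_l and y_r are constituents in the syntax tree of the source language sentence, obtaining a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
    determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
  3. The method according to claim 2, wherein the obtaining a state transition instance corresponding to x_l and x_r according to a constituent relationship between y_l and y_r in the syntax tree of the source language sentence comprises:
    if y_l and y_r form one constituent in the syntax tree of the source language sentence, obtaining a reduce operation instance; or
    if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, obtaining a separate operation instance.
  4. The method according to claim 2 or 3, wherein the method further comprises:
    scoring the state transition instance corresponding to x_l and x_r; and
    the determining the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence comprises:
    determining the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
  5. The method according to claim 4, wherein the determining the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence comprises:
    determining, as the state transition instances of the target language sentence, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence, wherein N is the length of the target language sentence.
  6. The method according to claim 4 or 5, wherein the scoring the state transition instance corresponding to x_l and x_r comprises:
    scoring the state transition instance corresponding to x_l and x_r according to the following equation:
    $$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A),$$
    wherein A is an alignment matrix, and p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r.
  7. The method according to any one of claims 1 to 6, wherein the obtaining a source language sentence comprises:
    obtaining, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
  8. The method according to any one of claims 1 to 7, wherein the generating a syntax tree of the target language sentence according to the state transition instance of the target language sentence comprises:
    generating a syntax tree Y(X) of the target language sentence X according to the following equation:
    Figure PCTCN2016072422-appb-100002
    wherein T denotes a state transition operation, and D denotes a derivation of the syntax tree.
  9. The method according to any one of claims 1 to 8, wherein the method further comprises:
    training a target language parser according to the syntax tree of the target language sentence.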
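  10. A syntax analysis apparatus, comprising:
    an obtaining module, configured to obtain a source language sentence, wherein the source language sentence and a target language sentence are translations of each other;
    a determining module, configured to determine a state transition instance of the target language sentence according to the source language sentence and a correspondence between words of the target language sentence and words of the source language sentence; and
    a generating module, configured to generate a syntax tree of the target language sentence according to the state transition instance of the target language sentence.
  11. The apparatus according to claim 10, wherein the determining module is specifically configured to:
    obtain a syntax tree of the source language sentence according to the source language sentence;
    for any adjacent segments x_l and x_r in the target language sentence, determine, according to the correspondence, segments y_l and y_r of the source language sentence that correspond to x_l and x_r;
    if y_l and y_r are constituents in the syntax tree of the source language sentence, obtain a state transition instance corresponding to x_l and x_r according to a relationship between y_l and y_r in the syntax tree of the source language sentence; and
    determine the state transition instance of the target language sentence according to the state transition instances corresponding to all adjacent segments in the target language sentence.
  12. The apparatus according to claim 11, wherein the determining module is specifically configured to:
    if y_l and y_r form one constituent in the syntax tree of the source language sentence, obtain a reduce operation instance; or
    if y_l and y_r cannot form one constituent in the syntax tree of the source language sentence, obtain a separate operation instance.
  13. The apparatus according to claim 11 or 12, wherein the determining module is specifically configured to:
    score the state transition instance corresponding to x_l and x_r; and
    determine the state transition instance of the target language sentence according to scores of the state transition instances corresponding to all adjacent segments in the target language sentence.
  14. The apparatus according to claim 13, wherein the determining module is specifically configured to:
    determine, as the state transition instances of the target language sentence, the N-1 highest-scoring state transition instances among the state transition instances corresponding to all adjacent segments in the target language sentence, wherein N is the length of the target language sentence.
  15. The apparatus according to claim 13 or 14, wherein the determining module is specifically configured to:
    score the state transition instance corresponding to x_l and x_r according to the following equations:
    $$p(x_l, x_r, y_l, y_r \mid A) = p(x_l, y_l \mid A) \times p(x_r, y_r \mid A),$$
    Figure PCTCN2016072422-appb-100003
    wherein A is an alignment matrix, and p(x_l, x_r, y_l, y_r | A) denotes the score of the state transition instance obtained according to x_l and x_r and according to y_l and y_r.
  16. The apparatus according to any one of claims 10 to 15, wherein the obtaining module is specifically configured to:
    obtain, according to a parallel corpus of the target language and the source language, the source language sentence that is a translation of the target language sentence.
  17. The apparatus according to any one of claims 10 to 16, wherein the generating module is specifically configured to:
    generate a syntax tree Y(X) of the target language sentence X according to the following equation:
    Figure PCTCN2016072422-appb-100004
    wherein T denotes a state transition operation, and D denotes a derivation of the syntax tree.
  18. The apparatus according to any one of claims 10 to 17, wherein the apparatus further comprises:
    a training module, configured to train a target language parser according to the syntax tree of the target language sentence.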
PCT/CN2016/072422 2015-07-22 2016-01-28 Syntax analysis method and apparatus WO2017012327A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/872,993 US10909315B2 (en) 2015-07-22 2018-01-17 Syntax analysis method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510435938.0 2015-07-22
CN201510435938.0A CN106372053B (zh) 2015-07-22 2015-07-22 Syntax analysis method and apparatus

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US15/872,993 Continuation US10909315B2 (en) 2015-07-22 2018-01-17 Syntax analysis method and apparatus

Publications (1)

Publication Number Publication Date
WO2017012327A1 true WO2017012327A1 (zh) 2017-01-26

Family

ID=57834797

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/072422 WO2017012327A1 (zh) 2015-07-22 2016-01-28 Syntax analysis method and apparatus

Country Status (3)

Country Link
US (1) US10909315B2 (zh)
CN (1) CN106372053B (zh)
WO (1) WO2017012327A1 (zh)

Also Published As

Publication number Publication date
CN106372053A (zh) 2017-02-01
CN106372053B (zh) 2020-04-28
US10909315B2 (en) 2021-02-02
US20180157634A1 (en) 2018-06-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16827020

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16827020

Country of ref document: EP

Kind code of ref document: A1