CN107391495B - Sentence alignment method of bilingual parallel corpus - Google Patents


Info

Publication number: CN107391495B
Authority: CN (China)
Prior art keywords: sentence, alignment, word, source language, target language
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201710433746.5A
Other languages: Chinese (zh)
Other versions: CN107391495A
Inventors: 刘强 (Liu Qiang), 彭蓉 (Peng Rong)
Current Assignee: Beijing Tongwen Century Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing Tongwen Century Technology Co ltd
Application filed by Beijing Tongwen Century Technology Co ltd
Priority: CN201710433746.5A
Publication of application CN107391495A; application granted; publication of grant CN107391495B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/51: Translation evaluation
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a sentence alignment method for a bilingual parallel corpus, comprising the following steps: A. acquire a bilingual probability distribution dictionary containing word translation pairs of a source language and a target language together with their word translation probabilities; B. construct a dynamic programming matrix from the numbers of source-language and target-language sentences in the text to be aligned, and determine, from the dynamic programming matrix and the bilingual probability distribution dictionary, evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; C. determine, from the evaluation scores, the alignment path of each alignment pattern whose evaluation score exceeds a specified threshold; D. determine, from those alignment paths, the alignment path sequence of the source-language and target-language sentences of the text to be aligned. The sentence alignment method of the present application thereby helps improve the precision of automatic sentence alignment for bilingual parallel corpora.

Description

Sentence alignment method of bilingual parallel corpus
Technical Field
The invention relates to the technical field of language translation processing, in particular to a sentence alignment method for bilingual parallel corpora.
Background
Sentence alignment means determining which sentence(s) in the source-language text and which sentence(s) in the target-language text are translations of each other. The difficulty of sentence alignment is that the mapping between sentences in a bilingual text can be many-to-many, which easily produces mismatches.
Currently, sentence alignment methods in the prior art include methods based on sentence length, methods based on word or character-string alignment, methods based on offset position alignment, and the like. These methods rely on sentence length, sentence position, or the sentence-length ratio between the two languages. However, the above sentence alignment methods align poorly because each relies on a single kind of alignment parameter.
Therefore, a bilingual parallel corpus sentence alignment method is needed to improve the sentence alignment effect of the bilingual parallel corpus.
Disclosure of Invention
In view of the above, the present application provides a sentence alignment method for bilingual parallel corpus to improve the precision of sentence alignment of bilingual parallel corpus.
The application provides a sentence alignment method for a bilingual parallel corpus, comprising the following steps:
A. acquiring a bilingual probability distribution dictionary containing word translation pairs of a source language and a target language together with their word translation probabilities;
B. constructing a dynamic programming matrix according to the numbers of source-language and target-language sentences of the text to be aligned;
determining, from the dynamic programming matrix and the bilingual probability distribution dictionary, evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability;
C. determining, from the evaluation scores, the alignment path of each alignment pattern whose evaluation score exceeds a specified threshold;
D. determining, from those alignment paths, the alignment path sequence of the source-language and target-language sentences of the text to be aligned.
According to the method, a dynamic programming matrix is constructed from the sentence lengths, word counts, and total sentence counts of the source language and target language to be aligned, and, together with the bilingual probability distribution dictionary, yields evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; alignment paths are then obtained from the evaluation scores; finally, the alignment path sequence of the source-language and target-language sentences is obtained from the alignment paths. Because the application draws on multiple alignment parameters, it aligns better than the prior art. Here the sentence length information comprises the number of characters contained in the source-language and target-language sentences (where a character may be a word, a term, or a single character). The word information comprises the distribution of the word set (the set of words with duplicates removed) in the source-language and target-language sentences, the number of words in each sentence set's word set (duplicates removed), the total word count of the source-language sentence set, and the frequency of the current source-language word in its sentence set's word set. A sentence set here means that, for example, under an alignment pattern of 2 source-language sentences to 3 target-language sentences, the source-language sentence set is the 2 sentences and the target-language sentence set is the 3 sentences.
Preferably, step A further comprises:
converting the words of the source language and the target language in the bilingual probability distribution dictionary into numeric form for storage;
and numbering the sentences of the source language and the target language of the text to be aligned in sentence order, and numbering the words after word segmentation according to their numbers in the bilingual probability distribution dictionary.
Numbering sentences and words in this way facilitates subsequent calculation and sentence alignment.
Preferably, step B includes:
B1, constructing a dynamic programming matrix M_Align according to the numbers of source-language and target-language sentences of the text to be aligned, wherein:
M_Align = [cell_ij]_{n×m}
cell_ij represents an element of the matrix, n is the number of source-language sentences, and m is the number of target-language sentences; cell_ij is a triple (score, lang1_path, lang2_path), where score records the alignment-pattern evaluation score at that position, lang1_path records the alignment path of the source language at that position, and lang2_path records the alignment path of the target language at that position; an alignment path records an alignment pattern and the corresponding sentence identifiers of the source language and the target language;
b2, setting a two-dimensional window smaller than nxm;
b3, moving the two-dimensional window in the source language sentence set to be aligned and the sentence set of the target language;
and respectively corresponding to the source language sentence set covered by each window obtained by the mobile window and the sentence subset of the target language, and calculating evaluation scores of the source language sentence set and the target language sentence subset in the window range under different alignment modes according to the sentence length, the word set, the word number of the sentence set, the word occurrence frequency and the bilingual probability distribution dictionary of the text to be aligned in the window.
Thus, by constructing the dynamic programming matrix and computing evaluation scores for different alignment patterns within a window covering a specified number of source-language and target-language sentences, the sentence alignment quality under the different alignment patterns can be assessed more effectively.
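As a sketch, the triple-valued matrix of step B1 might be initialised as follows (Python is used for illustration; the field names mirror the triple (score, lang1_path, lang2_path) described above, and the function name is ours, not the patent's):

```python
# Build the alignment dynamic-programming matrix M_Align for n source
# sentences and m target sentences. Each cell holds the triple
# (score, lang1_path, lang2_path), initialised empty.
def build_alignment_matrix(n, m):
    return [[{"score": 0.0, "lang1_path": [], "lang2_path": []}
             for _ in range(m)] for _ in range(n)]

# The running example in the detailed description: n = 10, m = 8.
M_align = build_alignment_matrix(10, 8)
```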
Preferably, the evaluation scores under the different alignment patterns in step B3 are calculated as follows:
[equation image not reproduced: the match_score evaluation formula; its terms are defined below]
length_penalty_sentence is the sentence length penalty; penalty_matrix is the sentence length penalty coefficient; xtokens is the word-set distribution of the source-language sentences and ytokens the word-set distribution of the target-language sentences; l_xtokens is the word count of the source-language sentence set's word set and l_ytokens the word count of the target-language sentence subset's word set; X_wc is the total word count of the source-language sentence set; xfreq and yfreq are the frequencies of the current source-language word and target-language word in their respective sentence word sets; and y_wfreq is the translation probability, in the bilingual probability distribution dictionary, of the current word in the loop.
Therefore, the evaluation scores under different alignment modes can be accurately obtained through the calculation formula.
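Since the exact formula survives only as an image in the source, the following is a speculative sketch that merely combines the ingredients the text names (word overlap weighted by inter-translation probability, an alignment-pattern prior, and a sentence length penalty); it is not the patented formula, and all names and values are illustrative:

```python
# Speculative evaluation-score sketch: sum the dictionary's translation
# probabilities over source/target word pairs, normalise by the two
# vocabulary sizes, then weight by the pattern prior and the sentence
# length penalty. Not the patent's match_score formula.
def match_score(xtokens, ytokens, trans_prob, prior, length_penalty):
    overlap = sum(trans_prob.get((x, y), 0.0)
                  for x in xtokens for y in ytokens)
    denom = len(xtokens) + len(ytokens)
    return (overlap / denom) * prior * length_penalty if denom else 0.0

# Hypothetical single-pair example with one dictionary entry.
score = match_score({"abandon"}, {"fangqi"},
                    {("abandon", "fangqi"): 0.1},
                    prior=1.0, length_penalty=1.0)
```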
Preferably, the two-dimensional window is a window of size 5 × 5;
the sentence length penalty coefficient penalty_matrix is:
[matrix image not reproduced: the penalty coefficient matrix with elements P_ab]
P_ab denotes the sentence length penalty coefficient under the alignment pattern of a source-language sentences to b target-language sentences.
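Because the coefficient values themselves are given only as an image, the structure of the P_ab lookup can be illustrated with a table of invented placeholder values (the 12 pattern keys follow the pattern list in the detailed description; the numbers are not the patent's statistics):

```python
# Sentence-length penalty coefficients P_ab, keyed by alignment
# pattern (a, b) = (source sentences, target sentences).
# All numeric values below are illustrative placeholders only.
penalty_matrix = {
    (1, 0): 0.01, (0, 1): 0.01, (1, 1): 0.89,
    (1, 2): 0.02, (2, 1): 0.02, (2, 2): 0.01,
    (1, 3): 0.005, (3, 1): 0.005, (2, 3): 0.005,
    (3, 2): 0.005, (1, 4): 0.005, (4, 1): 0.005,
}

def length_coeff(a, b):
    # Patterns outside the 12 listed ones (e.g. 4-3) are not used
    # under the 5x5 window, hence a zero coefficient.
    return penalty_matrix.get((a, b), 0.0)
```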
Preferably, the sentence length penalty length_penalty_sentence is calculated as follows:
[equation image not reproduced: the length_penalty_sentence formula]
wherein xlen represents the length of the source-language sentence subset in the window under the current pattern, ylen represents the length of the target-language sentence subset in the window under the current pattern, and M and N represent weighting ratios; L is the standard aligned-sentence-length critical value; and c is the average number of target-language characters corresponding to one source-language character.
The sentence length penalty can thus be obtained accurately from this formula.
Preferably, the method further comprises:
and under the same two-dimensional window, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value to record.
Therefore, the same sentence can be prevented from being repeatedly appeared in different sentence pairs, so that the same sentence is prevented from corresponding to a plurality of paths.
Preferably, the method further comprises:
and under different two-dimensional windows, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value to record.
Therefore, the same sentence can be prevented from being repeatedly appeared in different sentence pairs, so that the same sentence is prevented from corresponding to a plurality of paths.
Herein, a sentence pair refers to a pair of a source language sentence (one or more sentences) and a target language sentence (one or more sentences) in an alignment manner when the source language sentence and the target language sentence are aligned, and a sentence pair may include a source language sentence and a target language sentence, or a plurality of source language sentences and a plurality of target language sentences, or a plurality of source language sentences and a target language sentence, or a source language sentence and a plurality of target language sentences.
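The de-duplication rule above admits a simple greedy realisation, keeping the highest-scoring pair and discarding any later pair that reuses one of its sentences (a sketch, not the patent's algorithm; the pair identifiers and scores are hypothetical):

```python
# Keep only the best-scoring sentence pair per sentence identifier.
# candidates: list of (src_sentence_ids, tgt_sentence_ids, score).
def keep_best_pairs(candidates):
    best = {}
    for src, tgt, score in sorted(candidates, key=lambda c: -c[2]):
        keys = [("s", i) for i in src] + [("t", j) for j in tgt]
        # Accept the pair only if none of its sentences is taken yet.
        if all(k not in best for k in keys):
            for k in keys:
                best[k] = (tuple(src), tuple(tgt), score)
    return sorted(set(best.values()), key=lambda p: -p[2])

# Hypothetical scores for the S5/T4' example of the description.
candidates = [
    ((5,), (4,), 0.9),          # S5 with T4'
    ((5,), (5,), 0.4),          # S5 with T5'
    ((5,), (4, 5), 0.5),        # S5 with T4' and T5'
    ((5, 6), (4, 5, 6), 0.3),   # S5, S6 with T4', T5', T6'
]
kept = keep_best_pairs(candidates)
```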
In summary, a dynamic programming matrix is constructed from the sentence lengths, word counts, and total sentence counts of the source language and target language to be aligned, and, together with the bilingual probability distribution dictionary, yields evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; alignment paths are then obtained from the evaluation scores, and finally the alignment path sequence of the source-language and target-language sentences is obtained from the alignment paths. Compared with the prior art, the application achieves a better alignment effect.
Drawings
Fig. 1 is a schematic flow chart illustrating an automatic sentence alignment method for bilingual parallel corpus according to the present application.
Detailed Description
Bilingual parallel corpus sentence alignment, i.e., establishing a sentence-level alignment relationship between the two texts, means determining which sentence(s) in the source-language text and which sentence(s) in the target-language text are translations of each other.
For example, let S be the original text and T the translated text, with S = s1 s2 … s_m and T = t1 t2 … t_n. We seek A = a1 a2 … a_r, where a_i = (s_j..s_k, t_p..t_q); that is, we seek a sequence of pairs of original-text source-language sentences and translated-text target-language sentences such that the original fragment s_j..s_k and the translated fragment t_p..t_q are translations of each other, with no finer sentence-level alignment inside the pair (aligned sentence pairs, or "sentence beads"). In most cases one original sentence corresponds to one translated sentence, and a sentence pair consists of one original sentence and one translated sentence. However, because the two languages differ in grammatical structure and expression, aligned sentence pairs may also be one-to-many, many-to-one, one-to-zero, zero-to-one, or many-to-many, which easily produces mismatches.
The invention is described in detail below with a concrete running example: assume the source language has 10 sentences (n = 10) and the target language has 8 sentences (m = 8), where the 10 source-language sentences to be aligned are labeled S1, S2, …, S10, and the 8 target-language sentences to be aligned are labeled T1', T2', …, T8'.
The embodiment of the application provides a sentence alignment method for bilingual parallel corpora that fully considers sentence length (measured either by the number of characters or by the number of words a sentence contains), alignment pattern, word count, word translations in the bilingual word probability vocabulary, word inter-translation probability, and other factors, and solves bilingual sentence-level automatic alignment by dynamic programming. Referring to the flow chart of fig. 1, the method comprises the following steps:
s101, a bilingual probability distribution dictionary is obtained by using a bilingual probability distribution dictionary generating tool. For example, a bilingual word level alignment intermediate result is generated by using a berkley aligner word alignment tool, and the translation probability of the source language and the target language is further calculated according to the result, so as to obtain a bilingual probability distribution dictionary containing the word translation pairs of the source language and the target language and the word translation probability corresponding to the translation pairs. The following is an expression form of the english-chinese probability distribution dictionary:
[table image not reproduced: sample entries of the English-Chinese probability distribution dictionary]
the steps further include: and converting the words of the source language and the target language in the bilingual probability distribution dictionary into a number form for storage. For example, one expression form of a numeric numbered english-chinese probability distribution dictionary is as follows:
[table image not reproduced: the numbered English-Chinese probability distribution dictionary]
As shown above, the words abandon, abc, abduct, and abide are numbered 1, 2, 3, and 4, and their possible Chinese translations are numbered -1 through -16; the probability of a given source-language word translating into each of its target-language candidates is stored alongside, for example, the probability that "abandon" translates into one Chinese candidate (say 5%) and into another (say 10%) is stored as a corresponding record, forming a probability vocabulary expressed in numeric numbers.
Next, segment all sentences to be aligned into words using a word segmentation service. For example, English can be segmented on spaces, and Chinese with a segmentation program. All sentences and words are then converted into numbers on the basis of the segmented sentences: sentences are numbered in sentence order, and words are mapped to numbers through the probability distribution vocabulary. That is, source-language words are numbered with the word numbers "1, 2, 3, 4, …" of step S101, and target-language words with the word numbers "-1 to -16, …" of step S101. If a word is not found in the probability distribution vocabulary, it is added to the vocabulary and given a number.
In this method, word segmentation engines for the source language and the target language are used to segment sentences into word sets.
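The numbering scheme just described, positive ids for source-language words, negative ids for target-language words, and fresh ids appended for out-of-vocabulary words, can be sketched as follows (a minimal illustration; function and variable names are ours):

```python
# Number sentences in order and map words to the integer ids of the
# numbered probability dictionary. Words missing from the vocabulary
# are appended and given a fresh id with the language's sign.
def encode(sentences, vocab, sign=1):
    encoded = []
    for sent_no, words in enumerate(sentences, start=1):
        ids = []
        for w in words:
            if w not in vocab:
                nxt = max((abs(v) for v in vocab.values()), default=0) + 1
                vocab[w] = sign * nxt
            ids.append(vocab[w])
        encoded.append((sent_no, ids))
    return encoded

# Source words use positive ids, as in the dictionary example above;
# "abode" is a hypothetical out-of-vocabulary word.
src_vocab = {"abandon": 1, "abc": 2, "abduct": 3, "abide": 4}
src_ids = encode([["abandon", "abode"]], src_vocab, sign=1)
# Target words use negative ids.
tgt_ids = encode([["fangqi"]], {}, sign=-1)
```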
S102, construct a dynamic programming matrix from the total number of source-language sentences in the original text to be aligned, the total number of target-language sentences in the translation, the word inter-translation pairs, and the corresponding word inter-translation probabilities, and obtain the evaluation scores under different sentence alignment patterns based on sentence length, words, and word inter-translation probability.
Specifically, the method comprises the following substeps:
s1021, initializing an alignment dynamic programming matrix M according to the number of source language sentences and the number of target language sentences in the original text to be alignedAlignWherein
Figure BDA0001318070030000091
cellijOne element in the representation matrix, n being the number of source language sentences and m being the number of target language sentences. cellijIs a triple (score, lang1_ path, lang2_ path), where score is the alignment pattern evaluation score at the position (see the description of calculating match _ score (xtoken, ytokens) below, where the value of match _ score (xtkens, ytokens) is the value of score here), lang1_ path is the alignment path of the source language at the position, and lang2_ path is the alignment path of the target language at the position. The alignment path describes an alignment mode described later and corresponding sentence identifiers in the source language and the target language (for the alignment path, see the following description for details).
In this example, n is 10 and m is 8, so
M_Align is thus a 10 × 8 matrix [cell_ij].
S1022, set the number of horizontal loop iterations to a value smaller than the total number m of target-language sentences, and execute S1023 to S1025 that many times, incrementing j by 1 each time starting from 1; that is, j runs from 1 up to the calculated value.
In this example, the number of horizontal iterations is set to 3, computed as m - 5 = 3 (the 5 being the 5 × 5 window size described later), with m = 8 as stated above.
S1023, set the number of vertical loop iterations to a value smaller than the total number n of source-language sentences, and execute S1024 to S1025 that many times, incrementing i by 1 each time starting from 1; that is, i runs from 1 up to the calculated value.
In this example, the number of vertical iterations is set to 5, computed as n - 5 = 5 (the 5 in "n - 5" being the 5 × 5 window size described later), with n = 10 as stated above.
S1024, set a window of size 5 × 5 over the source-language sentences of the original text and the target-language sentences of the translation to be aligned, and determine the current window by taking the i-th sentence of the source language and the j-th sentence of the target language as the window's starting point.
Mapping the source-language sentences of the original text and the target-language sentences of the translation onto the matrix, the currently determined 5 × 5 window covers the following range:
[matrix image not reproduced: the cell range of the current window]
That is, the source-language sentence subset includes the i-th through (i+5)-th sentences, and the target-language sentence subset includes the j-th through (j+5)-th sentences.
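Under the document's own convention that the window anchored at (i, j) covers source sentences i through i+5 and target sentences j through j+5 inclusive, window extraction can be sketched as follows (function name and list labels are illustrative):

```python
# Extract the sentence subsets covered by the window anchored at
# 1-based positions (i, j): source sentences i..i+5 and target
# sentences j..j+5, inclusive, matching the worked example below.
def window_at(src_sents, tgt_sents, i, j):
    return src_sents[i - 1:i + 5], tgt_sents[j - 1:j + 5]

src = [f"S{k}" for k in range(1, 11)]          # S1..S10, n = 10
tgt = [f"T{k}'" for k in range(1, 9)]          # T1'..T8', m = 8
win_src, win_tgt = window_at(src, tgt, 3, 2)   # the i = 3, j = 2 case
```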
Continuing the example, when the loop reaches j = 2 and i = 3, the 5 × 5 window corresponds to the range:
[matrix image not reproduced: the window range for i = 3, j = 2]
That is, the source-language sentence set includes sentences 3 through 8, i.e., S3, S4, …, S8; the target-language sentence set includes sentences 2 through 7, i.e., T2', T3', …, T7'.
S1025, calculate the evaluation scores of the source-language sentence set and the target-language sentence set in the current window range under each alignment pattern. Specifically:
First, the alignment patterns: an alignment pattern is related to the window size and is obtained statistically. Sentence alignment under a window of size 5 × 5 may include the following alignment patterns:
(1) 1-0: 1 source-language sentence to 0 target-language sentences;
(2) 0-1: 0 source-language sentences to 1 target-language sentence;
(3) 1-1: 1 source-language sentence to 1 target-language sentence;
(4) 1-2: 1 source-language sentence to 2 target-language sentences;
(5) 2-1: 2 source-language sentences to 1 target-language sentence;
(6) 2-2: 2 source-language sentences to 2 target-language sentences;
(7) 1-3: 1 source-language sentence to 3 target-language sentences;
(8) 3-1: 3 source-language sentences to 1 target-language sentence;
(9) 2-3: 2 source-language sentences to 3 target-language sentences;
(10) 3-2: 3 source-language sentences to 2 target-language sentences;
(11) 1-4: 1 source-language sentence to 4 target-language sentences;
(12) 4-1: 4 source-language sentences to 1 target-language sentence.
As can be seen, alignment patterns such as 4-3 are not included in this example, because patterns with a statistically low probability of occurring under a 5 × 5 window are not selected for use.
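The twelve patterns can be held as a simple list of (a, b) pairs (a representation choice of ours, not the patent's):

```python
# The 12 alignment patterns used under the 5x5 window, written as
# (a, b) = (source sentences, target sentences). Rare patterns such
# as 4-3 are deliberately excluded, as explained above.
ALIGN_PATTERNS = [
    (1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2),
    (1, 3), (3, 1), (2, 3), (3, 2), (1, 4), (4, 1),
]
```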
Here, the role of the evaluation score is restated: the higher the evaluation score of an alignment pattern, the more likely that pattern is present, i.e., the more likely it is to be selected in step S103 described later.
In this step, the evaluation score in each alignment mode in the current window is calculated by the following function:
[equation image not reproduced: the match_score(xtokens, ytokens) evaluation function]
wherein length_penalty_sentence is the sentence length penalty (described below); penalty_matrix is the sentence length penalty coefficient (described below); xtokens is the word-set distribution (duplicates removed) of the source-language sentences under the window; ytokens is the word-set distribution (duplicates removed) of the target-language sentences under the window; l_xtokens is the number of words (after duplicate removal) of the source-language sentence set under the window; l_ytokens is the number of words (after duplicate removal) of the target-language sentence subset under the window; X_wc is the total word count of the source-language sentences under the window (duplicates counted); xfreq is the frequency of the current source-language word under the window in its sentence word set xtokens; yfreq is the frequency of the current target-language word under the window in its sentence word set ytokens; and y_wfreq is the translation probability of the current target-language word in the bilingual probability distribution dictionary. If the source language or target language is Chinese, for example, a word may be a single character.
Regarding the sentence length penalty coefficient penalty_matrix: corresponding to the sentence alignment patterns above, the invention defines a 5 × 5 sentence-length penalty matrix whose elements each represent the penalty coefficient of one alignment pattern, expressed as follows:
[matrix image not reproduced: the 5 × 5 penalty coefficient matrix]
Here P10, P11, P12, P13, …, P41 are written generically as P_ab, where a and b are the numbers of source-language and target-language sentences of an alignment pattern under the 5 × 5 window, and P_ab is the sentence length penalty coefficient of alignment pattern match(a-b). The length penalty coefficient depends on the alignment pattern: it expresses the probability of the pattern occurring in manually aligned corpora (also called the prior probability) and is generally a constant. The occurrence probability prob(match) of an alignment pattern can be counted from a manually aligned parallel corpus; that is, the probabilities of the 12 alignment patterns can be counted in advance over, say, ten thousand manually aligned sentence sets (for example, Chinese aligned with English).
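Counting the pattern priors from a manually aligned corpus, as just described, amounts to frequency normalisation; the observations below are invented for illustration:

```python
# Estimate the prior probability prob(a-b) of each alignment pattern
# by counting its occurrences in a manually aligned corpus and
# normalising. Each observation is an (a, b) pattern instance.
from collections import Counter

def pattern_priors(observed_patterns):
    counts = Counter(observed_patterns)
    total = sum(counts.values())
    return {pat: cnt / total for pat, cnt in counts.items()}

# Hypothetical corpus: 90 one-to-one pairs, 6 one-to-two, 4 two-to-one.
priors = pattern_priors([(1, 1)] * 90 + [(1, 2)] * 6 + [(2, 1)] * 4)
```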
Wherein a sentence length penalty length _ penalty is definedsentenceThe calculation formula of (2) is as follows:
Figure BDA0001318070030000121
wherein, xlen represents PabCorresponding Source language sentence subset Length in Window in aligned mode (e.g., under 5 × 5 Window, PabWhen a is 2 and b is 3, the alignment is 2-3, and xlen is 2) in the alignment mode, ylen represents PabThe length of the subset of target language sentences in the window in the corresponding alignment mode, M, N, represents the weighted weight ratio, which is an empirical value. And L is a standard alignment sentence length critical value, namely the L can be used as an expected value of the length of the aligned sentence set and is obtained through statistics. Is the average number of characters corresponding to a source language character in a target language, (where the source language and the target language are words)The characters in the symbol ratio may be words, terms, or characters, for example, if the source language is french and the target language is english, the characters may be english characters corresponding to french or words corresponding to english, and if the source language is chinese and the target language is english, the characters may be one chinese character corresponding to several english characters, one chinese character corresponding to several english words, one chinese term corresponding to several english characters, or one chinese term corresponding to several english words).
c can be obtained statistically as follows:
[equation image not reproduced: definition of the character-ratio random variable]
which obeys a normal distribution. Statistics over large corpora show that, for language l1, the number C of corresponding characters in language l2 is a random variable following the normal distribution N(c, s²); the random variable above is defined on this basis. Here l1 and l2 are the two languages, and the two parameters c and s² of the normal distribution can be obtained by nonlinear regression from sampling statistics of the corpus.
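As an illustration of estimating the two parameters (the text proposes nonlinear regression; a plain moment estimate over hypothetical character-count pairs is sketched here instead, so the values are not the patent's):

```python
# Fit the normal distribution N(c, s^2) of the target-to-source
# character-count ratio by sample mean and variance over corpus
# pairs (source_char_count, target_char_count).
def fit_char_ratio(pairs):
    ratios = [t / s for s, t in pairs]
    c = sum(ratios) / len(ratios)                         # mean ratio
    s2 = sum((r - c) ** 2 for r in ratios) / len(ratios)  # variance
    return c, s2

# Hypothetical observations of (source chars, target chars).
c, s2 = fit_char_ratio([(10, 20), (8, 16), (5, 12)])
```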
The evaluation-score formula computes the evaluation score of each alignment pattern in the current window; in the course of this computation it implicitly computes the accuracy values of the different sentence alignments under each pattern in the window (both those exceeding the set threshold and those below it), and records each sentence pair whose accuracy value exceeds the set threshold together with that accuracy value.
For example, when the loop reaches j = 2 and i = 3, suppose that under alignment pattern 1-1 the sentence pairs whose accuracy values exceed the set threshold are S5 with T4', S6 with T5', and S7 with T6'; the corresponding sentence pair identifiers are then recorded.
If the same sentence appears in multiple sentence pairs within the same window (i.e., within this single loop iteration of this step) or across different windows (see step S103 for details), the sentence pair with the higher accuracy value is selected and recorded.
For example, suppose a window contains the sentence pairs S5 with T4', S5 with T5', S5 with (T4' and T5'), and (S5 and S6) with (T4', T5', and T6'); all four pairs contain S5, so only the pair with the highest alignment accuracy is retained; in this embodiment, S5 with T4' is retained.
Continuing the specific example, when the loop reaches j = 2 and i = 3, each 5 × 5 window has a corresponding score matrix:

[score-matrix image: Figure BDA0001318070030000131]
That is, when the source language sentence set contains sentences S3, S4, …, S8 and the target language sentence set contains sentences T2', T3', …, T7', assume the evaluation scores of this step S1026 in each of the above alignment modes are as follows:
in alignment mode 1-0, the evaluation score obtained is 0.167; the source-to-target combination in this mode is S8 aligned to 0 (i.e., to no target sentence);
in alignment mode 0-1, the evaluation score obtained is 0.167; the combination in this mode is 0 (no source sentence) aligned to T2';
in alignment mode 1-1, the evaluation score obtained is 0.5; the combinations in this mode are: S5 with T4'; S6 with T5'; S7 with T6';
in alignment mode 2-1, the evaluation score obtained is 0.167; the combination in this mode is: S3 and S4 with T3';
the evaluation scores in the remaining alignment modes are 0.
After the execution of S102 is completed (with its nested loops completed in order), the loop has run 3 × 5 = 15 times in this example where n = 10 and m = 8 (the counts being the values of j and i determined in steps S1022 and S1023), and the evaluation score for every alignment mode in every 5 × 5 window has been computed by the evaluation function. The alignment modes and the alignment accuracy values of the corresponding sentence pairs in each alignment mode are recorded.
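The window scan of step S102 can be sketched as follows. The scoring function here is a stand-in (the patent's actual evaluation formula uses sentence lengths, word sets and the bilingual probability dictionary), and the loop bounds j_steps and i_steps correspond to the values determined in steps S1022 and S1023; the mode list is an illustrative subset.

```python
WINDOW = 5
MODES = ["1-0", "0-1", "1-1", "2-1", "1-2", "2-2"]  # illustrative subset

def scan_windows(src_sents, tgt_sents, score_fn, j_steps, i_steps):
    """Slide a WINDOW x WINDOW window over the two sentence sets and
    compute the score of every alignment mode at every window position."""
    results = {}
    for j in range(j_steps):          # window offsets per S1022 / S1023
        for i in range(i_steps):
            src_win = src_sents[j:j + WINDOW]
            tgt_win = tgt_sents[i:i + WINDOW]
            results[(j, i)] = {m: score_fn(m, src_win, tgt_win)
                               for m in MODES}
    return results

src = [f"S{k}" for k in range(1, 11)]   # n = 10
tgt = [f"T{k}'" for k in range(1, 9)]   # m = 8
# dummy scorer; a real one would implement the evaluation formula
scores = scan_windows(src, tgt, lambda m, s, t: 0.0, j_steps=3, i_steps=5)
# 3 x 5 = 15 window positions, matching the example in the text
```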
S103, according to the evaluation scores recorded in each cell_ij of the matrix M_Align determined in S102, acquiring each recorded alignment path corresponding to the alignment modes whose evaluation scores are greater than a specified threshold.
For the optimal value in each 5 × 5 evaluation-score matrix obtained above, the alignment path is set in the dynamic programming matrix, i.e., the values of lang1_path and lang2_path in each cell triple (score, lang1_path, lang2_path). For example, for alignment mode 2-1, assuming the evaluation-function score in this mode is 0.167, the triple is set to

cell(score = 0.167, lang1_path = 2, lang2_path = 1).

At the same time, the alignment path records which two source language sentences are aligned with which target language sentence in alignment mode 2-1 (i.e., it records the sentence identifiers). For example, the source-to-target combination in this mode is S3 and S4 with T3'; accordingly, cell_33 and cell_43 of the aforementioned dynamic programming matrix are each set to cell(score = 0.167, lang1_path = 2, lang2_path = 1) and record the sentence pair S3, S4 with T3'.
For example, if the four sentence pairs containing S5, namely S5 with T4', S5 with T5', S5 with (T4' and T5'), and (S5 and S6) with (T4', T5' and T6'), appear across four 5 × 5 windows, the sentence pair with the highest sentence alignment accuracy value is retained. In this example S5 with T4' is retained, i.e., with the alignment mode 1-1 evaluation score of 0.5, cell_54 is set to cell(score = 0.5, lang1_path = 1, lang2_path = 1) and records the sentence pair S5 with T4'.
The other cell_ij entries are set accordingly, following the above example.
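The cell settings described above can be sketched with a small data structure. The field names follow the text; the dataclass container and the 0-based indexing (cell_33 in the 1-based text becomes M_align[2][2]) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    score: float = 0.0
    lang1_path: int = 0   # source sentences consumed at this step
    lang2_path: int = 0   # target sentences consumed at this step

n, m = 10, 8
M_align = [[Cell() for _ in range(m)] for _ in range(n)]

# mode 2-1 with score 0.167 at cell_33 and cell_43 (1-based in the text)
M_align[2][2] = Cell(0.167, 2, 1)
M_align[3][2] = Cell(0.167, 2, 1)
# mode 1-1 with score 0.5 at cell_54
M_align[4][3] = Cell(0.5, 1, 1)
```

In a full implementation each cell would also carry the recorded sentence identifiers (e.g. "S3, S4 with T3'") alongside the triple.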
From the above, the setting of each cell_ij of the dynamic programming matrix

[matrix image: Figure BDA0001318070030000151]

is completed.
S104, acquiring an alignment path sequence of the source language sentences and target language sentences to be aligned according to the alignment paths.
The alignment path sequence finally acquired is an array. For example, the 10 source language sentences to be aligned in the specific example are labeled S1, S2, S3, S4, S5, S6, S7, S8, S9 and S10, and the 8 target language sentences to be aligned are labeled T1', T2', T3', T4', T5', T6', T7' and T8'. Assume that alignment modes 1-1, 1-0 and 2-1, which have the higher evaluation scores, are obtained from the alignment paths, and that each alignment path also records exactly which source language sentence(s) and which target language sentence(s) are aligned in each mode. For example, the sentences aligned between the source and target languages in alignment mode 1-1 are: S1 with T1'; S2 with T2'; S5 with T4'; S6 with T5'; S7 with T6'; S9 with T7'; S10 with T8'. In alignment mode 1-0: S8 aligned to 0. In alignment mode 2-1: S3 and S4 with T3'. According to this information, the alignment path of each cell_ij in the dynamic programming matrix has already been set in step S103. The final alignment sequence of the source language sentences and target language sentences to be aligned can then be derived from the dynamic programming matrix as follows:
S1–T1'; S2–T2'; S3, S4–T3'; S5–T4'; S6–T5'; S7–T6'; S8–0; S9–T7'; S10–T8'

[alignment-sequence tables: Figure BDA0001318070030000152, Figure BDA0001318070030000161]
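The derived alignment sequence of the worked example, written as the array the text describes, might look like the following sketch; alignment to 0 (mode 1-0) is represented here by an empty target list, a representation chosen for illustration.

```python
# Final alignment sequence of the worked example: each entry pairs the
# aligned source sentence(s) with the aligned target sentence(s).
alignment_sequence = [
    (["S1"], ["T1'"]),           # mode 1-1
    (["S2"], ["T2'"]),           # mode 1-1
    (["S3", "S4"], ["T3'"]),     # mode 2-1
    (["S5"], ["T4'"]),           # mode 1-1
    (["S6"], ["T5'"]),           # mode 1-1
    (["S7"], ["T6'"]),           # mode 1-1
    (["S8"], []),                # mode 1-0 (aligned to 0)
    (["S9"], ["T7'"]),           # mode 1-1
    (["S10"], ["T8'"]),          # mode 1-1
]

# every source sentence is covered exactly once, in order
covered_src = [s for pair in alignment_sequence for s in pair[0]]
```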
In summary, a dynamic programming matrix is constructed from the sentence lengths, word counts and overall sentence ratio of the source and target languages to be aligned, together with the bilingual probability distribution dictionary; evaluation scores based on sentence length information, word information and word inter-translation probability are obtained for the different alignment modes; alignment paths are then acquired from the evaluation scores; and finally the alignment path sequence of the source language and target language sentences is acquired from the alignment paths. Compared with the prior art, this alignment method achieves a better alignment effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
For example, the window size used in the above steps may also be 4 × 4, 4 × 5, 6 × 6, etc.; the set of alignment modes is then changed correspondingly to match the window size.
In addition, even with the window size unchanged, different alignment modes can be selected according to the source and target languages. For example, with the 5 × 5 window of the example of the present invention, alignment modes such as 3-4 (3 source language sentences to 4 target language sentences) and 4-3 (4 source language sentences to 3 target language sentences) can be added, or one of the alignment modes listed in the above steps can be omitted. As described above, the alignment modes may be selected according to the statistical probability with which a given alignment mode occurs for the corresponding source and target languages.
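The configurability discussed above can be sketched as a small configuration object; the default window and mode list shown are illustrative, not the patent's exact set.

```python
# Illustrative aligner configuration: window size plus admissible
# alignment modes ("3-4" means 3 source sentences to 4 target sentences).
DEFAULT_CONFIG = {
    "window": (5, 5),
    "modes": ["1-0", "0-1", "1-1", "2-1", "1-2", "2-2"],
}

def with_extra_modes(config, extra):
    """Return a copy of config with language-pair specific modes added,
    leaving the default configuration untouched."""
    out = dict(config)
    out["modes"] = config["modes"] + [m for m in extra
                                      if m not in config["modes"]]
    return out

# e.g. a language pair where 3-4 and 4-3 alignments occur often enough
cfg = with_extra_modes(DEFAULT_CONFIG, ["3-4", "4-3"])
```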

Claims (8)

1. A sentence alignment method of bilingual parallel corpus is characterized by comprising the following steps:
A. acquiring a bilingual probability distribution dictionary containing a word translation pair and a word translation probability of a source language and a target language;
B. constructing a dynamic programming matrix according to the number of sentences of a source language and a target language of a text to be aligned;
determining evaluation scores based on sentence length information, word information and word inter-translation probability under different alignment modes according to the dynamic programming matrix and the bilingual probability distribution dictionary;
C. according to the evaluation score, determining an alignment path under an alignment mode with the evaluation score larger than a specified threshold;
D. determining an alignment path sequence of the source language sentences and target language sentences of the text to be aligned according to the alignment path.
2. The method of claim 1, wherein step A further comprises:
converting words of a source language and a target language in the bilingual probability distribution dictionary into a number form for storage;
and numbering the sentences of the source language and the target language of the text to be aligned according to the sequence of the sentences, and numbering the words after word segmentation according to the numbers in the bilingual probability distribution dictionary.
3. The method of claim 2, wherein step B comprises:
b1, constructing a dynamic programming matrix M_Align according to the numbers of sentences of the source language and the target language of the text to be aligned, wherein:

[matrix definition image: Figure FDA0002516701440000011]

cell_ij represents an element of the matrix, n is the number of source language sentences and m is the number of target language sentences; cell_ij is a triple (score, lang1_path, lang2_path), where score records the alignment mode evaluation score at the current position, lang1_path records the alignment path of the source language at that position, and lang2_path records the alignment path of the target language at that position; the alignment path records the alignment mode and the corresponding sentence identifiers of the source language and the target language;
b2, setting a two-dimensional window smaller than n × m;
b3, moving the two-dimensional window over the source language sentence set and the target language sentence set to be aligned;
and, for the source language and target language text to be aligned within the window, calculating the evaluation scores of the source language sentence set and the target language sentence set under the different alignment modes according to the sentence lengths, the word sets of the sentence sets, the word counts of the sentence sets, the word occurrence frequencies, and the bilingual probability distribution dictionary.
4. The method according to claim 3, wherein the evaluation scores under the different alignment modes in step b3 are calculated by the following formula:

[formula image: Figure FDA0002516701440000021]

where length_penalty_sentence is the sentence length penalty; penalty_matrix is the sentence length penalty coefficient; xtokens is the word set of the source language sentences and ytokens is the word set of the target language sentences; l_xtokens is the number of words in the source language sentence set and l_ytokens is the number of words in the target language sentence subset; X_wc is the total word count of the source language sentence set; xfreq and yfreq are the frequency values of the current source language word and the current target language word in the sentence set; and y_wfreq is the translation probability of the current source language word of the loop in the bilingual probability distribution dictionary.
5. The alignment method according to claim 4, wherein the two-dimensional window is a 5 × 5 window;
the sentence length penalty coefficient penalty_matrix is:

[matrix image: Figure FDA0002516701440000022]

where P_ab represents the sentence length penalty coefficient under the alignment mode of a source language sentences to b target language sentences.
6. The alignment method according to claim 4, wherein the sentence length penalty length_penalty_sentence is calculated by the following formula:

[formula image: Figure FDA0002516701440000031]

where xlen represents the length of the source language sentence subset in the window under the current mode, ylen represents the length of the target language sentence subset in the window under the current mode, M and N represent the weighting ratio, and L is the critical value of a standard aligned sentence-pair length; the remaining parameter (its symbol appears in the formula) is the average number of characters in the target language corresponding to one source language character.
7. The alignment method according to claim 3, further comprising:
under the same two-dimensional window, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value for recording.
8. The alignment method according to claim 3, further comprising:
under different two-dimensional windows, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value for recording.
CN201710433746.5A 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus Expired - Fee Related CN107391495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710433746.5A CN107391495B (en) 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus


Publications (2)

Publication Number Publication Date
CN107391495A CN107391495A (en) 2017-11-24
CN107391495B true CN107391495B (en) 2020-08-21

Family

ID=60332179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433746.5A Expired - Fee Related CN107391495B (en) 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus

Country Status (1)

Country Link
CN (1) CN107391495B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062910A (en) * 2018-07-26 2018-12-21 苏州大学 Sentence alignment method based on deep neural network
CN109344413B (en) * 2018-10-16 2022-05-20 北京百度网讯科技有限公司 Translation processing method, translation processing device, computer equipment and computer readable storage medium
CN109697287B (en) * 2018-12-20 2020-01-21 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and system
CN111178089B (en) * 2019-12-20 2023-03-14 沈阳雅译网络技术有限公司 Bilingual parallel data consistency detection and correction method
CN113642337B (en) * 2020-05-11 2023-12-19 阿里巴巴集团控股有限公司 Data processing method and device, translation method, electronic device, and computer-readable storage medium
CN112668307B (en) * 2020-12-30 2022-06-21 清华大学 Automatic bilingual sentence alignment method and device
CN115345127A (en) * 2022-06-08 2022-11-15 甲骨易(北京)语言科技股份有限公司 Parallel corpus sentence level alignment system and method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101667177A (en) * 2009-09-23 2010-03-10 清华大学 Method and device for aligning bilingual text
CN101989261A (en) * 2009-08-01 2011-03-23 中国科学院计算技术研究所 Method for extracting phrases of statistical machine translation
CN103942339A (en) * 2014-05-08 2014-07-23 深圳市宜搜科技发展有限公司 Synonym mining method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20080154577A1 (en) * 2006-12-26 2008-06-26 Sehda,Inc. Chunk-based statistical machine translation system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200119

Address after: 305, 3/F, China Tianli building, No. 56, Zhichun Road, Haidian District, Beijing 100086

Applicant after: Beijing Tongwen Century Technology Co., Ltd

Address before: 100086, Haidian District, Zhichun Road, Beijing, No. 56, China Tianli building, seventh floor

Applicant before: Beijing I Translation Technology Co., Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200821

Termination date: 20210609