CN107391495B - Sentence alignment method of bilingual parallel corpus - Google Patents


Info

Publication number: CN107391495B
Authority: CN (China)
Prior art keywords: sentence, alignment, word, source language, target language
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201710433746.5A
Other languages: Chinese (zh)
Other versions: CN107391495A
Inventors: 刘强 (Liu Qiang), 彭蓉 (Peng Rong)
Current Assignee: Beijing Tongwen Century Technology Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Beijing Tongwen Century Technology Co ltd
Application filed by Beijing Tongwen Century Technology Co ltd
Priority: CN201710433746.5A
Publication of application CN107391495A; application granted; publication of grant CN107391495B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/51: Translation evaluation
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a sentence alignment method for a bilingual parallel corpus, comprising the following steps: A. acquire a bilingual probability distribution dictionary containing word translation pairs of a source language and a target language together with their word translation probabilities; B. construct a dynamic programming matrix from the numbers of source-language and target-language sentences in the text to be aligned, and determine, from the dynamic programming matrix and the bilingual probability distribution dictionary, evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; C. determine, from the evaluation scores, the alignment path of each alignment pattern whose evaluation score exceeds a specified threshold; D. determine, from those alignment paths, the alignment path sequence of the source-language and target-language sentences of the text to be aligned. The sentence alignment method of the present application thereby helps improve the precision of automatic sentence alignment for bilingual parallel corpora.

Description

Sentence alignment method of bilingual parallel corpus
Technical Field
The invention relates to the technical field of language translation processing, in particular to a sentence alignment method for bilingual parallel corpora.
Background
Sentence alignment means determining which sentence(s) in the source-language text and which sentence(s) in the target-language text are translations of each other. The difficulty of sentence alignment is that the mapping between sentences in a bilingual text can be many-to-many, which easily produces mismatches.
Currently, sentence alignment methods in the prior art include methods based on sentence length, methods based on word or character-string alignment, methods based on offset position alignment, and the like. These methods rely on sentence length, sentence position, or the sentence-length ratio between the two languages. However, the above sentence alignment methods align poorly because each relies on a single kind of alignment parameter.
Therefore, a bilingual parallel corpus sentence alignment method is needed to improve the sentence alignment effect of the bilingual parallel corpus.
Disclosure of Invention
In view of the above, the present application provides a sentence alignment method for bilingual parallel corpus to improve the precision of sentence alignment of bilingual parallel corpus.
The application provides a sentence alignment method for a bilingual parallel corpus, comprising the following steps:
A. acquiring a bilingual probability distribution dictionary containing word translation pairs of a source language and a target language together with their word translation probabilities;
B. constructing a dynamic programming matrix according to the numbers of source-language and target-language sentences of the text to be aligned;
determining, from the dynamic programming matrix and the bilingual probability distribution dictionary, evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability;
C. determining, from the evaluation scores, the alignment path of each alignment pattern whose evaluation score exceeds a specified threshold;
D. determining, from those alignment paths, the alignment path sequence of the source-language and target-language sentences of the text to be aligned.
According to the method, a dynamic programming matrix is constructed from the sentence lengths, word counts, and total sentence counts of the source language and target language to be aligned, and, together with the bilingual probability distribution dictionary, yields evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; alignment paths are then obtained from the evaluation scores; finally, the alignment path sequence of the source-language and target-language sentences is obtained from the alignment paths. Because the application draws on multiple alignment parameters, it aligns better than the prior art. Here the sentence length information comprises the number of characters contained in the source-language and target-language sentences (where a character may be a word, a term, or a single character). The word information comprises the distribution of the word set (the set of words with duplicates removed) in the source-language and target-language sentences, the number of words in each sentence set's word set (duplicates removed), the total word count of the source-language sentence set, and the frequency of the current source-language word in its sentence set's word set. A sentence set here means that, for example, under an alignment pattern of 2 source-language sentences to 3 target-language sentences, the source-language sentence set is the 2 sentences and the target-language sentence set is the 3 sentences.
Preferably, step A further comprises:
converting the words of the source language and the target language in the bilingual probability distribution dictionary into numeric form for storage;
and numbering the sentences of the source language and the target language of the text to be aligned in sentence order, and numbering the words after word segmentation according to their numbers in the bilingual probability distribution dictionary.
Numbering sentences and words in this way facilitates subsequent calculation and sentence alignment.
Preferably, step B includes:
B1, constructing a dynamic programming matrix M_Align according to the numbers of source-language and target-language sentences of the text to be aligned, wherein:
M_Align = [cell_ij]_{n×m}
cell_ij represents an element of the matrix, n is the number of source-language sentences, and m is the number of target-language sentences; cell_ij is a triple (score, lang1_path, lang2_path), where score records the alignment-pattern evaluation score at that position, lang1_path records the alignment path of the source language at that position, and lang2_path records the alignment path of the target language at that position; an alignment path records an alignment pattern and the corresponding sentence identifiers of the source language and the target language;
b2, setting a two-dimensional window smaller than nxm;
b3, moving the two-dimensional window in the source language sentence set to be aligned and the sentence set of the target language;
and respectively corresponding to the source language sentence set covered by each window obtained by the mobile window and the sentence subset of the target language, and calculating evaluation scores of the source language sentence set and the target language sentence subset in the window range under different alignment modes according to the sentence length, the word set, the word number of the sentence set, the word occurrence frequency and the bilingual probability distribution dictionary of the text to be aligned in the window.
Thus, by constructing the dynamic programming matrix and computing evaluation scores for different alignment patterns within a window covering a specified number of source-language and target-language sentences, the sentence alignment quality under the different alignment patterns can be assessed more effectively.
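As a sketch, the triple-valued matrix of step B1 might be initialised as follows (Python is used for illustration; the field names mirror the triple (score, lang1_path, lang2_path) described above, and the function name is ours, not the patent's):

```python
# Build the alignment dynamic-programming matrix M_Align for n source
# sentences and m target sentences. Each cell holds the triple
# (score, lang1_path, lang2_path), initialised empty.
def build_alignment_matrix(n, m):
    return [[{"score": 0.0, "lang1_path": [], "lang2_path": []}
             for _ in range(m)] for _ in range(n)]

# The running example in the detailed description: n = 10, m = 8.
M_align = build_alignment_matrix(10, 8)
```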
Preferably, the evaluation scores under the different alignment patterns in step B3 are calculated as follows:
[equation image not reproduced: the match_score evaluation formula; its terms are defined below]
length_penalty_sentence is the sentence length penalty; penalty_matrix is the sentence length penalty coefficient; xtokens is the word-set distribution of the source-language sentences and ytokens the word-set distribution of the target-language sentences; l_xtokens is the word count of the source-language sentence set's word set and l_ytokens the word count of the target-language sentence subset's word set; X_wc is the total word count of the source-language sentence set; xfreq and yfreq are the frequencies of the current source-language word and target-language word in their respective sentence word sets; and y_wfreq is the translation probability, in the bilingual probability distribution dictionary, of the current word in the loop.
Therefore, the evaluation scores under different alignment modes can be accurately obtained through the calculation formula.
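Since the exact formula survives only as an image in the source, the following is a speculative sketch that merely combines the ingredients the text names (word overlap weighted by inter-translation probability, an alignment-pattern prior, and a sentence length penalty); it is not the patented formula, and all names and values are illustrative:

```python
# Speculative evaluation-score sketch: sum the dictionary's translation
# probabilities over source/target word pairs, normalise by the two
# vocabulary sizes, then weight by the pattern prior and the sentence
# length penalty. Not the patent's match_score formula.
def match_score(xtokens, ytokens, trans_prob, prior, length_penalty):
    overlap = sum(trans_prob.get((x, y), 0.0)
                  for x in xtokens for y in ytokens)
    denom = len(xtokens) + len(ytokens)
    return (overlap / denom) * prior * length_penalty if denom else 0.0

# Hypothetical single-pair example with one dictionary entry.
score = match_score({"abandon"}, {"fangqi"},
                    {("abandon", "fangqi"): 0.1},
                    prior=1.0, length_penalty=1.0)
```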
Preferably, the two-dimensional window is a window of size 5 × 5;
the sentence length penalty coefficient penalty_matrix is:
[matrix image not reproduced: the penalty coefficient matrix with elements P_ab]
P_ab denotes the sentence length penalty coefficient under the alignment pattern of a source-language sentences to b target-language sentences.
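Because the coefficient values themselves are given only as an image, the structure of the P_ab lookup can be illustrated with a table of invented placeholder values (the 12 pattern keys follow the pattern list in the detailed description; the numbers are not the patent's statistics):

```python
# Sentence-length penalty coefficients P_ab, keyed by alignment
# pattern (a, b) = (source sentences, target sentences).
# All numeric values below are illustrative placeholders only.
penalty_matrix = {
    (1, 0): 0.01, (0, 1): 0.01, (1, 1): 0.89,
    (1, 2): 0.02, (2, 1): 0.02, (2, 2): 0.01,
    (1, 3): 0.005, (3, 1): 0.005, (2, 3): 0.005,
    (3, 2): 0.005, (1, 4): 0.005, (4, 1): 0.005,
}

def length_coeff(a, b):
    # Patterns outside the 12 listed ones (e.g. 4-3) are not used
    # under the 5x5 window, hence a zero coefficient.
    return penalty_matrix.get((a, b), 0.0)
```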
Preferably, the sentence length penalty length_penalty_sentence is calculated as follows:
[equation image not reproduced: the length_penalty_sentence formula]
wherein xlen represents the length of the source-language sentence subset in the window under the current pattern, ylen represents the length of the target-language sentence subset in the window under the current pattern, and M and N represent weighting ratios; L is the standard aligned-sentence-length critical value; and c is the average number of target-language characters corresponding to one source-language character.
The sentence length penalty can thus be obtained accurately from this formula.
Preferably, the method further comprises:
and under the same two-dimensional window, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value to record.
Therefore, the same sentence can be prevented from being repeatedly appeared in different sentence pairs, so that the same sentence is prevented from corresponding to a plurality of paths.
Preferably, the method further comprises:
and under different two-dimensional windows, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value to record.
Therefore, the same sentence can be prevented from being repeatedly appeared in different sentence pairs, so that the same sentence is prevented from corresponding to a plurality of paths.
Herein, a sentence pair refers to a pair of a source language sentence (one or more sentences) and a target language sentence (one or more sentences) in an alignment manner when the source language sentence and the target language sentence are aligned, and a sentence pair may include a source language sentence and a target language sentence, or a plurality of source language sentences and a plurality of target language sentences, or a plurality of source language sentences and a target language sentence, or a source language sentence and a plurality of target language sentences.
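The de-duplication rule above admits a simple greedy realisation, keeping the highest-scoring pair and discarding any later pair that reuses one of its sentences (a sketch, not the patent's algorithm; the pair identifiers and scores are hypothetical):

```python
# Keep only the best-scoring sentence pair per sentence identifier.
# candidates: list of (src_sentence_ids, tgt_sentence_ids, score).
def keep_best_pairs(candidates):
    best = {}
    for src, tgt, score in sorted(candidates, key=lambda c: -c[2]):
        keys = [("s", i) for i in src] + [("t", j) for j in tgt]
        # Accept the pair only if none of its sentences is taken yet.
        if all(k not in best for k in keys):
            for k in keys:
                best[k] = (tuple(src), tuple(tgt), score)
    return sorted(set(best.values()), key=lambda p: -p[2])

# Hypothetical scores for the S5/T4' example of the description.
candidates = [
    ((5,), (4,), 0.9),          # S5 with T4'
    ((5,), (5,), 0.4),          # S5 with T5'
    ((5,), (4, 5), 0.5),        # S5 with T4' and T5'
    ((5, 6), (4, 5, 6), 0.3),   # S5, S6 with T4', T5', T6'
]
kept = keep_best_pairs(candidates)
```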
In summary, a dynamic programming matrix is constructed from the sentence lengths, word counts, and total sentence counts of the source language and target language to be aligned, and, together with the bilingual probability distribution dictionary, yields evaluation scores under different alignment patterns based on sentence length information, word information, and word inter-translation probability; alignment paths are then obtained from the evaluation scores, and finally the alignment path sequence of the source-language and target-language sentences is obtained from the alignment paths. Compared with the prior art, the application achieves a better alignment effect.
Drawings
Fig. 1 is a schematic flow chart illustrating an automatic sentence alignment method for bilingual parallel corpus according to the present application.
Detailed Description
Bilingual parallel corpus sentence alignment, i.e., establishing a sentence-level alignment relationship between the two texts, means determining which sentence(s) in the source-language text and which sentence(s) in the target-language text are translations of each other.
For example, let S be the original text and T the translated text, with S = s1 s2 … s_m and T = t1 t2 … t_n. We seek A = a1 a2 … a_r, where a_i = (s_j..s_k, t_p..t_q); that is, we seek a sequence of pairs of original-text source-language sentences and translated-text target-language sentences such that the original fragment s_j..s_k and the translated fragment t_p..t_q are translations of each other, with no finer sentence-level alignment inside the pair (aligned sentence pairs, or "sentence beads"). In most cases one original sentence corresponds to one translated sentence, and a sentence pair consists of one original sentence and one translated sentence. However, because the two languages differ in grammatical structure and expression, aligned sentence pairs may also be one-to-many, many-to-one, one-to-zero, zero-to-one, or many-to-many, which easily produces mismatches.
The invention is described in detail below with a concrete running example: assume the source language has 10 sentences (n = 10) and the target language has 8 sentences (m = 8), where the 10 source-language sentences to be aligned are labeled S1, S2, …, S10, and the 8 target-language sentences to be aligned are labeled T1', T2', …, T8'.
The embodiment of the application provides a sentence alignment method for bilingual parallel corpora that fully considers sentence length (measured either by the number of characters or by the number of words a sentence contains), alignment pattern, word count, word translations in the bilingual word probability vocabulary, word inter-translation probability, and other factors, and solves bilingual sentence-level automatic alignment by dynamic programming. Referring to the flow chart of fig. 1, the method comprises the following steps:
s101, a bilingual probability distribution dictionary is obtained by using a bilingual probability distribution dictionary generating tool. For example, a bilingual word level alignment intermediate result is generated by using a berkley aligner word alignment tool, and the translation probability of the source language and the target language is further calculated according to the result, so as to obtain a bilingual probability distribution dictionary containing the word translation pairs of the source language and the target language and the word translation probability corresponding to the translation pairs. The following is an expression form of the english-chinese probability distribution dictionary:
[table image not reproduced: sample entries of the English-Chinese probability distribution dictionary]
the steps further include: and converting the words of the source language and the target language in the bilingual probability distribution dictionary into a number form for storage. For example, one expression form of a numeric numbered english-chinese probability distribution dictionary is as follows:
[table image not reproduced: the numbered English-Chinese probability distribution dictionary]
As shown above, the words abandon, abc, abduct, and abide are numbered 1, 2, 3, and 4, and their possible Chinese translations are numbered -1 through -16; the probability of a given source-language word translating into each of its target-language candidates is stored alongside, for example, the probability that "abandon" translates into one Chinese candidate (say 5%) and into another (say 10%) is stored as a corresponding record, forming a probability vocabulary expressed in numeric numbers.
Next, segment all sentences to be aligned into words using a word segmentation service. For example, English can be segmented on spaces, and Chinese with a segmentation program. All sentences and words are then converted into numbers on the basis of the segmented sentences: sentences are numbered in sentence order, and words are mapped to numbers through the probability distribution vocabulary. That is, source-language words are numbered with the word numbers "1, 2, 3, 4, …" of step S101, and target-language words with the word numbers "-1 to -16, …" of step S101. If a word is not found in the probability distribution vocabulary, it is added to the vocabulary and given a number.
In this method, word segmentation engines for the source language and the target language are used to segment sentences into word sets.
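The numbering scheme just described, positive ids for source-language words, negative ids for target-language words, and fresh ids appended for out-of-vocabulary words, can be sketched as follows (a minimal illustration; function and variable names are ours):

```python
# Number sentences in order and map words to the integer ids of the
# numbered probability dictionary. Words missing from the vocabulary
# are appended and given a fresh id with the language's sign.
def encode(sentences, vocab, sign=1):
    encoded = []
    for sent_no, words in enumerate(sentences, start=1):
        ids = []
        for w in words:
            if w not in vocab:
                nxt = max((abs(v) for v in vocab.values()), default=0) + 1
                vocab[w] = sign * nxt
            ids.append(vocab[w])
        encoded.append((sent_no, ids))
    return encoded

# Source words use positive ids, as in the dictionary example above;
# "abode" is a hypothetical out-of-vocabulary word.
src_vocab = {"abandon": 1, "abc": 2, "abduct": 3, "abide": 4}
src_ids = encode([["abandon", "abode"]], src_vocab, sign=1)
# Target words use negative ids.
tgt_ids = encode([["fangqi"]], {}, sign=-1)
```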
S102, construct a dynamic programming matrix from the total number of source-language sentences in the original text to be aligned, the total number of target-language sentences in the translation, the word inter-translation pairs, and the corresponding word inter-translation probabilities, and obtain the evaluation scores under different sentence alignment patterns based on sentence length, words, and word inter-translation probability.
Specifically, the method comprises the following substeps:
s1021, initializing an alignment dynamic programming matrix M according to the number of source language sentences and the number of target language sentences in the original text to be alignedAlignWherein
Figure BDA0001318070030000091
cellijOne element in the representation matrix, n being the number of source language sentences and m being the number of target language sentences. cellijIs a triple (score, lang1_ path, lang2_ path), where score is the alignment pattern evaluation score at the position (see the description of calculating match _ score (xtoken, ytokens) below, where the value of match _ score (xtkens, ytokens) is the value of score here), lang1_ path is the alignment path of the source language at the position, and lang2_ path is the alignment path of the target language at the position. The alignment path describes an alignment mode described later and corresponding sentence identifiers in the source language and the target language (for the alignment path, see the following description for details).
In this example, n is 10 and m is 8, so
M_Align is thus a 10 × 8 matrix [cell_ij].
S1022, set the number of horizontal loop iterations to a value smaller than the total number m of target-language sentences, and execute S1023 to S1025 that many times, incrementing j by 1 each time starting from 1; that is, j runs from 1 up to the calculated value.
In this example, the number of horizontal iterations is set to 3, computed as m - 5 = 3 (the 5 being the 5 × 5 window size described later), with m = 8 as stated above.
S1023, set the number of vertical loop iterations to a value smaller than the total number n of source-language sentences, and execute S1024 to S1025 that many times, incrementing i by 1 each time starting from 1; that is, i runs from 1 up to the calculated value.
In this example, the number of vertical iterations is set to 5, computed as n - 5 = 5 (the 5 in "n - 5" being the 5 × 5 window size described later), with n = 10 as stated above.
S1024, set a window of size 5 × 5 over the source-language sentences of the original text and the target-language sentences of the translation to be aligned, and determine the current window by taking the i-th sentence of the source language and the j-th sentence of the target language as the window's starting point.
Mapping the source-language sentences of the original text and the target-language sentences of the translation onto the matrix, the currently determined 5 × 5 window covers the following range:
[matrix image not reproduced: the cell range of the current window]
That is, the source-language sentence subset includes the i-th through (i+5)-th sentences, and the target-language sentence subset includes the j-th through (j+5)-th sentences.
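Under the document's own convention that the window anchored at (i, j) covers source sentences i through i+5 and target sentences j through j+5 inclusive, window extraction can be sketched as follows (function name and list labels are illustrative):

```python
# Extract the sentence subsets covered by the window anchored at
# 1-based positions (i, j): source sentences i..i+5 and target
# sentences j..j+5, inclusive, matching the worked example below.
def window_at(src_sents, tgt_sents, i, j):
    return src_sents[i - 1:i + 5], tgt_sents[j - 1:j + 5]

src = [f"S{k}" for k in range(1, 11)]          # S1..S10, n = 10
tgt = [f"T{k}'" for k in range(1, 9)]          # T1'..T8', m = 8
win_src, win_tgt = window_at(src, tgt, 3, 2)   # the i = 3, j = 2 case
```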
Continuing the example, when the loop reaches j = 2 and i = 3, the 5 × 5 window corresponds to the range:
[matrix image not reproduced: the window range for i = 3, j = 2]
That is, the source-language sentence set includes sentences 3 through 8, i.e., S3, S4, …, S8; the target-language sentence set includes sentences 2 through 7, i.e., T2', T3', …, T7'.
S1025, calculate the evaluation scores of the source-language sentence set and the target-language sentence set in the current window range under each alignment pattern. Specifically:
First, the alignment patterns: an alignment pattern is related to the window size and is obtained statistically. Sentence alignment under a window of size 5 × 5 may include the following alignment patterns:
(1) 1-0: 1 source-language sentence to 0 target-language sentences;
(2) 0-1: 0 source-language sentences to 1 target-language sentence;
(3) 1-1: 1 source-language sentence to 1 target-language sentence;
(4) 1-2: 1 source-language sentence to 2 target-language sentences;
(5) 2-1: 2 source-language sentences to 1 target-language sentence;
(6) 2-2: 2 source-language sentences to 2 target-language sentences;
(7) 1-3: 1 source-language sentence to 3 target-language sentences;
(8) 3-1: 3 source-language sentences to 1 target-language sentence;
(9) 2-3: 2 source-language sentences to 3 target-language sentences;
(10) 3-2: 3 source-language sentences to 2 target-language sentences;
(11) 1-4: 1 source-language sentence to 4 target-language sentences;
(12) 4-1: 4 source-language sentences to 1 target-language sentence.
As can be seen, alignment patterns such as 4-3 are not included in this example, because patterns with a statistically low probability of occurring under a 5 × 5 window are not selected for use.
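The twelve patterns can be held as a simple list of (a, b) pairs (a representation choice of ours, not the patent's):

```python
# The 12 alignment patterns used under the 5x5 window, written as
# (a, b) = (source sentences, target sentences). Rare patterns such
# as 4-3 are deliberately excluded, as explained above.
ALIGN_PATTERNS = [
    (1, 0), (0, 1), (1, 1), (1, 2), (2, 1), (2, 2),
    (1, 3), (3, 1), (2, 3), (3, 2), (1, 4), (4, 1),
]
```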
Here, the role of the evaluation score is restated: the higher the evaluation score of an alignment pattern, the more likely that pattern is present, i.e., the more likely it is to be selected in step S103 described later.
In this step, the evaluation score in each alignment mode in the current window is calculated by the following function:
[equation image not reproduced: the match_score(xtokens, ytokens) evaluation function]
wherein length_penalty_sentence is the sentence length penalty (described below); penalty_matrix is the sentence length penalty coefficient (described below); xtokens is the word-set distribution (duplicates removed) of the source-language sentences under the window; ytokens is the word-set distribution (duplicates removed) of the target-language sentences under the window; l_xtokens is the number of words (after duplicate removal) of the source-language sentence set under the window; l_ytokens is the number of words (after duplicate removal) of the target-language sentence subset under the window; X_wc is the total word count of the source-language sentences under the window (duplicates counted); xfreq is the frequency of the current source-language word under the window in its sentence word set xtokens; yfreq is the frequency of the current target-language word under the window in its sentence word set ytokens; and y_wfreq is the translation probability of the current target-language word in the bilingual probability distribution dictionary. If the source language or target language is Chinese, for example, a word may be a single character.
Regarding the sentence length penalty coefficient penalty_matrix: corresponding to the sentence alignment patterns above, the invention defines a 5 × 5 sentence-length penalty matrix whose elements each represent the penalty coefficient of one alignment pattern, expressed as follows:
[matrix image not reproduced: the 5 × 5 penalty coefficient matrix]
Here P10, P11, P12, P13, …, P41 are written generically as P_ab, where a and b are the numbers of source-language and target-language sentences of an alignment pattern under the 5 × 5 window, and P_ab is the sentence length penalty coefficient of alignment pattern match(a-b). The length penalty coefficient depends on the alignment pattern: it expresses the probability of the pattern occurring in manually aligned corpora (also called the prior probability) and is generally a constant. The occurrence probability prob(match) of an alignment pattern can be counted from a manually aligned parallel corpus; that is, the probabilities of the 12 alignment patterns can be counted in advance over, say, ten thousand manually aligned sentence sets (for example, Chinese aligned with English).
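Counting the pattern priors from a manually aligned corpus, as just described, amounts to frequency normalisation; the observations below are invented for illustration:

```python
# Estimate the prior probability prob(a-b) of each alignment pattern
# by counting its occurrences in a manually aligned corpus and
# normalising. Each observation is an (a, b) pattern instance.
from collections import Counter

def pattern_priors(observed_patterns):
    counts = Counter(observed_patterns)
    total = sum(counts.values())
    return {pat: cnt / total for pat, cnt in counts.items()}

# Hypothetical corpus: 90 one-to-one pairs, 6 one-to-two, 4 two-to-one.
priors = pattern_priors([(1, 1)] * 90 + [(1, 2)] * 6 + [(2, 1)] * 4)
```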
Wherein a sentence length penalty length _ penalty is definedsentenceThe calculation formula of (2) is as follows:
Figure BDA0001318070030000121
wherein, xlen represents PabCorresponding Source language sentence subset Length in Window in aligned mode (e.g., under 5 × 5 Window, PabWhen a is 2 and b is 3, the alignment is 2-3, and xlen is 2) in the alignment mode, ylen represents PabThe length of the subset of target language sentences in the window in the corresponding alignment mode, M, N, represents the weighted weight ratio, which is an empirical value. And L is a standard alignment sentence length critical value, namely the L can be used as an expected value of the length of the aligned sentence set and is obtained through statistics. Is the average number of characters corresponding to a source language character in a target language, (where the source language and the target language are words)The characters in the symbol ratio may be words, terms, or characters, for example, if the source language is french and the target language is english, the characters may be english characters corresponding to french or words corresponding to english, and if the source language is chinese and the target language is english, the characters may be one chinese character corresponding to several english characters, one chinese character corresponding to several english words, one chinese term corresponding to several english characters, or one chinese term corresponding to several english words).
c can be obtained statistically as follows:
[equation image not reproduced: definition of the character-ratio random variable]
which obeys a normal distribution. Statistics over large corpora show that, for language l1, the number C of corresponding characters in language l2 is a random variable following the normal distribution N(c, s²); the random variable above is defined on this basis. Here l1 and l2 are the two languages, and the two parameters c and s² of the normal distribution can be obtained by nonlinear regression from sampling statistics of the corpus.
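As an illustration of estimating the two parameters (the text proposes nonlinear regression; a plain moment estimate over hypothetical character-count pairs is sketched here instead, so the values are not the patent's):

```python
# Fit the normal distribution N(c, s^2) of the target-to-source
# character-count ratio by sample mean and variance over corpus
# pairs (source_char_count, target_char_count).
def fit_char_ratio(pairs):
    ratios = [t / s for s, t in pairs]
    c = sum(ratios) / len(ratios)                         # mean ratio
    s2 = sum((r - c) ** 2 for r in ratios) / len(ratios)  # variance
    return c, s2

# Hypothetical observations of (source chars, target chars).
c, s2 = fit_char_ratio([(10, 20), (8, 16), (5, 12)])
```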
The evaluation-score formula computes the evaluation score of each alignment pattern in the current window; in the course of this computation it implicitly computes the accuracy values of the different sentence alignments under each pattern in the window (both those exceeding the set threshold and those below it), and records each sentence pair whose accuracy value exceeds the set threshold together with that accuracy value.
For example, when the loop reaches j = 2 and i = 3, suppose that under alignment pattern 1-1 the sentence pairs whose accuracy values exceed the set threshold are S5 with T4', S6 with T5', and S7 with T6'; the corresponding sentence pair identifiers are then recorded.
If the same sentence appears in multiple sentence pairs within the same window (i.e., within this single loop iteration of this step) or across different windows (see step S103 for details), the sentence pair with the higher accuracy value is selected and recorded.
For example, suppose a window contains the sentence pairs S5 with T4', S5 with T5', S5 with (T4' and T5'), and (S5 and S6) with (T4', T5', and T6'); all four pairs contain S5, so only the pair with the highest alignment accuracy is retained; in this embodiment, S5 with T4' is retained.
Continuing the specific example, when the loop reaches j = 2 and i = 3, each 5 × 5 window has a corresponding score matrix:

[score-matrix image: Figure BDA0001318070030000131]
That is, when the source language sentence set contains sentences S3, S4, …, S8 and the target language sentence set contains sentences T2', T3', …, T7', assume the evaluation scores of this step S1026 in each of the above alignment modes are as follows:
in alignment mode 1-0, the evaluation score obtained is 0.167; the source-to-target combination in this mode is S8 aligned to 0 (i.e., to no target sentence);
in alignment mode 0-1, the evaluation score obtained is 0.167; the combination in this mode is 0 (no source sentence) aligned to T2';
in alignment mode 1-1, the evaluation score obtained is 0.5; the combinations in this mode are: S5 with T4'; S6 with T5'; S7 with T6';
in alignment mode 2-1, the evaluation score obtained is 0.167; the combination in this mode is: S3 and S4 with T3';
the evaluation scores in the remaining alignment modes are 0.
After the execution of S102 is completed (with its nested loops completed in order), the loop has run 3 × 5 = 15 times in this example where n = 10 and m = 8 (the counts being the values of j and i determined in steps S1022 and S1023), and the evaluation score for every alignment mode in every 5 × 5 window has been computed by the evaluation function. The alignment modes and the alignment accuracy values of the corresponding sentence pairs in each alignment mode are recorded.
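The window scan of step S102 can be sketched as follows. The scoring function here is a stand-in (the patent's actual evaluation formula uses sentence lengths, word sets and the bilingual probability dictionary), and the loop bounds j_steps and i_steps correspond to the values determined in steps S1022 and S1023; the mode list is an illustrative subset.

```python
WINDOW = 5
MODES = ["1-0", "0-1", "1-1", "2-1", "1-2", "2-2"]  # illustrative subset

def scan_windows(src_sents, tgt_sents, score_fn, j_steps, i_steps):
    """Slide a WINDOW x WINDOW window over the two sentence sets and
    compute the score of every alignment mode at every window position."""
    results = {}
    for j in range(j_steps):          # window offsets per S1022 / S1023
        for i in range(i_steps):
            src_win = src_sents[j:j + WINDOW]
            tgt_win = tgt_sents[i:i + WINDOW]
            results[(j, i)] = {m: score_fn(m, src_win, tgt_win)
                               for m in MODES}
    return results

src = [f"S{k}" for k in range(1, 11)]   # n = 10
tgt = [f"T{k}'" for k in range(1, 9)]   # m = 8
# dummy scorer; a real one would implement the evaluation formula
scores = scan_windows(src, tgt, lambda m, s, t: 0.0, j_steps=3, i_steps=5)
# 3 x 5 = 15 window positions, matching the example in the text
```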
S103, according to the evaluation scores recorded in each cell_ij of the matrix M_Align determined in S102, acquiring each recorded alignment path corresponding to the alignment modes whose evaluation scores are greater than a specified threshold.
For the optimal value in each 5 × 5 evaluation-score matrix obtained above, the alignment path is set in the dynamic programming matrix, i.e., the values of lang1_path and lang2_path in each cell triple (score, lang1_path, lang2_path). For example, for alignment mode 2-1, assuming the evaluation-function score in this mode is 0.167, the triple is set to

cell(score = 0.167, lang1_path = 2, lang2_path = 1).

At the same time, the alignment path records which two source language sentences are aligned with which target language sentence in alignment mode 2-1 (i.e., it records the sentence identifiers). For example, the source-to-target combination in this mode is S3 and S4 with T3'; accordingly, cell_33 and cell_43 of the aforementioned dynamic programming matrix are each set to cell(score = 0.167, lang1_path = 2, lang2_path = 1) and record the sentence pair S3, S4 with T3'.
For example, if the four sentence pairs containing S5, namely S5 with T4', S5 with T5', S5 with (T4' and T5'), and (S5 and S6) with (T4', T5' and T6'), appear across four 5 × 5 windows, the sentence pair with the highest sentence alignment accuracy value is retained. In this example S5 with T4' is retained, i.e., with the alignment mode 1-1 evaluation score of 0.5, cell_54 is set to cell(score = 0.5, lang1_path = 1, lang2_path = 1) and records the sentence pair S5 with T4'.
The other cell_ij entries are set accordingly, following the above example.
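The cell settings described above can be sketched with a small data structure. The field names follow the text; the dataclass container and the 0-based indexing (cell_33 in the 1-based text becomes M_align[2][2]) are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    score: float = 0.0
    lang1_path: int = 0   # source sentences consumed at this step
    lang2_path: int = 0   # target sentences consumed at this step

n, m = 10, 8
M_align = [[Cell() for _ in range(m)] for _ in range(n)]

# mode 2-1 with score 0.167 at cell_33 and cell_43 (1-based in the text)
M_align[2][2] = Cell(0.167, 2, 1)
M_align[3][2] = Cell(0.167, 2, 1)
# mode 1-1 with score 0.5 at cell_54
M_align[4][3] = Cell(0.5, 1, 1)
```

In a full implementation each cell would also carry the recorded sentence identifiers (e.g. "S3, S4 with T3'") alongside the triple.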
From the above, the setting of each cell_ij of the dynamic programming matrix

[matrix image: Figure BDA0001318070030000151]

is completed.
S104, acquiring an alignment path sequence of the source language sentences and target language sentences to be aligned according to the alignment paths.
The alignment path sequence finally acquired is an array. For example, the 10 source language sentences to be aligned in the specific example are labeled S1, S2, S3, S4, S5, S6, S7, S8, S9 and S10, and the 8 target language sentences to be aligned are labeled T1', T2', T3', T4', T5', T6', T7' and T8'. Assume that alignment modes 1-1, 1-0 and 2-1, which have the higher evaluation scores, are obtained from the alignment paths, and that each alignment path also records exactly which source language sentence(s) and which target language sentence(s) are aligned in each mode. For example, the sentences aligned between the source and target languages in alignment mode 1-1 are: S1 with T1'; S2 with T2'; S5 with T4'; S6 with T5'; S7 with T6'; S9 with T7'; S10 with T8'. In alignment mode 1-0: S8 aligned to 0. In alignment mode 2-1: S3 and S4 with T3'. According to this information, the alignment path of each cell_ij in the dynamic programming matrix has already been set in step S103. The final alignment sequence of the source language sentences and target language sentences to be aligned can then be derived from the dynamic programming matrix as follows:
S1–T1'; S2–T2'; S3, S4–T3'; S5–T4'; S6–T5'; S7–T6'; S8–0; S9–T7'; S10–T8'

[alignment-sequence tables: Figure BDA0001318070030000152, Figure BDA0001318070030000161]
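The derived alignment sequence of the worked example, written as the array the text describes, might look like the following sketch; alignment to 0 (mode 1-0) is represented here by an empty target list, a representation chosen for illustration.

```python
# Final alignment sequence of the worked example: each entry pairs the
# aligned source sentence(s) with the aligned target sentence(s).
alignment_sequence = [
    (["S1"], ["T1'"]),           # mode 1-1
    (["S2"], ["T2'"]),           # mode 1-1
    (["S3", "S4"], ["T3'"]),     # mode 2-1
    (["S5"], ["T4'"]),           # mode 1-1
    (["S6"], ["T5'"]),           # mode 1-1
    (["S7"], ["T6'"]),           # mode 1-1
    (["S8"], []),                # mode 1-0 (aligned to 0)
    (["S9"], ["T7'"]),           # mode 1-1
    (["S10"], ["T8'"]),          # mode 1-1
]

# every source sentence is covered exactly once, in order
covered_src = [s for pair in alignment_sequence for s in pair[0]]
```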
In summary, a dynamic programming matrix is constructed from the sentence lengths, word counts and overall sentence ratio of the source and target languages to be aligned, together with the bilingual probability distribution dictionary; evaluation scores based on sentence length information, word information and word inter-translation probability are obtained for the different alignment modes; alignment paths are then acquired from the evaluation scores; and finally the alignment path sequence of the source language and target language sentences is acquired from the alignment paths. Compared with the prior art, this alignment method achieves a better alignment effect.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
For example, the window size used in the above steps may also be 4 × 4, 4 × 5, 6 × 6, etc.; the set of alignment modes is then changed correspondingly to match the window size.
In addition, even with the window size unchanged, different alignment modes can be selected according to the source and target languages. For example, with the 5 × 5 window of the example of the present invention, alignment modes such as 3-4 (3 source language sentences to 4 target language sentences) and 4-3 (4 source language sentences to 3 target language sentences) can be added, or one of the alignment modes listed in the above steps can be omitted. As described above, the alignment modes may be selected according to the statistical probability with which a given alignment mode occurs for the corresponding source and target languages.
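The configurability discussed above can be sketched as a small configuration object; the default window and mode list shown are illustrative, not the patent's exact set.

```python
# Illustrative aligner configuration: window size plus admissible
# alignment modes ("3-4" means 3 source sentences to 4 target sentences).
DEFAULT_CONFIG = {
    "window": (5, 5),
    "modes": ["1-0", "0-1", "1-1", "2-1", "1-2", "2-2"],
}

def with_extra_modes(config, extra):
    """Return a copy of config with language-pair specific modes added,
    leaving the default configuration untouched."""
    out = dict(config)
    out["modes"] = config["modes"] + [m for m in extra
                                      if m not in config["modes"]]
    return out

# e.g. a language pair where 3-4 and 4-3 alignments occur often enough
cfg = with_extra_modes(DEFAULT_CONFIG, ["3-4", "4-3"])
```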

Claims (8)

1. A sentence alignment method of bilingual parallel corpus is characterized by comprising the following steps:
A. acquiring a bilingual probability distribution dictionary containing a word translation pair and a word translation probability of a source language and a target language;
B. constructing a dynamic programming matrix according to the number of sentences of a source language and a target language of a text to be aligned;
determining evaluation scores based on sentence length information, word information and word inter-translation probability under different alignment modes according to the dynamic programming matrix and the bilingual probability distribution dictionary;
C. according to the evaluation score, determining an alignment path under an alignment mode with the evaluation score larger than a specified threshold;
D. determining an alignment path sequence of the source language sentences and target language sentences of the text to be aligned according to the alignment path.
2. The method of claim 1, wherein step A further comprises:
converting words of a source language and a target language in the bilingual probability distribution dictionary into a number form for storage;
and numbering the sentences of the source language and the target language of the text to be aligned according to the sequence of the sentences, and numbering the words after word segmentation according to the numbers in the bilingual probability distribution dictionary.
3. The method of claim 2, wherein step B comprises:
b1, constructing a dynamic programming matrix M_Align according to the numbers of sentences of the source language and the target language of the text to be aligned, wherein:

[matrix definition image: Figure FDA0002516701440000011]

cell_ij represents an element of the matrix, n is the number of source language sentences and m is the number of target language sentences; cell_ij is a triple (score, lang1_path, lang2_path), where score records the alignment mode evaluation score at the current position, lang1_path records the alignment path of the source language at that position, and lang2_path records the alignment path of the target language at that position; the alignment path records the alignment mode and the corresponding sentence identifiers of the source language and the target language;
b2, setting a two-dimensional window smaller than n × m;
b3, moving the two-dimensional window over the source language sentence set and the target language sentence set to be aligned;
and, for the source language and target language text to be aligned within the window, calculating the evaluation scores of the source language sentence set and the target language sentence set under the different alignment modes according to the sentence lengths, the word sets of the sentence sets, the word counts of the sentence sets, the word occurrence frequencies, and the bilingual probability distribution dictionary.
4. The method according to claim 3, wherein the evaluation scores under the different alignment modes in step b3 are calculated by the following formula:

[formula image: Figure FDA0002516701440000021]

where length_penalty_sentence is the sentence length penalty; penalty_matrix is the sentence length penalty coefficient; xtokens is the word set of the source language sentences and ytokens is the word set of the target language sentences; l_xtokens is the number of words in the source language sentence set and l_ytokens is the number of words in the target language sentence subset; X_wc is the total word count of the source language sentence set; xfreq and yfreq are the frequency values of the current source language word and the current target language word in the sentence set; and y_wfreq is the translation probability of the current source language word of the loop in the bilingual probability distribution dictionary.
5. The alignment method according to claim 4, wherein the two-dimensional window is a 5 × 5 window;
the sentence length penalty coefficient penalty_matrix is:

[matrix image: Figure FDA0002516701440000022]

where P_ab represents the sentence length penalty coefficient under the alignment mode of a source language sentences to b target language sentences.
6. The alignment method according to claim 4, wherein the sentence length penalty length_penalty_sentence is calculated by the following formula:

[formula image: Figure FDA0002516701440000031]

where xlen represents the length of the source language sentence subset in the window under the current mode, ylen represents the length of the target language sentence subset in the window under the current mode, M and N represent the weighting ratio, and L is the critical value of a standard aligned sentence-pair length; the remaining parameter (its symbol appears in the formula) is the average number of characters in the target language corresponding to one source language character.
7. The alignment method according to claim 3, further comprising:
under the same two-dimensional window, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value for recording.
8. The alignment method according to claim 3, further comprising:
under different two-dimensional windows, when the same sentence exists in a plurality of sentence pairs, selecting the sentence pair with the highest sentence alignment accuracy value for recording.
CN201710433746.5A 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus Expired - Fee Related CN107391495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710433746.5A CN107391495B (en) 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus


Publications (2)

Publication Number Publication Date
CN107391495A CN107391495A (en) 2017-11-24
CN107391495B true CN107391495B (en) 2020-08-21

Family

ID=60332179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710433746.5A Expired - Fee Related CN107391495B (en) 2017-06-09 2017-06-09 Sentence alignment method of bilingual parallel corpus

Country Status (1)

Country Link
CN (1) CN107391495B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109062910A (en) * 2018-07-26 2018-12-21 苏州大学 Sentence alignment method based on deep neural network
CN109344413B (en) * 2018-10-16 2022-05-20 北京百度网讯科技有限公司 Translation processing method, translation processing device, computer equipment and computer readable storage medium
CN109697287B (en) * 2018-12-20 2020-01-21 龙马智芯(珠海横琴)科技有限公司 Sentence-level bilingual alignment method and system
CN111178089B (en) * 2019-12-20 2023-03-14 沈阳雅译网络技术有限公司 Bilingual parallel data consistency detection and correction method
CN113642337B (en) * 2020-05-11 2023-12-19 阿里巴巴集团控股有限公司 Data processing method and device, translation method, electronic device, and computer-readable storage medium
CN112668307B (en) * 2020-12-30 2022-06-21 清华大学 Automatic bilingual sentence alignment method and device
CN115345127A (en) * 2022-06-08 2022-11-15 甲骨易(北京)语言科技股份有限公司 Parallel corpus sentence level alignment system and method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN101667177A (en) * 2009-09-23 2010-03-10 清华大学 Method and device for aligning bilingual text
CN101989261A (en) * 2009-08-01 2011-03-23 中国科学院计算技术研究所 Method for extracting phrases of statistical machine translation
CN103942339A (en) * 2014-05-08 2014-07-23 深圳市宜搜科技发展有限公司 Synonym mining method and device

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20080154577A1 (en) * 2006-12-26 2008-06-26 Sehda,Inc. Chunk-based statistical machine translation system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200119

Address after: 305, 3/F, China Tianli building, No. 56, Zhichun Road, Haidian District, Beijing 100086

Applicant after: Beijing Tongwen Century Technology Co., Ltd

Address before: 100086, Haidian District, Zhichun Road, Beijing, No. 56, China Tianli building, seventh floor

Applicant before: Beijing I Translation Technology Co., Ltd.

GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200821

Termination date: 20210609