CN111259652A

CN111259652A - Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment

Info

Publication number: CN111259652A
Application number: CN202010084543.1A
Authority: CN
Inventors: 鲁思祈
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2020-02-10
Filing date: 2020-02-10
Publication date: 2020-06-09
Anticipated expiration: 2040-02-10
Also published as: CN111259652B

Abstract

The application relates to a bilingual corpus sentence alignment method, a bilingual corpus sentence alignment device, a computer-readable storage medium and computer equipment, wherein the method comprises the following steps: acquiring a parallel text to be aligned and the language types of its original text and translated text; preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned; calling, from a monolingual word segmentation model group trained by the SentencePiece algorithm, the monolingual word segmentation models corresponding to the language types of the original text and the translated text, and performing word segmentation processing to obtain a sentence fragment group of the original text to be aligned and a sentence fragment group of the translated text to be aligned; and carrying out format processing on the two sentence fragment groups according to a preset format processing mode to obtain a bilingual sentence pair group, calling a sentence alignment tool, and carrying out sentence alignment processing on the bilingual sentence pair group according to a bilingual dictionary to obtain a sentence alignment parallel corpus. Using monolingual word segmentation models of various languages trained by the SentencePiece algorithm reduces the code coupling, the maintenance difficulty and the maintenance cost.

Description

Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a bilingual corpus sentence alignment method, an apparatus, a computer-readable storage medium, and a computer device.

Background

When sentence-level alignment is performed on bilingual parallel corpora that are already aligned at the chapter level, a feasible method is to judge the similarity of the sentences in the two languages by using sentence length information and vocabulary information.

For example, if the lengths of two sentences differ greatly, their similarity is low and they are unlikely to be a parallel sentence pair. Conversely, if two sentences contain the same number or the same letter string, their similarity is high and they are likely to be a parallel sentence pair. Likewise, when two sentences contain words that express the same concept in the two languages, their similarity is also higher, for example, when an English sentence contains "Framework" and a Chinese sentence contains the Chinese word for "framework". Based on this alignment logic, the general processing flow is to tokenize the sentences of the two languages separately (tokenization is equivalent to word segmentation, that is, breaking a continuous sentence into words) and to provide a bilingual dictionary, generated or extracted in advance, as auxiliary information for alignment. If no existing bilingual dictionary is available, the corpus can first be aligned by the sentence length method, a bilingual dictionary can be extracted from this preliminarily aligned corpus, and the dictionary can then be used for a second alignment pass.

However, when corpora in multiple languages are to be aligned, a word segmentation tool for each language must be deployed on one server in order to extract bilingual dictionaries for the different languages and to align the corpora. Taking Python as an example, jieba can be used for Chinese, mecab for Japanese, the ko (Korean) extension of mecab for Korean, and so on. The different word segmentation tools not only depend on different runtime environments (for example, mecab requires additional C++ support, and the ko extension of mecab can only run under Python 3.7), but also each need to load their own dictionary files. As a result, the coupling of the code increases greatly and the maintenance cost is high.

Disclosure of Invention

Based on this, it is necessary to provide a bilingual corpus sentence alignment method, apparatus, computer-readable storage medium, and computer device for solving the problem of high maintenance cost of bilingual corpus sentence alignment.

A bilingual corpus sentence alignment method comprises the following steps:

acquiring a parallel text to be aligned, and the language type of an original text and the language type of a translated text in the parallel text to be aligned;

preprocessing the parallel texts to be aligned to obtain parallel sentence pairs to be aligned;

calling a monolingual word segmentation model corresponding to the language type of the original text from a monolingual word segmentation model group, and performing word segmentation processing on the original text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the original text to be aligned;

calling a monolingual word segmentation model corresponding to the language type of the translation text from the monolingual word segmentation model group, and performing word segmentation processing on the translation text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the translation to be aligned;

performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;

acquiring a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text based on the preset format processing mode;

calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain sentence alignment parallel corpora;

the training mode of the monolingual word segmentation model comprises the following steps:

acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained;

preprocessing the monolingual data to obtain a monolingual data sample;

and carrying out monolingual word segmentation model training based on the monolingual data sample through the SentencePiece algorithm to obtain a monolingual word segmentation model.

In one embodiment, the training mode of the bilingual dictionary comprises:

acquiring a sentence alignment parallel corpus sample corresponding to the language type of a bilingual dictionary to be trained from a sentence alignment parallel corpus, wherein the language type of the bilingual dictionary to be trained comprises the language type of an original language corpus and the language type of a translated language corpus;

preprocessing the sentence alignment parallel corpus sample to obtain a sentence alignment parallel corpus pair;

calling a monolingual word segmentation model corresponding to the language type of the original corpus from the monolingual word segmentation model group, and performing word segmentation processing on the original corpus in the sentence alignment parallel corpus pair to obtain a sentence fragment group of a sample original text;

calling a monolingual word segmentation model corresponding to the language type of the translation corpus from the monolingual word segmentation model group, and performing word segmentation processing on the translation corpus in the sentence alignment parallel corpus pair to obtain a sentence fragment group of the sample translation;

according to the preset format processing mode, carrying out format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translation text to obtain a bilingual sentence pair sample group;

and aligning the bilingual sentence pair sample group through a bilingual word pair extraction algorithm to obtain a bilingual dictionary.

In one embodiment, the preset format processing method includes:

obtaining a sentence fragment group to be formatted;

and detecting underscore characters in the sentence fragment group, and removing the detected underscore characters from the sentence fragment group.

In one embodiment, the preset format processing method includes:

obtaining a sentence fragment group to be formatted and a corresponding language type;

determining whether the sentence fragment group belongs to a format processing object according to the language type of the sentence fragment group;

when the sentence fragment group belongs to the format processing object, detecting underscore characters in the sentence fragment group, and removing the detected underscore characters from the sentence fragment group.

In one embodiment, after the step of calling the sentence alignment tool to perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain the sentence-aligned parallel corpus, the method further includes:

and filtering the sentence alignment parallel corpus based on a preset filtering condition to obtain the filtered sentence alignment parallel corpus.

In one embodiment, the preset filtering condition comprises at least one of the following conditions:

analyzing whether a blank sentence exists in the sentence alignment parallel corpus or not, and filtering the blank sentence in the sentence alignment parallel corpus;

filtering sentences with scores smaller than a preset value in the sentence alignment parallel corpus according to the preset value;

filtering out, according to the language type of the original text and the language type of the translated text, sentences in the sentence alignment parallel corpus whose language types are inconsistent with them;

and filtering out sentences in the sentence alignment parallel corpus whose features, such as numbers, do not match between the original text and the translated text.
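As an illustrative, non-limiting sketch, the filtering conditions listed above could be combined as follows. The `AlignedPair` record, its `score` field, the `detect_language` callback and the rule that "number features match" means both sentences contain the same numbers are all assumptions introduced for this example; the application does not fix any of them.

```python
import re
from dataclasses import dataclass


@dataclass
class AlignedPair:
    source: str   # original-text sentence
    target: str   # translated-text sentence
    score: float  # alignment score reported by the alignment tool (assumed field)


_NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def numbers_match(source, target):
    """Assumed rule: both sentences must contain the same multiset of numbers."""
    return sorted(_NUMBER_RE.findall(source)) == sorted(_NUMBER_RE.findall(target))


def filter_pairs(pairs, min_score=0.0, detect_language=None, src_lang=None, tgt_lang=None):
    """Keep only the pairs that pass every enabled filtering condition."""
    kept = []
    for pair in pairs:
        # Condition 1: drop pairs that contain a blank sentence.
        if not pair.source.strip() or not pair.target.strip():
            continue
        # Condition 2: drop pairs scored below the preset value.
        if pair.score < min_score:
            continue
        # Condition 3: drop pairs whose detected languages differ from the expected ones.
        if detect_language is not None and (
            detect_language(pair.source) != src_lang
            or detect_language(pair.target) != tgt_lang
        ):
            continue
        # Condition 4: drop pairs whose numbers do not match.
        if not numbers_match(pair.source, pair.target):
            continue
        kept.append(pair)
    return kept
```

Language checking is optional here because it needs an external detector; passing `detect_language=None` skips that condition.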

In one embodiment, the method further comprises the following steps:

and adding the obtained sentence alignment parallel corpus to the sentence alignment parallel corpus library.

A bilingual corpus sentence alignment apparatus, comprising:

the parallel text acquisition module is used for acquiring parallel texts to be aligned, and the language type of an original text and the language type of a translated text in the parallel texts to be aligned;

the preprocessing module is used for preprocessing the parallel texts to be aligned to obtain parallel sentence pairs to be aligned;

the first word segmentation processing module is used for calling a monolingual word segmentation model corresponding to the language type of the original text from the monolingual word segmentation model group, and performing word segmentation processing on the original text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the original text to be aligned;

the second word segmentation processing module is used for calling a monolingual word segmentation model corresponding to the language type of the translation text from the monolingual word segmentation model group, and performing word segmentation processing on the translation text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the translation to be aligned;

the format processing module is used for carrying out format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group;

a bilingual dictionary obtaining module, configured to obtain a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text based on the preset format processing manner;

the sentence alignment processing module is used for calling a sentence alignment tool and carrying out sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain sentence alignment parallel linguistic data;

preprocessing the monolingual data to obtain a monolingual data sample;

A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method.

A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of the method.

According to the bilingual corpus sentence alignment method, the bilingual corpus sentence alignment device, the computer-readable storage medium and the computer equipment described above, a parallel text to be aligned and the language type of the original text and the language type of the translated text in the parallel text to be aligned are acquired; the parallel text to be aligned is preprocessed to obtain parallel sentence pairs to be aligned; a monolingual word segmentation model corresponding to the language type of the original text is called from a monolingual word segmentation model group trained through the SentencePiece algorithm, and word segmentation processing is performed on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned; a monolingual word segmentation model corresponding to the language type of the translated text is called, and word segmentation processing is performed on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translated text to be aligned; format processing is performed on the two sentence fragment groups according to a preset format processing mode to obtain a bilingual sentence pair group, which improves the alignment precision of the sentence alignment tool; a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text is acquired based on the preset format processing mode; and a sentence alignment tool is called to perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain a sentence alignment parallel corpus.
With the monolingual word segmentation models of various languages trained through the SentencePiece algorithm, a single processing flow can handle bilingual corpus sentence alignment for all required languages at the same time, which greatly reduces the design difficulty and the code complexity and lowers the code coupling, the maintenance difficulty and the maintenance cost; in addition, the format processing of the sentence fragments improves the alignment precision, so that the resulting sentence alignment parallel corpus is more accurate.

Drawings

FIG. 1 is a diagram of an exemplary environment in which a bilingual corpus sentence alignment method may be implemented;

FIG. 2 is a flow diagram illustrating a bilingual corpus sentence alignment method according to an embodiment;

FIG. 3 is a schematic diagram illustrating a flowchart of training a monolingual segmentation model according to an embodiment;

FIG. 4 is a flow diagram illustrating training of a bilingual dictionary in an embodiment;

FIG. 5 is a diagram illustrating an application of the bilingual corpus sentence alignment method in an embodiment;

FIG. 6 is a flowchart illustrating a bilingual corpus sentence alignment method according to an embodiment;

FIG. 7 is a block diagram showing the structure of a bilingual corpus sentence alignment apparatus according to an embodiment;

FIG. 8 is a block diagram showing the structure of a bilingual corpus alignment apparatus according to another embodiment;

FIG. 9 is a block diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

FIG. 1 is a diagram of an exemplary environment in which a bilingual corpus sentence alignment method may be implemented. The application environment relates to the terminal 110, or to the terminal 110 and the server 120. The terminal 110 and the server 120 are connected through a network. When the terminal 110 is involved, the terminal 110 acquires the parallel text to be aligned, the language type of the original text and the language type of the translated text in the parallel text to be aligned; preprocessing the parallel texts to be aligned, calling a monolingual word segmentation model, and performing word segmentation on the original text and the translated text to obtain a sentence fragment group; carrying out format processing on the sentence fragment group to obtain a bilingual sentence pair group; and calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group based on the bilingual dictionary to obtain sentence alignment parallel corpora. When the terminal 110 and the server 120 are involved, the server 120 obtains the parallel text to be aligned sent by the terminal 110, and the language type of the original text and the language type of the translated text in the parallel text to be aligned; the server 120 preprocesses the parallel text to be aligned, calls a monolingual word segmentation model, and performs word segmentation on the original text and the translated text to obtain a sentence fragment group; carrying out format processing on the sentence fragment group to obtain a bilingual sentence pair group; and calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group based on the bilingual dictionary to obtain sentence alignment parallel corpora. 
The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers.

In one embodiment, as shown in FIG. 2, a bilingual corpus sentence alignment method is provided. The embodiment is mainly illustrated by applying the method to the terminal 110 in fig. 1. Referring to fig. 2, the bilingual corpus sentence alignment method specifically includes the following steps:

step S220, acquiring a parallel text to be aligned, and a language type of an original text and a language type of a translated text in the parallel text to be aligned.

The parallel text to be aligned is a bilingual text that needs sentence alignment and consists of an original text and the corresponding translated text. It may be a chapter-level parallel text, which can be obtained from various channels, for example by crawling websites, by collecting publicly available data sets, or by manual translation. Either text of the parallel text to be aligned may serve as the original text; once one text is designated as the original text, the other is the translated text. The language type of the original text and the language type of the translated text may be any two of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and the like.

The language type of the original text and the language type of the translated text are determined according to the languages used in the parallel text to be aligned. For example, if the parallel text to be aligned consists of a Chinese text and an English text that have the same meaning and are translations of each other, the Chinese text can be taken as the original text and the English text as the translated text, in which case the language type of the original text is Chinese and the language type of the translated text is English; alternatively, the English text can be taken as the original text and the Chinese text as the translated text, in which case the language type of the original text is English and the language type of the translated text is Chinese. The language types can be specified by the user through language selection on the terminal, or determined by detecting the parallel text to be aligned.

In one embodiment, based on a trained language detection model, language type detection is performed on the acquired parallel texts to be aligned, and the language type of an original text and the language type of a translated text in the parallel texts to be aligned are determined. The language detection model can be obtained by training based on a naive Bayes algorithm. By automatically detecting the language type of the original text and the language type of the translated text of the parallel text to be aligned, the working efficiency can be improved.
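By way of a hedged illustration only, a naive Bayes language detector of the kind mentioned above might look like the following. The choice of character unigrams and bigrams as features, the add-one smoothing, and the class and method names are assumptions made for this sketch; the application states only that the model can be trained with a naive Bayes algorithm.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLanguageDetector:
    """Multinomial naive Bayes over character unigrams and bigrams (assumed features)."""

    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.sample_counts = Counter()           # language -> number of training texts
        self.vocabulary = set()                  # every n-gram seen during training

    @staticmethod
    def _features(text):
        text = text.lower()
        bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
        return list(text) + bigrams

    def fit(self, samples):
        """Train on an iterable of (text, language) pairs."""
        for text, language in samples:
            grams = self._features(text)
            self.gram_counts[language].update(grams)
            self.sample_counts[language] += 1
            self.vocabulary.update(grams)
        return self

    def predict(self, text):
        """Return the language with the highest posterior log-probability."""
        total_samples = sum(self.sample_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.gram_counts.items():
            denominator = sum(counts.values()) + vocab_size
            # Log prior plus add-one smoothed log likelihood of each feature.
            score = math.log(self.sample_counts[language] / total_samples)
            for gram in self._features(text):
                score += math.log((counts[gram] + 1) / denominator)
            if score > best_score:
                best_language, best_score = language, score
        return best_language
```

In practice such a detector would be trained on a large monolingual sample per language and applied separately to the original text and the translated text.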

Step S240, preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned.

The preprocessing may include sentence splitting, that is, splitting the original text and the translated text of the parallel text to be aligned into sentences; an existing sentence splitting method can be used to obtain the parallel sentence pairs to be aligned. The preprocessing may further include illegal character cleaning, low-quality corpus removal, full-width/half-width conversion, and the like. Illegal character cleaning deletes control characters, emoji and similar characters from the parallel text to be aligned; the illegal characters can be located and deleted by looking them up in the Unicode character table. Low-quality corpus removal includes: removing corpora that contain many disordered numbers and punctuation marks and clearly do not follow everyday usage, which can be filtered by the proportion of numbers and punctuation marks in a sentence; and removing corpora of mixed languages, that is, corpus of language B mixed into the corpus of language A, which can be done with tools such as language identification. Full-width/half-width conversion converts punctuation marks, numbers and the like in the original text and the translated text; for example, when English punctuation marks are used in Chinese text, they are converted into Chinese punctuation marks. After sentence splitting, illegal character cleaning, low-quality corpus removal, full-width/half-width conversion and similar processing, the parallel sentence pairs to be aligned are obtained.
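The preprocessing operations described above can be sketched roughly as follows. This is a simplified illustration under several assumptions: the sentence splitter only recognizes `。！？!?` as sentence terminators, "illegal characters" are taken to be Unicode categories `C*` and `So`, the punctuation-and-digit ratio threshold of 0.5 is arbitrary, and all function names are hypothetical. Mixed-language removal is omitted because it requires a language identification tool.

```python
import re
import unicodedata

# Simplified rule: a sentence is a run of text plus any trailing terminators.
_SENTENCE_RE = re.compile(r"[^。！？!?]+[。！？!?]*")
_HALF_TO_FULL_PUNCT = str.maketrans({",": "，", "?": "？", "!": "！", ":": "：", ";": "；"})


def split_sentences(text):
    return [s.strip() for s in _SENTENCE_RE.findall(text) if s.strip()]


def remove_illegal_chars(text):
    """Drop control/format characters (category C*) and symbols such as emoji (So)."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C")
        and unicodedata.category(ch) != "So"
    )


def normalize_width(text, lang):
    """Full-width digits/letters -> half-width; for Chinese, ASCII punctuation -> full-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if 0xFF10 <= code <= 0xFF19 or 0xFF21 <= code <= 0xFF3A or 0xFF41 <= code <= 0xFF5A:
            ch = chr(code - 0xFEE0)
        out.append(ch)
    text = "".join(out)
    if lang == "zh":
        text = text.translate(_HALF_TO_FULL_PUNCT)
    return text


def is_low_quality(sentence, max_ratio=0.5):
    """Flag sentences dominated by digits and punctuation."""
    if not sentence:
        return True
    noisy = sum(
        1 for ch in sentence
        if ch.isdigit() or unicodedata.category(ch).startswith("P")
    )
    return noisy / len(sentence) > max_ratio


def preprocess(text, lang):
    sentences = []
    for sentence in split_sentences(text):
        sentence = normalize_width(remove_illegal_chars(sentence), lang)
        if not is_low_quality(sentence):
            sentences.append(sentence)
    return sentences
```

Splitting is done before character cleaning so that line breaks, which are control characters, cannot glue two sentences together before they are separated.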

Step S260, a monolingual word segmentation model corresponding to the language type of the original text is called from the monolingual word segmentation model group, and word segmentation processing is carried out on the original text in the parallel sentence pair to be aligned, so that a sentence fragment group of the original text to be aligned is obtained.

The monolingual word segmentation model group includes monolingual word segmentation models for languages such as Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, and Malay; each model is trained on monolingual data of the corresponding language type. The monolingual word segmentation model corresponding to the language type of the original text is called from the model group; for example, if the language type of the original text of the parallel sentence pairs to be aligned is Chinese, the Chinese monolingual word segmentation model is called from the monolingual word segmentation model group. The called model is run and performs word segmentation on the original text of the parallel sentence pairs to be aligned according to its word segmentation vocabulary, dividing the original text into sentence fragments that together form the sentence fragment group of the original text to be aligned. For example, when "I work in the flight in Beijing" is input to the monolingual word segmentation model, the output sentence fragment group may be "I work in the flight in Beijing", with the fragments separated from one another.
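To make the lookup-and-segment step concrete, the following toy sketch selects a model by language type and segments text against a vocabulary with Viterbi search, the decoding scheme used by unigram language model segmenters. Everything here is an assumption for illustration: the `UnigramSegmenter` class, the `MODEL_GROUP` registry and the tiny hand-written vocabularies stand in for trained SentencePiece models, which in practice would be loaded through the `sentencepiece` package.

```python
import math


class UnigramSegmenter:
    """Toy unigram segmenter: picks the split with the highest total log-probability."""

    def __init__(self, vocab):
        # vocab maps each known piece to its probability.
        self.log_prob = {piece: math.log(p) for piece, p in vocab.items()}
        self.max_len = max(len(piece) for piece in vocab)
        # Unknown single characters are allowed at a heavy penalty.
        self.unk_log_prob = min(self.log_prob.values()) - 10.0

    def segment(self, text):
        n = len(text)
        best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
        back = [0] * (n + 1)          # back[i]: start index of the last piece
        best[0] = 0.0
        for end in range(1, n + 1):
            for start in range(max(0, end - self.max_len), end):
                piece = text[start:end]
                lp = self.log_prob.get(piece)
                if lp is None:
                    if end - start > 1:
                        continue  # unknown multi-character pieces are not allowed
                    lp = self.unk_log_prob
                if best[start] + lp > best[end]:
                    best[end] = best[start] + lp
                    back[end] = start
        pieces = []
        end = n
        while end > 0:
            pieces.append(text[back[end]:end])
            end = back[end]
        return pieces[::-1]


# Hypothetical model group keyed by language type.
MODEL_GROUP = {
    "zh": UnigramSegmenter({"我": 0.2, "在": 0.2, "北京": 0.3, "工作": 0.2, "北": 0.05, "京": 0.05}),
    "ja": UnigramSegmenter({"私": 0.2, "は": 0.2, "東京": 0.3, "で": 0.1, "働く": 0.2}),
}


def segment_by_language(text, lang):
    """Call the monolingual model that matches the language type."""
    return MODEL_GROUP[lang].segment(text)
```

Because every language is served by the same class and the same call, adding a language only means registering another model, which is the decoupling the method aims for.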

In one embodiment, as shown in fig. 3, the training method of the monolingual word segmentation model includes:

Step S262, acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained.

The monolingual data is text containing only one language, and monolingual data of the corresponding language type can be crawled from various websites. The language type of the monolingual word segmentation model to be trained may be any one of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay, and the like. For any commonly used language, whether widely or less widely spoken, the supply of monolingual corpus is practically unlimited, so the amount of monolingual data can be chosen according to the accuracy required.

Step S264, preprocessing the monolingual data to obtain a monolingual data sample.

The preprocessing includes illegal character cleaning, low-quality corpus removal, full-width/half-width conversion, and the like. Illegal character cleaning deletes control characters, emoji and similar characters from the monolingual data; the illegal characters can be located and deleted by looking them up in the Unicode character table. Low-quality corpus removal for the monolingual data includes: removing corpora that contain many disordered numbers and punctuation marks and clearly do not follow everyday usage, which can be filtered by the proportion of numbers and punctuation marks in a sentence; and removing corpora of mixed languages, that is, corpus of language B mixed into the corpus of language A, which can be done with tools such as language identification. Full-width/half-width conversion converts punctuation marks, numbers and the like in the monolingual data; for example, when English punctuation marks are used in Chinese monolingual data, they are converted into Chinese punctuation marks. After sentence splitting, illegal character cleaning, low-quality corpus removal, full-width/half-width conversion and similar processing, the monolingual data sample is obtained.

Step S268, carrying out monolingual word segmentation model training based on the monolingual data sample through the SentencePiece algorithm to obtain a monolingual word segmentation model corresponding to the language type.

SentencePiece is a word segmentation algorithm. Monolingual word segmentation model training is carried out on the monolingual data through the SentencePiece algorithm. For example, training with the SentencePiece algorithm on a Chinese monolingual data sample yields a Chinese monolingual word segmentation model and the word segmentation vocabulary of that model; training with the SentencePiece algorithm on an English monolingual data sample yields an English monolingual word segmentation model and the word segmentation vocabulary of that model.

In one embodiment, the step of carrying out monolingual word segmentation model training based on the monolingual data sample through the SentencePiece algorithm includes: randomly initializing a sufficiently large vocabulary from the monolingual data sample; and repeating the following steps until the word segmentation vocabulary reaches the specified size: (1) fixing the vocabulary and optimizing the word probabilities p with the EM algorithm; (2) calculating the loss that would result from removing each word in the vocabulary; (3) removing the 20% of words whose removal causes the least loss, while retaining all single-character words to avoid OOV (out of vocabulary, that is, words in the text that are not in the vocabulary). The coverage parameter of the vocabulary can be adjusted to balance the efficiency cost of a large vocabulary against the loss of precision caused by frequent unknown tokens (a token may be a word, a word fragment or a sentence fragment, depending on the word segmentation mode). For example, for languages such as Chinese and Japanese, whose constituent characters are too numerous to enumerate, the coverage parameter of the word segmentation vocabulary is set to 0.9995, while the coverage for the other languages is set to 1. The size of the word segmentation vocabulary can also be enlarged appropriately for the main translation languages involved (such as Chinese and English); for example, the vocabulary size parameter is set to 48000 for Chinese and English and to 32000 for the remaining languages.
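As a hedged sketch of how the parameters above might be passed to the open-source `sentencepiece` Python package, the helper below builds per-language training arguments. The language codes, file paths and function names are hypothetical, and selecting `model_type="unigram"` is an inference from the EM-and-pruning procedure described above rather than something the text states explicitly.

```python
# Languages whose character inventory is too large to cover exhaustively (per the text).
LARGE_CHARSET_LANGS = {"zh", "ja"}
# Main translation languages that get a larger vocabulary (per the text).
PRIMARY_LANGS = {"zh", "en"}


def sentencepiece_training_args(lang, corpus_path, model_prefix):
    """Build keyword arguments for SentencePiece training for one language."""
    return {
        "input": corpus_path,            # preprocessed monolingual data sample
        "model_prefix": model_prefix,    # output: <prefix>.model and <prefix>.vocab
        "model_type": "unigram",         # assumed from the EM + pruning description
        "vocab_size": 48000 if lang in PRIMARY_LANGS else 32000,
        "character_coverage": 0.9995 if lang in LARGE_CHARSET_LANGS else 1.0,
    }


def train_monolingual_model(lang, corpus_path, model_prefix):
    """Train one monolingual model; requires the third-party sentencepiece package."""
    import sentencepiece as spm  # imported lazily so the helper above works without it

    spm.SentencePieceTrainer.train(
        **sentencepiece_training_args(lang, corpus_path, model_prefix)
    )
```

Keeping the per-language differences in one small table of parameters means the training code itself is identical for every language.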

This solves the problem that different word segmentation tools must be used for different languages during sentence alignment, for example jieba (a Chinese word segmentation tool) for Chinese, mecab (a Japanese word segmentation tool) for Japanese, and the ko extension of mecab for Korean. It also solves the problem that the ko extension of mecab can only run under Python 3.7, and the problem that few word segmentation tools exist for less widely spoken languages, because monolingual word segmentation models for such languages can also be trained through the SentencePiece algorithm. A monolingual word segmentation model group for multiple languages trained with the SentencePiece algorithm can therefore handle bilingual corpus sentence alignment for all required languages at the same time, which greatly simplifies the design and the code complexity of the system, reduces the code coupling and the maintenance difficulty, and increases the speed of word segmentation and hence the speed of bilingual corpus sentence alignment.

Step S280, a monolingual word segmentation model corresponding to the language type of the translation text is called from the monolingual word segmentation model group, word segmentation processing is carried out on the translation text in the parallel sentence pair to be aligned, and a sentence fragment group of the translation to be aligned is obtained.

The monolingual word segmentation model group comprises monolingual word segmentation models for languages such as Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay and the like. Each model is trained on monolingual data of the corresponding language type, and the construction mode is not repeated here. A monolingual word segmentation model corresponding to the language type of the translated text is called from the group. For example, if the language type of the translated text in the parallel sentence pair to be aligned is English, the English monolingual word segmentation model is called. The called model is run and segments the translated text in the parallel sentence pair to be aligned according to its word segmentation vocabulary; the resulting sentence fragments form the sentence fragment group of the translated text to be aligned.
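The lookup described above can be sketched as a registry keyed by language type. The class `SegmenterGroup`, the language code `"en"` and the function `whitespace_stand_in` are hypothetical; the stand-in only imitates the shape of the output (fragments prefixed with U+2581) and is not a trained SentencePiece model.

```python
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]


class SegmenterGroup:
    """Maps a language code to the segmenter trained for that language."""

    def __init__(self) -> None:
        self._models: Dict[str, Segmenter] = {}

    def register(self, lang: str, model: Segmenter) -> None:
        self._models[lang] = model

    def segment(self, lang: str, sentence: str) -> List[str]:
        # Call the model that matches the language type of the text.
        if lang not in self._models:
            raise KeyError(f"no word segmentation model for language {lang!r}")
        return self._models[lang](sentence)


def whitespace_stand_in(sentence: str) -> List[str]:
    # Stand-in only: prefixes each space-separated word with U+2581.
    return ["\u2581" + word for word in sentence.split()]


group = SegmenterGroup()
group.register("en", whitespace_stand_in)
pieces = group.segment("en", "hello world")
```

Keeping every language behind the same interface is what lets one processing flow serve all language pairs: the same call is used for the original text and for the translated text, with only the language code changing.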

Step S300, according to a preset format processing mode, carrying out format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned to obtain a bilingual sentence pair group.

The bilingual sentence pair group consists of the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned after format processing. Format processing addresses the following problem: after word segmentation by a monolingual word segmentation model, the sentence fragment groups retain the "▁" symbol (the character numbered 2581 in the Unicode character table), which marks the beginning of the text and appears as an underline in front of word fragments (the underline replaces a space in the sentence when the monolingual word segmentation model performs word segmentation). For languages that do not usually contain spaces, such as Chinese and Japanese, the same word can then correspond to two different tokens depending on whether it starts the sentence, for example "▁I" and "I", which lowers the accuracy of sentence alignment between the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned.

In one embodiment, the preset format processing manner may be: obtaining a sentence fragment group to be formatted; and detecting underlines in the sentence fragment group, and removing the detected underlines from the sentence fragment group.

The sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned are each taken as a sentence fragment group to be format-processed. Underline detection is performed on the sentence fragment group of the original text to be aligned, and the detected underlines are removed from it; underline detection is likewise performed on the sentence fragment group of the translated text to be aligned, and the detected underlines are removed from it. An underline found in a sentence fragment group can be either replaced or deleted. The two sentence fragment groups with their underlines removed form the bilingual sentence pair group, which can improve the precision of sentence alignment.
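A minimal sketch of this format processing is given below, assuming the underline is the U+2581 character and that a fragment consisting only of the underline is dropped once the underline is deleted. The name `strip_underline` and the decision to drop empty fragments are assumptions.

```python
from typing import Iterable, List

UNDERLINE = "\u2581"  # the marker left by word segmentation


def strip_underline(fragments: Iterable[str]) -> List[str]:
    """Delete the underline marker from every fragment in a sentence fragment group."""
    cleaned = (fragment.replace(UNDERLINE, "") for fragment in fragments)
    # Assumption: a fragment that was only the marker carries no text and is dropped.
    return [fragment for fragment in cleaned if fragment]
```

After this step a sentence-initial word and the same word in mid-sentence map to the same token, so both are matched by the same dictionary entry.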

In one embodiment, the preset format processing manner may also be: obtaining a sentence fragment group to be formatted and a corresponding language type; determining whether the sentence fragment group belongs to a format processing object according to the language type of the sentence fragment group; when the sentence fragment group belongs to the format processing object, underline characters in the sentence fragment group are detected, and the detected underline characters are removed from the sentence fragment group.

The method comprises the steps that a sentence fragment group of an original text to be aligned and a sentence fragment group of a translated text to be aligned are respectively used as sentence fragment groups to be subjected to format processing, and whether the sentence fragment groups of the original text to be aligned belong to format processing objects or not is determined according to the language type of the sentence fragment groups of the original text to be aligned; when the sentence fragment group of the original text to be aligned belongs to the format processing object, underline symbols in the sentence fragment group of the original text to be aligned are detected, and the detected underline symbols are removed from the sentence fragment group of the original text to be aligned. Determining whether the sentence fragment group of the translation to be aligned belongs to a format processing object or not according to the language type of the sentence fragment group of the translation to be aligned; when the sentence fragment group of the translation to be aligned belongs to the format processing object, underline symbols in the sentence fragment group of the translation to be aligned are detected, and the detected underline symbols are removed from the sentence fragment group of the translation to be aligned.

Whether a sentence fragment group belongs to the format processing objects can be determined from whether its language usually contains spaces. Languages that do not usually contain spaces, such as Chinese, Japanese, Korean and Thai, belong to the format processing objects; languages that usually contain spaces, such as English and French, do not. Underlines are searched for in the sentence fragment groups that belong to the format processing objects, and each underline found can be replaced or deleted. Removing the detected underlines from those sentence fragment groups yields the bilingual sentence pair group, which can improve the precision of sentence alignment.
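The language-dependent variant can be sketched as follows. The set `FORMAT_OBJECT_LANGS` uses assumed ISO 639-1 codes for the example languages named above, and `format_fragments` is a hypothetical name.

```python
from typing import Iterable, List

UNDERLINE = "\u2581"

# Assumption: ISO 639-1 codes for the languages named as format processing objects.
FORMAT_OBJECT_LANGS = {"zh", "ja", "ko", "th"}


def format_fragments(fragments: Iterable[str], lang: str) -> List[str]:
    """Remove the underline marker only for languages that are format processing objects."""
    if lang not in FORMAT_OBJECT_LANGS:
        # Languages such as English or French keep their fragments unchanged.
        return list(fragments)
    cleaned = (fragment.replace(UNDERLINE, "") for fragment in fragments)
    return [fragment for fragment in cleaned if fragment]
```

The same function has to be applied when the bilingual dictionary is built, so that dictionary entries and the fragments being aligned share one token form.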

Step S320, based on the preset format processing mode, obtains the bilingual dictionary corresponding to the language type of the original text and the language type of the translated text.

The language pair of the obtained bilingual dictionary corresponds to the language type of the original text and the language type of the translated text, and the sentence fragments in the bilingual dictionary have been processed in the same preset format processing mode. Each pair of languages has its own bilingual dictionary. For example, when the language types of the original text and the translated text are Chinese and English, the obtained bilingual dictionary is a Chinese-English dictionary; when they are Chinese and Korean, the obtained bilingual dictionary is a Chinese-Korean dictionary; and so on.

In one embodiment, as shown in fig. 4, the bilingual dictionary is constructed in a manner including steps S322 to S332:

Step S322, a sentence alignment parallel corpus sample corresponding to the language type of the bilingual dictionary to be trained is obtained from the sentence alignment parallel corpus database, where the language type of the bilingual dictionary to be trained comprises the language type of the original corpus and the language type of the translated corpus.

The sentence alignment parallel corpus database stores sentence-aligned parallel corpora for each language pair, for example: Chinese-English, Chinese-Japanese, Japanese-English and Korean-English sentence-aligned parallel corpora, and the like. The language type of the bilingual dictionary to be trained comprises the language type of the original corpus and the language type of the translated corpus, and can be any two of Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay and the like. The sentence alignment parallel corpus samples are the sentence-aligned parallel corpora used to train the bilingual dictionary; each sentence-aligned parallel corpus is a pair of sentences that are translations of each other, and the number of samples can be determined according to the required construction accuracy.

A sentence alignment parallel corpus sample corresponding to the language type of the bilingual dictionary to be trained is obtained from the sentence alignment parallel corpus database. For example, when a Japanese-English bilingual dictionary needs to be trained, Japanese-English sentence-aligned parallel corpora are obtained from the database and used as the sentence alignment parallel corpus samples.

Step S324, preprocessing the sentence alignment parallel corpus samples to obtain sentence alignment parallel corpus pairs.

The preprocessing comprises: cleaning illegal characters, removing low-quality corpora, full-width/half-width conversion, and the like. Illegal characters, such as control characters and emoji, are deleted from both the original corpus and the translated corpus of the sentence alignment parallel corpus sample; they can be looked up in the Unicode character table and then deleted. Removing low-quality corpora from both the original corpus and the translated corpus comprises: removing corpora that contain many disordered numbers and punctuation marks and obviously do not match everyday usage, by filtering on the proportion of numbers and punctuation marks in the sentence; and removing corpora that mix languages, that is, corpora of language A with language B mixed in, by using tools such as language identification. Full-width/half-width conversion is applied to punctuation marks, numbers and the like in the original corpus and the translated corpus. For example, when Chinese text uses English punctuation marks, the conversion turns them into Chinese punctuation marks. After illegal character cleaning, low-quality corpus removal, full-width/half-width conversion and similar processing, the sentence alignment parallel corpus pair is obtained.
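Three of these preprocessing operations can be sketched with the Python standard library. All function names are hypothetical. Treating Unicode categories Cc and Cf as illegal characters is an assumption, and emoji removal and language identification are left out because they need extra tables or tools. The conversion shown goes from full-width to half-width; the opposite direction, as in the Chinese punctuation example, would use the inverse mapping.

```python
import unicodedata

# Full-width ASCII variants U+FF01..U+FF5E sit exactly 0xFEE0 above their ASCII forms.
_FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
_FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ordinary space


def remove_control_chars(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters found via the Unicode tables."""
    return "".join(ch for ch in text if unicodedata.category(ch) not in ("Cc", "Cf"))


def digit_punct_ratio(text: str) -> float:
    """Share of non-space characters that are numbers (N*) or punctuation (P*)."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 1.0  # treat empty text as noise
    noisy = sum(1 for ch in chars if unicodedata.category(ch)[0] in ("N", "P"))
    return noisy / len(chars)


def to_halfwidth(text: str) -> str:
    """Convert full-width letters, digits and punctuation to half-width."""
    return text.translate(_FULL_TO_HALF)
```

A sentence pair could then be discarded when `digit_punct_ratio` of either side exceeds a chosen limit; the limit itself is not given in the text and would have to be tuned.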

Step S326, a monolingual word segmentation model corresponding to the language type of the original corpus is called from the monolingual word segmentation model group, and word segmentation processing is performed on the original corpus in the sentence-aligned parallel corpus pair to obtain a sentence fragment group of the sample original text.

The monolingual word segmentation model group comprises monolingual word segmentation models for languages such as Chinese, English, Japanese, Korean, Spanish, Hindi, Vietnamese, French, Russian, German, Arabic, Italian, Portuguese, Turkish, Thai, Malay and the like. Each model is trained on monolingual data of the corresponding language type, and the construction mode is not repeated here. A monolingual word segmentation model corresponding to the language type of the original corpus is called from the group. For example, if the language type of the original corpus in the sentence-aligned parallel corpus pair is Chinese, the Chinese monolingual word segmentation model is called. The called model is run and segments the original corpus in the sentence-aligned parallel corpus pair according to its word segmentation vocabulary; the resulting sentence fragments form the sentence fragment group of the sample original text.

Step S328, a monolingual word segmentation model corresponding to the language type of the translated corpus is called from the monolingual word segmentation model group, and word segmentation processing is performed on the translated corpus in the sentence-aligned parallel corpus pair to obtain a sentence fragment group of the sample translation.

The monolingual word segmentation model group is the same group described above, and its construction mode is not repeated here. A monolingual word segmentation model corresponding to the language type of the translated corpus is called from the group. For example, if the language type of the translated corpus in the sentence-aligned parallel corpus pair is English, the English monolingual word segmentation model is called. The called model is run and segments the translated corpus in the sentence-aligned parallel corpus pair according to its word segmentation vocabulary; the resulting sentence fragments form the sentence fragment group of the sample translation.

Step S330, according to the preset format processing mode, format processing is carried out on the sentence fragment group of the sample original text and the sentence fragment group of the sample translation text, and a bilingual sentence pair sample group is obtained.

The bilingual sentence pair sample group consists of the sentence fragment group of the sample original text and the sentence fragment group of the sample translation after format processing. Format processing prevents the sentence fragment groups from retaining, after word segmentation by the monolingual word segmentation model, the "▁" symbol (the character numbered 2581 in the Unicode character table), which marks the beginning of the text and appears as an underline in front of word fragments (the underline replaces a space in the sentence when the monolingual word segmentation model performs word segmentation). For languages that do not usually contain spaces, such as Chinese and Japanese, the same word can otherwise correspond to two different tokens depending on whether it starts the sentence, for example "▁I" and "I". In that case the translation probabilities of word pairs cannot be counted accurately when the bilingual dictionary is subsequently constructed, and the dictionary is of little practical use.

In one embodiment, the preset format processing mode may be to obtain a sentence fragment group to be format-processed; and detecting underlines in the sentence fragment group, and removing the detected underlines from the sentence fragment group.

The sentence fragment group of the sample original text and the sentence fragment group of the sample translation are each taken as a sentence fragment group to be format-processed. Underline detection is performed on the sentence fragment group of the sample original text, and the detected underlines are removed from it; underline detection is likewise performed on the sentence fragment group of the sample translation, and the detected underlines are removed from it. An underline found in a sentence fragment group can be either replaced or deleted. The two sentence fragment groups with their underlines removed form the bilingual sentence pair sample group. This avoids the problem that the translation probabilities of word pairs cannot be counted accurately when the bilingual dictionary is constructed, which would make the dictionary of little practical use, and further improves the precision of sentence alignment based on the bilingual dictionary.

In one embodiment, the preset format processing mode may be to obtain a sentence fragment group to be format processed and a corresponding language type; determining whether the sentence fragment group belongs to a format processing object according to the language type of the sentence fragment group; when the sentence fragment group belongs to the format processing object, underline characters in the sentence fragment group are detected, and the detected underline characters are removed from the sentence fragment group.

The method comprises the steps that a sentence fragment group of a sample original text and a sentence fragment group of a sample translated text are respectively used as sentence fragment groups to be subjected to format processing, and whether the sentence fragment groups of the sample original text belong to format processing objects or not is determined according to the language type of the sentence fragment groups of the sample original text; when the sentence fragment group of the sample original text belongs to the format processing object, underlining characters in the sentence fragment group of the sample original text are detected, and the detected underlining characters are removed from the sentence fragment group of the sample original text. Determining whether the sentence fragment group of the sample translation belongs to a format processing object according to the language type of the sentence fragment group of the sample translation; when the sentence fragment group of the sample translation belongs to the format processing object, underline symbols in the sentence fragment group of the sample translation are detected, and the detected underline symbols are removed from the sentence fragment group of the sample translation.

Whether a sentence fragment group belongs to the format processing objects can be determined from whether its language usually contains spaces. Languages that do not usually contain spaces, such as Chinese, Japanese, Korean and Thai, belong to the format processing objects; languages that usually contain spaces, such as English and French, do not. Underlines are searched for in the sentence fragment groups that belong to the format processing objects, and each underline found can be replaced or deleted. Removing the detected underlines from those sentence fragment groups yields the bilingual sentence pair sample group. This avoids the problem that the translation probabilities of word pairs cannot be counted accurately when the bilingual dictionary is constructed, which would make the dictionary of little practical use, and further improves the precision of sentence alignment using the bilingual dictionary.

Step S332, aligning the bilingual sentence pair sample group through the bilingual word pair extraction algorithm to obtain a bilingual dictionary.

The bilingual word pair extraction algorithm may be a FastAlign algorithm, a hunt dict algorithm, or the like. The language type of the obtained bilingual dictionary corresponds to the sentence alignment parallel corpus sample, such as: when the sentence alignment parallel corpus sample is the sentence alignment parallel corpus of the Chinese-English translation, a bilingual dictionary of the Chinese-English translation is obtained, and when the sentence alignment parallel corpus sample is the sentence alignment parallel corpus of the Chinese-Japanese translation, a bilingual dictionary of the Chinese-Japanese translation is obtained.

The bilingual sentence pair sample group is aligned using the FastAlign algorithm, which outputs a conditional probability matrix, that is, the bilingual dictionary. Its content is a table of the probability that word a in language A is translated into word b in language B, for example: the probability that English "I" is translated into the Chinese word for "I" is 0.95. From the output conditional probability matrix, word pairs with a low probability are removed according to an absolute threshold, which can be set to -9.0 (FastAlign itself uses a relative threshold), and word pairs that do not meet the language requirements are cleaned out, so that a more accurate bilingual dictionary is obtained.
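The absolute-threshold cleaning can be sketched as below. The sketch assumes that each table entry carries a natural-log probability, which is how a negative threshold such as -9.0 can apply to values like log(0.95); the text does not state the scale explicitly. The names `Entry` and `prune_dictionary` are hypothetical, and the language-requirement cleaning is not shown.

```python
import math
from typing import Iterable, List, Tuple

Entry = Tuple[str, str, float]  # (source word, target word, log-probability)


def prune_dictionary(entries: Iterable[Entry], min_logprob: float = -9.0) -> List[Entry]:
    """Keep word pairs whose log-probability reaches the absolute threshold."""
    # Assumption: scores are natural-log probabilities, so -9.0 is about 1.2e-4.
    return [entry for entry in entries if entry[2] >= min_logprob]


table = [
    ("I", "\u6211", math.log(0.95)),   # frequent, reliable pair
    ("I", "\u7684", -12.3),            # rare pair below the threshold
]
kept = prune_dictionary(table)
```

An absolute cut-off treats every source word alike, whereas a relative cut-off keeps pairs in proportion to the best candidate for each source word.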

Step S340, calling a sentence alignment tool, and performing sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain sentence alignment parallel corpus.

The sentence alignment tool may be the HunAlign tool. Based on the corresponding format-processed bilingual dictionary among the multilingual bilingual dictionaries, the sentence alignment tool models the bilingual sentence pair group line by line and calculates scores using features such as sentence length and the number of mutually translated words between two sentences, searches for the sentence alignment relation, and outputs the alignment result with the highest score. The output alignment result comprises the aligned sentence pairs and their scores; the aligned sentence pairs are the sentence-aligned parallel corpora.

The bilingual corpus sentence alignment method comprises: obtaining the parallel texts to be aligned and the language types of the original text and the translated text in them; preprocessing the parallel texts to be aligned to obtain parallel sentence pairs to be aligned; calling a monolingual word segmentation model corresponding to the language type of the original text from a monolingual word segmentation model group trained with the SentencePiece algorithm, and performing word segmentation processing on the original text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the original text to be aligned; calling a monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and performing word segmentation processing on the translated text in the parallel sentence pair to be aligned to obtain a sentence fragment group of the translated text to be aligned; performing format processing on the two sentence fragment groups according to a preset format processing mode to obtain a bilingual sentence pair group, which can improve the sentence alignment precision of the sentence alignment tool; obtaining, based on the preset format processing mode, a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text; and calling a sentence alignment tool to perform sentence alignment processing on the bilingual sentence pair group according to the bilingual dictionary to obtain sentence-aligned parallel corpora.
With monolingual word segmentation models for the various languages trained with the SentencePiece algorithm, a single processing flow can handle bilingual corpus sentence alignment for all required languages. This greatly simplifies the design and the code complexity, and reduces the code coupling, the maintenance difficulty and the maintenance cost. Because the sentence fragments are format-processed and a bilingual dictionary processed in the same format is used for sentence alignment, the precision of word and sentence alignment can be improved and the resulting sentence-aligned parallel corpora are more accurate.

In one embodiment, after the step of calling a sentence alignment tool to perform sentence alignment processing on a bilingual sentence pair group according to a bilingual dictionary to obtain sentence-aligned parallel corpora, the method further includes: and filtering the sentence alignment parallel corpus based on a preset filtering condition to obtain the filtered sentence alignment parallel corpus.

The preset filtering conditions include: (1) Checking whether empty sentences exist in the sentence-aligned parallel corpora, and filtering out the sentences that are aligned with an empty sentence, that is, sentences without a corresponding sentence. (2) Filtering out, according to a preset value, the sentences whose score is smaller than the preset value. Sentences with lower scores are removed according to the HunAlign score, which reflects the alignment quality of the sentences; this is also where the improved precision of the bilingual dictionary takes effect. There are two processing strategies. If many filtering programs follow (for example, other filters based on a machine translation engine), a relatively relaxed threshold such as 0.1 or 0.2 may be selected, which removes only the sentence pairs that are clearly misaligned so as to provide as many sentence pairs as possible for the subsequent filtering. Because HunAlign by default assigns an alignment score of 0.3 to two sentences that have no alignment relationship and a token count of 1, sentence pairs with a score of 0.3 can additionally be removed to reduce possible errors. If not many filtering programs follow, a conservative threshold such as 0.8 can be selected; most of the retained sentence pairs then have high accuracy, but the recall of sentence pairs drops accordingly.
(3) Filtering out, according to the language type of the original text and the language type of the translated text, the sentences whose language does not match. For example, the language type of the original text in the parallel texts to be aligned is Chinese and the language type of the translated text is English, but sentences of other languages are mixed into the text, so the sentence-aligned parallel corpora may contain aligned sentences that are not Chinese-English translations; the sentences in the sentence-aligned parallel corpora are identified, and the aligned sentences that are not Chinese-English translations are filtered out. (4) Filtering out, according to features such as numbers, the sentences that do not match on those features: sentence pairs whose two sentences contain different numbers are removed, which further cleans the parallel sentences.
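The four conditions can be sketched as one pass over scored sentence pairs. The names `filter_pairs`, `same_numbers` and the optional `lang_ok` callback are hypothetical; language identification is delegated to that callback because it needs an external tool. Comparing the sorted lists of digit strings is one simple reading of "different numbers", chosen here as an assumption.

```python
import re
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str, float]  # (original sentence, translated sentence, alignment score)
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def same_numbers(src: str, tgt: str) -> bool:
    """True when both sentences contain the same numbers (order ignored)."""
    return sorted(_NUMBER.findall(src)) == sorted(_NUMBER.findall(tgt))


def filter_pairs(
    pairs: Iterable[Pair],
    min_score: float = 0.2,
    drop_default_score: bool = True,
    lang_ok: Optional[Callable[[str, str], bool]] = None,
) -> List[Pair]:
    kept = []
    for src, tgt, score in pairs:
        if not src.strip() or not tgt.strip():
            continue  # (1) one side is empty
        if score < min_score:
            continue  # (2) score below the preset value
        if drop_default_score and abs(score - 0.3) < 1e-9:
            continue  # (2) score equals the default for unrelated one-token sentences
        if lang_ok is not None and not lang_ok(src, tgt):
            continue  # (3) language does not match the expected pair
        if not same_numbers(src, tgt):
            continue  # (4) numbers differ between the two sentences
        kept.append((src, tgt, score))
    return kept
```

With few downstream filters, `min_score` could be raised toward the conservative value of 0.8 mentioned above, at the cost of recall.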

In one embodiment, the method further comprises: adding the sentence-aligned parallel corpora to the sentence alignment parallel corpus database.

The sentence-aligned parallel corpora obtained by sentence alignment of the parallel texts to be aligned are added to the sentence alignment parallel corpus database under the language pair given by the language type of the original text and the language type of the translated text in the parallel texts to be aligned. The bilingual dictionary can then be trained further to obtain a more accurate bilingual dictionary.

In one embodiment, referring to fig. 5, the bilingual corpus sentence alignment method is applied to obtain sentence-aligned parallel corpora for different language pairs and to fill the sentence alignment parallel corpus database. The sentence-aligned parallel corpora in the database can be used as samples for training, validating or testing a machine translation model, including but not limited to a multilingual machine translation model covering Chinese, English, Japanese, Korean, Russian, Spanish, Thai and Hindi. The resulting machine translation model can be used in products that implement multilingual mutual translation, such as translation software, simultaneous interpretation software, education software, and the like.

Referring to fig. 6, before bilingual corpus sentence alignment is performed, a monolingual word segmentation model group containing a monolingual word segmentation model for each language is obtained by training each model as follows: monolingual data corresponding to the language type of the monolingual word segmentation model to be trained is acquired; the monolingual data is preprocessed to obtain a monolingual data sample; and the monolingual word segmentation model is trained on the monolingual data sample with the SentencePiece algorithm.
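The training preparation described above might look roughly like the following sketch. The preprocessing steps (NFKC normalization, trimming, de-duplication), the `spm_<lang>` file naming, the default vocabulary size and the per-language `character_coverage` values are assumptions for illustration and do not come from the description. The dictionary keys mirror options of the SentencePiece trainer.

```python
import unicodedata


def preprocess_monolingual(lines):
    """Normalize, trim and de-duplicate raw monolingual lines, preserving order."""
    seen, sample = set(), []
    for line in lines:
        text = unicodedata.normalize("NFKC", line).strip()
        if text and text not in seen:
            seen.add(text)
            sample.append(text)
    return sample


def training_config(lang, corpus_path, vocab_size=32000):
    """Build trainer options for one monolingual model (values are assumptions)."""
    return {
        "input": corpus_path,
        "model_prefix": f"spm_{lang}",
        "vocab_size": vocab_size,
        # Slightly lower coverage is commonly used for large character sets.
        "character_coverage": 0.9995 if lang in {"zh", "ja"} else 1.0,
    }


# With the sentencepiece package installed, a config could be passed on as:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(**training_config("zh", "zh.txt"))
```

Training one model per language and keying the resulting files by language code gives the "model group" from which a model is later selected by language type.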

The bilingual dictionary for each language pair is obtained by training in the following manner: acquiring, from the sentence-aligned parallel corpus database, a sentence-aligned parallel corpus sample corresponding to the language types of the bilingual dictionary to be trained, wherein the language types of the bilingual dictionary to be trained comprise the language type of the original corpus and the language type of the translated corpus; preprocessing the sentence-aligned parallel corpus sample to obtain sentence-aligned parallel corpus pairs; calling the monolingual word segmentation model corresponding to the language type of the original corpus from the monolingual word segmentation model group, and performing word segmentation on the original corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample original text; calling the monolingual word segmentation model corresponding to the language type of the translated corpus from the monolingual word segmentation model group, and performing word segmentation on the translated corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample translation; performing format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translation according to a preset format processing mode to obtain a bilingual sentence pair sample group; and aligning the bilingual sentence pair sample group through the FastAlign algorithm to obtain the bilingual dictionary.
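The last step can be sketched as follows. The fast_align tool reads one sentence pair per line in the form `source ||| target` and emits word alignment links in `i-j` form; how the dictionary is then derived from those links is not stated here, so the relative-frequency estimate below is an assumption, and the helper names are hypothetical.

```python
from collections import Counter


def to_fast_align_line(src_pieces, tgt_pieces):
    """Format one segmented sentence pair as a 'source ||| target' input line."""
    return " ".join(src_pieces) + " ||| " + " ".join(tgt_pieces)


def build_dictionary(pairs, alignments, min_prob=0.0):
    """Estimate p(target piece | source piece) from alignment links.

    pairs:      list of (src_pieces, tgt_pieces)
    alignments: list of strings such as '0-0 1-2', one per sentence pair
    """
    pair_counts, src_counts = Counter(), Counter()
    for (src, tgt), links in zip(pairs, alignments):
        for link in links.split():
            i, j = map(int, link.split("-"))
            pair_counts[(src[i], tgt[j])] += 1
            src_counts[src[i]] += 1
    return {
        (s, t): count / src_counts[s]
        for (s, t), count in pair_counts.items()
        if count / src_counts[s] >= min_prob
    }
```

Raising `min_prob` trades dictionary coverage for precision, which matters because the dictionary later drives sentence alignment.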

When bilingual corpus sentence alignment is performed, the following steps are carried out: acquiring a parallel text to be aligned, and the language type of the original text and the language type of the translated text in the parallel text to be aligned; preprocessing the parallel text to be aligned to obtain parallel sentence pairs to be aligned; calling the monolingual word segmentation model corresponding to the language type of the original text from the monolingual word segmentation model group, and performing word segmentation on the original text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the original text to be aligned; calling the monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and performing word segmentation on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translation to be aligned; performing format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translation to be aligned according to a preset format processing mode to obtain a bilingual sentence pair group; acquiring the bilingual dictionary that corresponds to the language type of the original text and the language type of the translated text and is based on the preset format processing mode; and calling a sentence alignment tool to perform sentence alignment on the bilingual sentence pair group according to the bilingual dictionary, so as to obtain sentence-aligned parallel corpora.
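The flow above can be sketched as a small function whose stages are passed in as callables. Everything here is illustrative: `align_corpus` and `coverage_aligner` are hypothetical names, and `coverage_aligner` is only a toy stand-in that scores a pair by dictionary coverage. A real sentence alignment tool does considerably more, for example merging and splitting sentences.

```python
def align_corpus(pairs, src_lang, tgt_lang, segmenters, formatter, dictionaries, aligner):
    """Run segmentation, format processing and alignment on preprocessed pairs."""
    seg_src, seg_tgt = segmenters[src_lang], segmenters[tgt_lang]
    bilingual_pairs = [
        (formatter(seg_src(s), src_lang), formatter(seg_tgt(t), tgt_lang))
        for s, t in pairs
    ]
    # The dictionary is selected by language pair, like the segmentation models.
    dictionary = dictionaries[(src_lang, tgt_lang)]
    return aligner(bilingual_pairs, dictionary)


def coverage_aligner(bilingual_pairs, dictionary, threshold=0.5):
    """Toy aligner: keep pairs where enough source pieces have a dictionary
    entry whose target piece occurs in the target sentence."""
    kept = []
    for src, tgt in bilingual_pairs:
        hits = sum(1 for s in src if any((s, t) in dictionary for t in tgt))
        score = hits / len(src) if src else 0.0
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept
```

Because the language-specific parts are looked up by language code, supporting a new language pair only requires adding entries to `segmenters` and `dictionaries`, not changing the flow.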

The advantages of the method are as follows. Monolingual word segmentation models for the various languages are trained with the SentencePiece algorithm, and both the bilingual word pair extraction algorithm and sentence alignment operate at sentence-fragment granularity. As a result, bilingual word pair extraction is not constrained by the different word segmentation tools used for different languages, and the approach also applies to low-resource languages for which few tools exist. One processing flow can therefore handle sentence-aligned parallel corpora for all required languages, which greatly reduces the design difficulty of the system and the complexity of the code, and lowers code coupling and maintenance difficulty. When the bilingual dictionary is trained, format processing of the sentence fragment groups makes them better suited to the bilingual dictionary extraction process, so that word pair co-occurrence statistics and conditional probability calculations are more accurate and the precision of the bilingual dictionary improves. When bilingual corpus sentence alignment is performed, format processing of the sentence fragment groups further improves the alignment precision of the sentence alignment tool and thus the accuracy of the sentence-aligned parallel corpora.

Fig. 2 is a flowchart illustrating a bilingual corpus sentence alignment method according to an embodiment. It should be understood that, although the steps in the flowchart of fig. 2 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Referring to fig. 7, a bilingual corpus sentence alignment apparatus includes: a parallel text acquisition module 310, a preprocessing module 320, a first word segmentation processing module 330, a second word segmentation processing module 340, a format processing module 350, a bilingual dictionary acquisition module 360, and a sentence alignment processing module 370.

The parallel text obtaining module 310 is configured to obtain a parallel text to be aligned, and a language type of an original text and a language type of a translated text in the parallel text to be aligned.

The preprocessing module 320 is configured to preprocess the parallel texts to be aligned to obtain a pair of parallel sentences to be aligned.

The first word segmentation processing module 330 is configured to call a monolingual word segmentation model corresponding to the language type of the original text from the monolingual word segmentation model group, perform word segmentation processing on the original text in the parallel sentence pair to be aligned, and obtain a sentence fragment group of the original text to be aligned.

The second word segmentation processing module 340 is configured to call the monolingual word segmentation model corresponding to the language type of the translated text from the monolingual word segmentation model group, and to perform word segmentation on the translated text in the parallel sentence pairs to be aligned to obtain a sentence fragment group of the translation to be aligned.

The format processing module 350 is configured to perform format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translation to be aligned according to a preset format processing mode, so as to obtain a bilingual sentence pair group.

The bilingual dictionary obtaining module 360 is configured to obtain a bilingual dictionary corresponding to the language type of the original text and the language type of the translated text based on a preset format processing manner.

The sentence alignment processing module 370 is configured to call a sentence alignment tool and perform sentence alignment on the bilingual sentence pair group according to the bilingual dictionary, so as to obtain sentence-aligned parallel corpora.

The monolingual word segmentation models in the monolingual word segmentation model group are obtained by training in the following manner: acquiring monolingual data corresponding to the language type of the monolingual word segmentation model to be trained; preprocessing the monolingual data to obtain a monolingual data sample; and training the monolingual word segmentation model on the monolingual data sample with the SentencePiece algorithm to obtain the monolingual word segmentation model.

Referring to fig. 8, in an embodiment, the bilingual corpus sentence alignment apparatus further includes a bilingual dictionary training module 380, configured to: acquire, from the sentence-aligned parallel corpus database, a sentence-aligned parallel corpus sample corresponding to the language types of the bilingual dictionary to be trained, where the language types of the bilingual dictionary to be trained include the language type of the original corpus and the language type of the translated corpus; preprocess the sentence-aligned parallel corpus sample to obtain sentence-aligned parallel corpus pairs; call the monolingual word segmentation model corresponding to the language type of the original corpus from the monolingual word segmentation model group, and perform word segmentation on the original corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample original text; call the monolingual word segmentation model corresponding to the language type of the translated corpus from the monolingual word segmentation model group, and perform word segmentation on the translated corpus in the sentence-aligned parallel corpus pairs to obtain a sentence fragment group of the sample translation; perform format processing on the sentence fragment group of the sample original text and the sentence fragment group of the sample translation to obtain a bilingual sentence pair sample group; and align the bilingual sentence pair sample group through a bilingual word pair extraction algorithm to obtain the bilingual dictionary.

In one embodiment, the format processing module 350 is further configured to: obtain a sentence fragment group to be formatted; detect underline characters in the sentence fragment group; and remove the detected underline characters from the sentence fragment group.

In one embodiment, the format processing module 350 is further configured to: obtain a sentence fragment group to be formatted and its corresponding language type; determine, according to the language type of the sentence fragment group, whether the sentence fragment group is a format processing object; and, when the sentence fragment group is a format processing object, detect underline characters in the sentence fragment group and remove the detected underline characters from the sentence fragment group.
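A minimal sketch of both variants of format processing follows. It assumes that the "underline character" is the word-boundary symbol "▁" (U+2581) that SentencePiece attaches to pieces, and the set of language codes treated as format processing objects is purely hypothetical; which languages qualify is not stated in this section.

```python
MARKER = "\u2581"  # assumed to be the "underline character" in the pieces
FORMAT_LANGS = {"zh", "ja", "th"}  # hypothetical set of format processing objects


def strip_markers(pieces, lang=None):
    """Remove the marker from every piece and drop pieces that become empty.

    With lang=None the marker is always removed (first variant); otherwise it
    is removed only when lang is a format processing object (second variant).
    """
    if lang is not None and lang not in FORMAT_LANGS:
        return list(pieces)
    cleaned = (piece.replace(MARKER, "") for piece in pieces)
    return [piece for piece in cleaned if piece]
```

Dropping the pieces that consist of the marker alone keeps them from being counted as tokens during word pair extraction and sentence alignment.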

In an embodiment, the bilingual corpus sentence alignment apparatus further includes a corpus filtering module 390, configured to filter the sentence alignment parallel corpus based on a preset filtering condition, so as to obtain a filtered sentence alignment parallel corpus.

In one embodiment, the corpus filtering module 390 is further configured to: analyze whether empty sentences exist in the sentence-aligned parallel corpora and filter out the empty sentences; filter out, according to a preset value, sentences in the sentence-aligned parallel corpora whose score is smaller than the preset value; filter out, according to the language type of the original text and the language type of the translated text, sentences in the sentence-aligned parallel corpora whose language is inconsistent with those language types; and filter out, according to features such as numbers, sentences in the sentence-aligned parallel corpora that do not conform to those features.
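The first two conditions handled by the corpus filtering module can be sketched in a few lines. The tuple layout `(source, target, score)` and the function name `filter_by_score` are assumptions; the score is taken to be whatever confidence value the sentence alignment tool reports for a pair.

```python
def filter_by_score(scored_pairs, min_score):
    """Drop pairs with an empty side and pairs scoring below min_score.

    scored_pairs: iterable of (source sentence, target sentence, score)
    """
    return [
        (src, tgt, score)
        for src, tgt, score in scored_pairs
        if src.strip() and tgt.strip() and score >= min_score
    ]
```

Running this inexpensive check first reduces the number of pairs that the costlier language and number checks have to examine.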

In one embodiment, the bilingual corpus sentence alignment apparatus further includes an adding module 400 for adding sentence-aligned parallel corpora to the sentence-aligned parallel corpus.

FIG. 9 is a diagram illustrating an internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 (or the server 120) in fig. 1. As shown in fig. 9, the computer apparatus includes a processor, a memory, a network interface, an input device, and a display screen connected through a system bus. Wherein the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement a bilingual corpus sentence alignment method. The internal memory may also have a computer program stored therein, which when executed by the processor, causes the processor to perform a bilingual corpus alignment method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.

Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, the bilingual corpus sentence alignment apparatus provided in the present application may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 9. The memory of the computer device may store the program modules constituting the bilingual corpus sentence alignment apparatus, such as the parallel text acquisition module 310, the preprocessing module 320, the first word segmentation processing module 330, the second word segmentation processing module 340, the format processing module 350, the bilingual dictionary acquisition module 360, and the sentence alignment processing module 370 shown in fig. 7. The computer program constituted by these program modules causes the processor to perform the steps of the bilingual corpus sentence alignment method of the embodiments of the present application described in this specification.

For example, the computer device shown in fig. 9 may perform step S220 through the parallel text acquisition module 310 in the bilingual corpus sentence alignment apparatus shown in fig. 7. The computer device may perform step S240 through the preprocessing module 320. The computer device may perform step S260 through the first word segmentation processing module 330. The computer device may perform step S280 through the second word segmentation processing module 340. The computer device may perform step S300 through the format processing module 350. The computer device may perform step S320 through the bilingual dictionary acquisition module 360. The computer device may perform step S340 through the sentence alignment processing module 370.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the above-described bilingual corpus sentence alignment method. Here, the steps of the bilingual corpus sentence alignment method may be steps in the bilingual corpus sentence alignment methods of the above embodiments.

In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the above bilingual corpus sentence alignment method. Here, the steps of the bilingual corpus sentence alignment method may be steps in the bilingual corpus sentence alignment methods of the above embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to be within the scope of this specification.

The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims

1. A bilingual corpus sentence alignment method comprises the following steps:

according to a preset format processing mode, carrying out format processing on the sentence fragment group of the original text to be aligned and the sentence fragment group of the translated text to be aligned to obtain a bilingual sentence pair group;

preprocessing the monolingual data to obtain a monolingual data sample;

2. The method of claim 1, wherein the bilingual dictionary training method comprises:

3. The method according to claim 1 or 2, wherein the predetermined format processing manner comprises:

obtaining a sentence fragment group to be formatted;

4. The method according to claim 1 or 2, wherein the predetermined format processing manner comprises:

5. The method according to claim 1, wherein said invoking a sentence alignment tool for performing sentence alignment processing on said set of bilingual sentence pairs according to said bilingual dictionary to obtain sentence-aligned parallel corpus further comprises:

6. The method of claim 5, wherein the preset filtering condition comprises at least one of the following conditions:

7. The method of claim 2, further comprising:

8. A bilingual corpus sentence alignment apparatus, comprising:

preprocessing the monolingual data to obtain a monolingual data sample;

9. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 7.

10. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.