CN110489624B

CN110489624B - Method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors

Info

Publication number: CN110489624B
Application number: CN201910628354.3A
Authority: CN
Inventors: 余正涛; 黄继豪; 线岩团; 郭军军; 翟家欣; 文永华; 高盛祥
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-07-12
Filing date: 2019-07-12
Publication date: 2022-07-19
Anticipated expiration: 2039-07-12
Also published as: CN110489624A

Abstract

The invention relates to a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, belonging to the technical field of natural language processing. First, parallel and non-parallel Chinese-Vietnamese sentence pairs are collected and preprocessed as training and test corpora, together with the comparable corpora from which pseudo-parallel sentence pairs are to be extracted. The parts of speech that differ most between Chinese and Vietnamese syntax are labeled. The external features of the sentences and the Chinese-Vietnamese syntactic difference features are then fused in the embedding layer. The output of the embedding layer is passed through a neural network to obtain a sentence feature vector, and a pseudo-parallel corpus extraction model is trained through the computation of a classification layer. Finally, the trained model extracts the Chinese-Vietnamese pseudo-parallel corpus. The invention can effectively extract Chinese-Vietnamese pseudo-parallel sentence pairs from a Chinese-Vietnamese comparable corpus with high accuracy.

Description

Method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors

Technical Field

The invention relates to a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, belonging to the technical field of natural language processing.

Background

Data-driven machine translation (statistical machine translation and neural machine translation) places high demands on the amount of data used to train the model. Neural machine translation in particular achieves good results for language pairs with large-scale corpora, such as English-French and Chinese-English. For language pairs with scarce resources and small corpora, such as Chinese-Vietnamese, translation performance is still far from ideal. Extracting Chinese-Vietnamese pseudo-parallel sentence pairs therefore has important application prospects.

Existing work extracts parallel sentence pairs with neural network structures, extracts parallel sentence pairs from monolingual corpora based on word embeddings to improve neural machine translation, and uses sentence vectors to select out-of-domain sentence pairs that are relevant to the in-domain data to improve in-domain translation. These methods extract pseudo-parallel sentence pairs effectively and improve machine translation performance, but most of them decide whether two sentences are parallel by comparing them at the word level, which makes it hard to capture features of the sentences as a whole.

A large amount of Chinese-Vietnamese comparable data, such as Chinese and Vietnamese Wikipedia data, can be crawled from the web, and these comparable corpora contain Chinese-Vietnamese pseudo-parallel sentence pairs. Given comparable corpora, obtaining the pseudo-parallel sentence pairs from them is one of the difficulties and key technologies of this task. The invention therefore addresses the problem of extracting Chinese-Vietnamese pseudo-parallel sentence pairs from a Chinese-Vietnamese comparable corpus, and provides a method for doing so based on sentence feature vectors.

Disclosure of Invention

The invention provides a method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, which addresses the low accuracy of extracting Chinese-Vietnamese pseudo-parallel sentence pairs from a Chinese-Vietnamese comparable corpus.

The technical scheme of the invention is as follows: the method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors comprises the following specific steps:

Step1, corpus collection and preprocessing: collecting and preprocessing parallel and non-parallel Chinese-Vietnamese sentence pairs as training and test corpora, together with comparable corpora from which pseudo-parallel sentence pairs are to be extracted;

as a preferred embodiment of the present invention, Step1 specifically comprises the following steps:

Step1.1, crawling Chinese-Vietnamese parallel sentence pairs and non-parallel sentence pairs of a certain scale from the Internet as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, with each sentence pair followed by a classification label indicating whether the pair is parallel; setting aside a small part of the training data as a test set; then crawling comparable corpora;

Step1.2, manually screening the crawled corpora, then annotating them with position labels and sentence labels; and screening the comparable corpora to reduce the number of model computations and the time complexity.

In Step1.2, the specific process of screening the comparable corpora is as follows:

the Chinese-Vietnamese pseudo-parallel corpus extraction model converts the extraction of pseudo-parallel sentence pairs into a binary classification problem, and the Chinese-Vietnamese comparable corpus is large in scale; pre-trained Chinese word embeddings are therefore projected into the Vietnamese word embedding space, so that Chinese and Vietnamese can be represented in the same space;

S^emb＝(1/|S|) Σ_{i=1}^{|S|} w_i (1)

equation 1 is the sentence embedding representation, where |S| is the length of the sentence and w_i is the word embedding of the i-th word of the sentence S in the shared Chinese-Vietnamese space;

S(x，y)＝Φ(x^emb，y^emb) (2)

in equation 2, Φ(x^emb，y^emb) is the cosine similarity of sentence x and sentence y in the same language space; each sentence is represented by the average of the word embeddings of its words, namely the sentence embedding S; the sentence embeddings are used to compute the similarity of every candidate Chinese-Vietnamese sentence pair, giving a score S(x, y), and for each Chinese sentence the 10 closest Vietnamese sentences are retained;

because the length ratio of parallel Chinese-Vietnamese sentence pairs falls within a certain range, a pair whose length ratio lies outside that range is unlikely to be parallel; the range of length ratios is therefore computed from the Chinese-Vietnamese parallel corpus, and sentence pairs outside the range are removed, which completes the screening of the comparable corpus.
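The screening described above (averaged word embeddings, cosine similarity, keeping the 10 nearest Vietnamese sentences, then the length-ratio filter) could be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the toy two-dimensional vectors, the choice to skip out-of-vocabulary tokens, and the use of the minimum and maximum observed ratio as the "range" are all assumptions made for the example.

```python
import math

def sentence_embedding(tokens, word_vectors):
    """Average the vectors of the tokens found in the shared space (assumption: OOV tokens are skipped)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def length_ratio_range(parallel_pairs):
    """Assumed statistic: min and max Chinese/Vietnamese length ratio seen in the parallel corpus."""
    ratios = [len(zh) / len(vi) for zh, vi in parallel_pairs]
    return min(ratios), max(ratios)

def screen(zh_sents, vi_sents, word_vectors, ratio_range, top_k=10):
    """Keep, per Chinese sentence, the top_k most similar Vietnamese sentences within the length-ratio range."""
    lo, hi = ratio_range
    vi_embs = [sentence_embedding(s, word_vectors) for s in vi_sents]
    kept = {}
    for i, zh in enumerate(zh_sents):
        zh_emb = sentence_embedding(zh, word_vectors)
        if zh_emb is None:
            continue
        scored = [(cosine(zh_emb, e), j) for j, e in enumerate(vi_embs) if e is not None]
        scored.sort(reverse=True)
        top = scored[:top_k]
        kept[i] = [(s, j) for s, j in top if lo <= len(zh) / len(vi_sents[j]) <= hi]
    return kept

# Toy data: hand-made vectors standing in for projected pre-trained embeddings.
word_vectors = {
    "我": [1.0, 0.0], "tôi": [1.0, 0.0],
    "爱": [1.0, 1.0], "yêu": [1.0, 1.0],
    "猫": [0.0, 1.0], "mèo": [0.0, 1.0],
}
# Placeholder token lists used only to derive a length-ratio range.
parallel = [(["a", "b"], ["x", "y", "z"]), (["a", "b", "c"], ["x", "y"])]
ratio_range = length_ratio_range(parallel)
zh_sents = [["我", "爱", "猫"]]
vi_sents = [["tôi", "yêu", "mèo"], ["mèo"]]
kept = screen(zh_sents, vi_sents, word_vectors, ratio_range)
```

In this toy run the one-word Vietnamese sentence survives the similarity ranking but is removed by the length-ratio filter, so only the three-word sentence remains as a candidate.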

The comparable corpora are screened to reduce the number of model computations and the time complexity. The model converts the extraction of pseudo-parallel sentence pairs into a binary classification problem. If the Chinese-Vietnamese comparable corpus consists of 10⁶ Chinese sentences and 10⁶ Vietnamese sentences, 10⁶×10⁶ classification computations, each with a costly sentence feature vector computation, would be required, so pseudo-parallel sentence pairs could not be extracted from the comparable corpus quickly. Screening selects from the comparable corpus the Chinese-Vietnamese sentence pairs that are more likely to be translations of each other, which greatly reduces the number of model computations and the time complexity.
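The saving can be made concrete with a short calculation. The sketch below assumes that, after screening, at most 10 Vietnamese candidates remain per Chinese sentence (the top-10 retention described earlier); the variable names are illustrative only.

```python
n_zh = 10**6   # Chinese sentences in the comparable corpus
n_vi = 10**6   # Vietnamese sentences in the comparable corpus
top_k = 10     # candidates retained per Chinese sentence by the screening step

exhaustive = n_zh * n_vi           # classifier calls without screening
screened = n_zh * top_k            # upper bound on classifier calls after screening
reduction = exhaustive // screened # factor by which the classifier workload shrinks

print(exhaustive, screened, reduction)
```

The expensive classifier is then run on at most 10⁷ pairs instead of 10¹², a reduction by a factor of 10⁵. The cheaper cosine comparison used for screening still has to consider the candidate pairs.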

Step2, selecting the Chinese-Vietnamese syntactic difference features: based on the postposed modifiers that characterize the syntactic differences between Chinese and Vietnamese, the parts of speech that differ most between the syntax of the two languages are labeled;

Chinese sentences and Vietnamese sentences with simple syntactic structures have basically the same word order, and the most basic order of syntactic components is subject-verb-object (SVO) or subject-verb-complement (SVP); the biggest difference between Chinese and Vietnamese is the order of modifiers and the head word, where the modifiers comprise attributives and adverbials; compared with Chinese, Vietnamese places its modifiers after the head word;

because the biggest difference between Chinese and Vietnamese is the order of modifiers (attributives and adverbials) and the head word, and because the attributives and adverbials of the two languages consist of verbs, adverbs, adjectives and nouns, these parts of speech, which differ most between Chinese and Vietnamese syntax, are the ones that are labeled.

In other words, the word order of simple Chinese and Vietnamese sentences is basically the same (SVO or SVP), and the two languages differ mainly in where modifiers are placed relative to the head word. Words marked with these parts of speech are therefore easier to identify within a sentence. The labels serve as the Chinese-Vietnamese syntactic difference features and are fused into the embedding layer, which can improve the accuracy of pseudo-parallel sentence pair extraction.

The Chinese and Vietnamese examples in Table 1 show the syntactic difference features, which are mainly differences in word order. Step2 therefore labels verbs, adverbs, adjectives and nouns in both Chinese and Vietnamese.

TABLE 1 Chinese-Vietnamese syntactic difference features
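As an illustration of the labeling in Step2, the sketch below maps already-tagged tokens to part-of-speech feature ids, keeping only the four classes named in the text and sending everything else to a catch-all id. The tag names, the id values and the hand-tagged example phrase are assumptions for illustration; the source does not specify a tagger or a tag set.

```python
# Hypothetical id scheme: only the four parts of speech named in Step2 get their own id.
POS_IDS = {"OTHER": 0, "VERB": 1, "ADV": 2, "ADJ": 3, "NOUN": 4}

def pos_feature_ids(tagged_tokens):
    """Map (token, tag) pairs to POS feature ids; tags outside the four classes become OTHER."""
    return [POS_IDS.get(tag, POS_IDS["OTHER"]) for _, tag in tagged_tokens]

# Hand-tagged toy phrase "red book": the modifier precedes the noun in Chinese
# and follows it in Vietnamese.
zh_tagged = [("红色", "ADJ"), ("的", "PART"), ("书", "NOUN")]
vi_tagged = [("sách", "NOUN"), ("đỏ", "ADJ")]

zh_ids = pos_feature_ids(zh_tagged)
vi_ids = pos_feature_ids(vi_tagged)
print(zh_ids, vi_ids)
```

The id sequences make the word-order difference visible: the adjective id comes before the noun id on the Chinese side and after it on the Vietnamese side.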

Step3, constructing the embedding layer of the Chinese-Vietnamese pseudo-parallel corpus extraction model, and fusing the external features of the sentences and the Chinese-Vietnamese syntactic difference features in the embedding layer;

in the embedding layer, the neural network does not by itself capture position information when it processes a sequence; that is, two sequences composed of the same elements in different orders yield the same result after passing through the neural network;
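The order-insensitivity problem can be shown with a deliberately small stand-in network: a per-token ReLU followed by a sum over tokens. This is not the network used in the invention; the vectors and the `encode` helper are assumptions chosen so the arithmetic is easy to follow.

```python
token_vec = {"狗": [1.0, -1.0], "咬": [-1.0, 1.0], "人": [0.5, 0.5]}
pos_vec = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy position embeddings

def encode(tokens, use_position):
    """Per-token ReLU followed by a sum: a position-free stand-in for a sequence encoder."""
    pooled = [0.0, 0.0]
    for i, tok in enumerate(tokens):
        v = token_vec[tok]
        if use_position:
            v = [a + b for a, b in zip(v, pos_vec[i])]
        h = [max(0.0, x) for x in v]  # ReLU
        pooled = [p + x for p, x in zip(pooled, h)]
    return pooled

a = ["狗", "咬", "人"]  # "dog bites man"
b = ["人", "咬", "狗"]  # "man bites dog"

same_without_position = encode(a, False) == encode(b, False)
same_with_position = encode(a, True) == encode(b, True)
print(same_without_position, same_with_position)
```

Without position information the two sentences are indistinguishable to this encoder; once a position vector is added to each token before the nonlinearity, the outputs differ.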

as a preferred embodiment of the present invention, Step3 comprises the following specific steps:

Step3.1, to solve the insensitivity to word position and to let the model distinguish the two sentences, a position embedding layer (Position Embedding) and a sentence embedding layer (Segment Embedding) are added to the embedding layer;

Step3.2, the sentence features are vectorized in the embedding layer, and the external features of the sentences and the Chinese-Vietnamese syntactic difference features are then fused by vector addition to form the output of the embedding layer; the external features of a sentence comprise the traditional word embedding, the sentence segmentation feature and the position information feature, and the Chinese-Vietnamese syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese. Fusing the four embeddings, including the part-of-speech information, lets the model process the category features more effectively.

In formula 3, E is the embedding of each word output by the embedding layer; it is the sum of four vectors: the traditional word embedding E_token, the sentence segmentation feature E_S, the position information feature E_P, and the part-of-speech feature E_POS that carries the Chinese-Vietnamese syntactic difference features:

E＝E_token+E_S+E_P+E_POS (3)
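A minimal sketch of formula 3 for a spliced Chinese-Vietnamese pair is given below. The `[CLS]` and `[SEP]` markers, the two-dimensional integer lookup tables and the function names are assumptions borrowed from common practice for paired-sentence inputs; the source only states that the four vectors are added.

```python
def build_inputs(zh_tokens, vi_tokens, zh_pos, vi_pos):
    """Splice the pair into one sequence with segment ids, position ids and POS tags."""
    tokens = ["[CLS]"] + zh_tokens + ["[SEP]"] + vi_tokens + ["[SEP]"]
    segments = [0] * (len(zh_tokens) + 2) + [1] * (len(vi_tokens) + 1)
    positions = list(range(len(tokens)))
    pos_tags = ["OTHER"] + zh_pos + ["OTHER"] + vi_pos + ["OTHER"]
    return tokens, segments, positions, pos_tags

def embed(tokens, segments, positions, pos_tags, tables):
    """E = E_token + E_S + E_P + E_POS, computed element-wise for every token."""
    out = []
    for tok, seg, p, tag in zip(tokens, segments, positions, pos_tags):
        parts = (tables["token"][tok], tables["segment"][seg],
                 tables["position"][p], tables["pos"][tag])
        out.append([sum(vals) for vals in zip(*parts)])
    return out

# Toy lookup tables with integer entries so the sums are easy to check by hand.
vocab = ["[CLS]", "[SEP]", "书", "sách"]
tables = {
    "token": {t: [i, 0] for i, t in enumerate(vocab)},
    "segment": {0: [0, 10], 1: [0, 20]},
    "position": {p: [100 * p, 0] for p in range(16)},
    "pos": {"OTHER": [0, 0], "VERB": [0, 1], "ADV": [0, 2], "ADJ": [0, 3], "NOUN": [0, 4]},
}

inputs = build_inputs(["书"], ["sách"], ["NOUN"], ["NOUN"])
E = embed(*inputs, tables)
print(E[1], E[3])
```

In a trained model the four tables would be learned parameters of equal dimension; the fixed integers here only make the addition visible.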

Step4, training the Chinese-Vietnamese pseudo-parallel corpus extraction model: the output of the Step3 embedding layer is passed through a neural network to obtain a sentence feature vector, and the pseudo-parallel corpus extraction model is then trained through the computation of a classification layer; the model can judge whether a sentence pair is a pseudo-parallel sentence pair, so that the Chinese-Vietnamese pseudo-parallel corpus can finally be extracted from the Chinese-Vietnamese comparable corpus;

the method adopts a neural network structure based on a self-attention mechanism, which can quickly extract the important features of sparse data. It extracts the features of the sentence itself, and the resulting sentence representation can be applied to further downstream tasks with good effect. It can also determine which information in the source-language and target-language sentences is most relevant to each element, and gives higher weight to the more useful information.

Step5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from the comparable corpus: the trained Chinese-Vietnamese pseudo-parallel corpus extraction model is used to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpus.
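Step5 amounts to running the trained classifier over the screened candidate pairs and keeping those judged parallel. The sketch below assumes the model is exposed as a callable returning a probability and that a 0.5 threshold is used; neither detail is given in the source, and the scorer here is a hypothetical stand-in backed by a lookup table.

```python
def extract_pseudo_parallel(candidates, score_pair, threshold=0.5):
    """Keep candidate (zh, vi) pairs whose predicted probability of being parallel reaches the threshold."""
    extracted = []
    for zh, vi in candidates:
        p = score_pair(zh, vi)
        if p >= threshold:
            extracted.append((zh, vi, p))
    return extracted

# Hypothetical stand-in for the trained model: fixed scores for two toy candidates.
toy_scores = {("我爱猫", "tôi yêu mèo"): 0.91, ("我爱猫", "mèo"): 0.12}

def toy_scorer(zh, vi):
    return toy_scores[(zh, vi)]

pairs = extract_pseudo_parallel(list(toy_scores), toy_scorer)
print(pairs)
```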

The beneficial effects of the invention are:

1. whether two Chinese-Vietnamese sentences are parallel is judged mainly by comparing the features of the sentences in a Chinese-Vietnamese bilingual space; because the sentence feature vector captures features of the sentence as a whole, the accuracy of judging whether a Chinese-Vietnamese sentence pair is parallel is improved;

2. the method splices each Chinese-Vietnamese sentence pair into a single input, fuses the external features of the sentence pair and the Chinese-Vietnamese syntactic difference features in the model, obtains a sentence feature vector through the neural network computation, and finally feeds the sentence feature vector into the classification layer to obtain the final result.

Drawings

FIG. 1 is a detailed flow chart of the present invention;

FIG. 2 is a diagram of the embedding layer of the model proposed by the present invention;

FIG. 3 is a structural diagram of the model proposed by the present invention for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors.

Detailed Description

Example 1: as shown in FIGS. 1-3, the method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors proceeds as follows:

Step1, crawling 120,000 Chinese-Vietnamese parallel sentence pairs and 120,000 non-parallel Chinese-Vietnamese sentence pairs from the Internet as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, with each sentence pair followed by a classification label indicating whether the pair is parallel. 5,000 Chinese-Vietnamese sentence pairs are taken from the training data as a test set, of which 2,500 are parallel sentence pairs and 2,500 are non-parallel sentence pairs, as shown in Table 2; comparable corpora are then crawled;

TABLE 2 Experimental data
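A balanced test set of the kind described (equal numbers of parallel and non-parallel pairs) could be drawn as sketched below. The function name, the fixed random seed and the decision to remove the test pairs from the training data are assumptions; the source only states the sizes.

```python
import random

def split_balanced(pairs, test_per_class, seed=0):
    """pairs: list of (zh, vi, label) with label 1 = parallel, 0 = non-parallel."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):
        group = [p for p in pairs if p[2] == label]
        rng.shuffle(group)
        test.extend(group[:test_per_class])   # same number of test pairs per class
        train.extend(group[test_per_class:])  # assumption: test pairs are held out of training
    return train, test

# Toy data: 20 labeled pairs, half parallel and half non-parallel.
toy_pairs = [(f"zh{i}", f"vi{i}", i % 2) for i in range(20)]
train, test = split_balanced(toy_pairs, test_per_class=3)
print(len(train), len(test))
```

With the figures in the example, `test_per_class` would be 2,500 for a 5,000-pair test set.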

The crawled corpora are screened manually and then annotated with position labels and sentence labels; the comparable corpora are then screened to reduce the number of model computations and the time complexity.

The specific process of screening the comparable corpora is as follows:

S(x，y)＝Φ(x^emb，y^emb) (2)

in equation 2, Φ(x^emb，y^emb) is the cosine similarity of sentence x and sentence y in the same language space; each sentence is represented by the average of the word embeddings of its words, namely the sentence embedding S; the sentence embeddings are used to compute the similarity of every candidate Chinese-Vietnamese sentence pair, giving a score S(x, y), and for each Chinese sentence the 10 closest Vietnamese sentences are retained;

Chinese sentences and Vietnamese sentences with simple syntactic structures have basically the same word order, and the most basic order of syntactic components is subject-verb-object (SVO) or subject-verb-complement (SVP); the biggest difference between Chinese and Vietnamese is the order of modifiers and the head word, where the modifiers comprise attributives and adverbials; compared with Chinese, Vietnamese places its modifiers after the head word;

because the biggest difference between Chinese and Vietnamese is the order of modifiers (attributives and adverbials) and the head word, and because the attributives and adverbials of the two languages consist of verbs, adverbs, adjectives and nouns, labeling the parts of speech that differ most between Chinese and Vietnamese syntax means labeling the verbs, adverbs, adjectives and nouns in Chinese and Vietnamese.

in the embedding layer, the neural network does not by itself capture position information when it processes a sequence; that is, two sequences composed of the same elements in different orders yield the same result after passing through the neural network;

Step3.1, to solve the insensitivity to word position and to let the model distinguish the two sentences, a position embedding layer (Position Embedding) and a sentence embedding layer (Segment Embedding) are added to the embedding layer;

Step3.2, the sentence features are vectorized in the embedding layer, and the external features of the sentences and the Chinese-Vietnamese syntactic difference features are then fused by vector addition to form the output of the embedding layer; the external features of a sentence comprise the traditional word embedding, the sentence segmentation feature and the position information feature, and the Chinese-Vietnamese syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese.

E＝E_token+E_S+E_P+E_POS (3)

Step4, training the Chinese-Vietnamese pseudo-parallel corpus extraction model: the output of the Step3 embedding layer is passed through a neural network to obtain a sentence feature vector, and the pseudo-parallel corpus extraction model is then trained through the computation of a classification layer;

the method adopts a neural network structure based on a self-attention mechanism, which can quickly extract the important features of sparse data.

When the sentence feature vector is obtained through the neural network in Step4, the neural network used is a self-attention neural network, which can effectively extract the features of the sentence itself; the feature vector produced by self-attention is based on the sentence itself, and the specific process is as follows:

A. the elements of the sentence formed by splicing the Chinese and Vietnamese sentences are treated as a series of (key, value) pairs and a series of queries;

the similarity between each query and each key is computed to obtain a weight coefficient, where common similarity functions include the dot product, concatenation and a perceptron;

B. the weight coefficients are normalized with a softmax function;

C. finally, the values are summed, weighted by the corresponding weight coefficients, to obtain the final attention result;

the calculation formula is as follows:

Attention(Q，K，V)＝softmax(QK^T/√d_k)V

Q, K and V are the queries, keys and values of the sentence respectively, and d_k is the dimension of the Q, K and V vectors.
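Steps A to C can be sketched in plain Python, assuming the standard scaled dot-product form softmax(QK^T/√d_k)V, which the definition of d_k suggests. The learned projection matrices that would normally produce Q, K and V from the embedding output are omitted, so this illustrates the computation only and is not the trained model.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # A: similarity between the query and every key, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # B: normalize the weight coefficients with softmax
        weights = softmax(scores)
        # C: weighted sum of the values
        out.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return out

# Self-attention: queries, keys and values all come from the same spliced sequence.
X = [[1.0, 0.0], [0.0, 1.0]]
self_attended = attention(X, X, X)

# Identical keys give uniform weights, so the output is the plain average of the values.
uniform = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0, 0.0], [0.0, 2.0]])
print(self_attended, uniform)
```

Each output row is a mixture of the value vectors in which the element most similar to the query receives the largest weight.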

Step5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from the comparable corpus: the trained Chinese-Vietnamese pseudo-parallel corpus extraction model is used to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpus. Table 3 shows pseudo-parallel sentence pairs extracted from the Chinese-Vietnamese comparable corpus, and Table 4 shows the accuracy of different methods for extracting Chinese-Vietnamese pseudo-parallel sentence pairs from the comparable corpus.

TABLE 3 Chinese-Vietnamese pseudo-parallel sentence pairs extracted from the comparable corpus

TABLE 4 Experimental results for different extraction methods

Method	Accuracy (%)
LSTM	59.16
LSTM+POS	60.45
Self-attention	62.94
Self-attention+POS	63.32

The data above compare different ways of building the model for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors. With LSTM alone (Long Short-Term Memory, a recurrent neural network suited to processing and predicting events separated by relatively long intervals and delays in a time series), the accuracy is lowest at 59.16%. With LSTM+POS (LSTM plus part-of-speech features), the accuracy is 60.45%. With self-attention alone, the accuracy is 62.94%. With Self-attention+POS (self-attention plus part-of-speech features), the configuration of the present invention, the accuracy is highest, reaching 63.32%.
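The differences between the rows of Table 4 can be read off with a short calculation. The figures are those reported in the table; the variable names are illustrative.

```python
accuracy = {
    "LSTM": 59.16,
    "LSTM+POS": 60.45,
    "Self-attention": 62.94,
    "Self-attention+POS": 63.32,
}

# Gain in percentage points from adding part-of-speech features to each encoder
pos_gain_lstm = round(accuracy["LSTM+POS"] - accuracy["LSTM"], 2)
pos_gain_attn = round(accuracy["Self-attention+POS"] - accuracy["Self-attention"], 2)
# Gain from replacing LSTM with self-attention, without part-of-speech features
encoder_gain = round(accuracy["Self-attention"] - accuracy["LSTM"], 2)

print(pos_gain_lstm, pos_gain_attn, encoder_gain)
```

On these figures, the part-of-speech features add 1.29 points to the LSTM model and 0.38 points to the self-attention model, while switching from LSTM to self-attention adds 3.78 points.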

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A method for extracting Chinese-Vietnamese pseudo-parallel sentence pairs based on sentence feature vectors, characterized in that the method comprises the following specific steps:

Step1, corpus collection and preprocessing: collecting and preprocessing parallel and non-parallel Chinese-Vietnamese sentence pairs as training and test corpora, together with comparable corpora from which pseudo-parallel sentence pairs are to be extracted;

Step2, selecting the Chinese-Vietnamese syntactic difference features: based on the postposed modifiers that characterize the syntactic differences between Chinese and Vietnamese, labeling the parts of speech that differ most between the syntax of the two languages;

Step5, extracting Chinese-Vietnamese pseudo-parallel sentence pairs from the comparable corpus: using the trained Chinese-Vietnamese pseudo-parallel corpus extraction model to extract Chinese-Vietnamese pseudo-parallel sentence pairs from the Chinese-Vietnamese comparable corpus;

the specific steps of Step1 are as follows:

Step1.1, using a crawler to crawl Chinese-Vietnamese parallel sentence pairs and non-parallel Chinese-Vietnamese sentence pairs as training data for the Chinese-Vietnamese pseudo-parallel sentence pair extraction model, wherein each sentence pair carries a classification label indicating whether the pair is parallel; selecting a test set from the training data; and crawling comparable corpora;

Step1.2, manually screening the crawled corpora, then annotating them with position labels and sentence labels; then screening the comparable corpora to reduce the number of model computations and the time complexity;

in Step1.2, the specific process of screening the comparable corpora is as follows:

S(x,y)＝Φ(x^emb,y^emb) (2)

in equation 2, Φ(x^emb,y^emb) is the cosine similarity of sentence x and sentence y in the same language space; each sentence is represented by the average of the word embeddings of its words, namely the sentence embedding S; the sentence embeddings are used to compute the similarity of every candidate Chinese-Vietnamese sentence pair, giving a score S(x, y), and for each Chinese sentence the 10 closest Vietnamese sentences are retained;

because the length ratio of parallel Chinese-Vietnamese sentence pairs falls within a certain range, a pair whose length ratio lies outside that range is unlikely to be parallel; the range of length ratios is computed from the Chinese-Vietnamese parallel corpus, and sentence pairs outside the range are removed, thereby screening the comparable corpus;

in Step2:

because the biggest difference between Chinese and Vietnamese is the order of modifiers and the head word, where the modifiers comprise attributives and adverbials, and because the attributives and adverbials of Chinese and Vietnamese consist of verbs, adverbs, adjectives and nouns, labeling the parts of speech that differ most between Chinese and Vietnamese syntax yields the part-of-speech labels of the verbs, adverbs, adjectives and nouns in Chinese and Vietnamese;

the specific steps of Step3 are as follows:

Step3.1, to solve the insensitivity to word position and to let the model distinguish the two sentences, a position embedding layer (Position Embedding) and a sentence embedding layer (Segment Embedding) are added to the embedding layer;

Step3.2, the sentence features are vectorized in the embedding layer, and the external features of the sentences and the Chinese-Vietnamese syntactic difference features are then fused in the embedding layer by vector addition; the external features of a sentence comprise the traditional word embedding, the sentence segmentation feature and the position information feature, and the Chinese-Vietnamese syntactic difference features comprise the part-of-speech features of verbs, adverbs, adjectives and nouns in Chinese and Vietnamese;

when the sentence feature vector is obtained through the neural network in Step4, the neural network used is a neural network based on the self-attention mechanism, which can effectively extract the features of the sentence itself, and the feature vector produced by the self-attention mechanism is based on the sentence itself.