CN114492796A - Multitask learning sign language translation method based on syntax tree - Google Patents
- Publication number
- CN114492796A (application CN202210122504.5A)
- Authority
- CN
- China
- Prior art keywords
- syntax tree
- sign language
- translation
- encoder
- decoder
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G09—EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
- G09B—EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
- G09B21/00—Teaching, or communicating with, the blind, deaf or mute
Abstract
A multi-task learning sign language translation method based on a syntax tree, relating to sign language translation. The method comprises the following steps: 1) obtaining the syntax tree of the spoken sentence and constructing a data set; 2) building a neural network consisting mainly of an encoder and a decoder, where the abstract feature representation produced by the encoder is fed into the decoder for decoding; 3) predicting the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence. The translation performance of the model is improved through multi-task learning. The method is suitable not only for the translation process of sign language translation but also for neural machine translation tasks, and its translation robustness is better than that of the basic Transformer model. During decoding, the model predicts not only the spoken sentence but also the corresponding syntax tree; through hard parameter sharing, the deep information hidden in the training data set is mined more fully, making the prediction results of the translation model more accurate.
Description
Technical Field
The invention relates to sign language translation, in particular to a multi-task learning sign language translation method based on a syntax tree.
Background
Sign language is a special visual language that expresses semantics through multiple channels of information. Its features can be divided overall into manual and non-manual features: manual features include hand shape, position, orientation and movement; non-manual features are mainly changes in body posture, including facial expressions and movements of the eyes, mouth, elbows, trunk and so on. Although sign language differs greatly in form from languages such as Chinese and English, it does have its own set of language rules. In translation between English and Chinese, translation can basically proceed word by word from front to back, but sign language translation requires understanding and translating in the two dimensions of time and space. For most people, sign language is very difficult to understand without systematic professional study, so sign language translation, which allows deaf people to communicate normally with hearing people, is very meaningful.
Sign language translation mainly comprises two processes, recognition and translation: recognition converts the sign language video into a sign language vocabulary sequence; translation converts the sign language vocabulary sequence into a spoken sentence. The sign language vocabulary sequence is an annotation of the sign language video: it labels the actual meaning of the different gestures in the video and is thus the most basic written gloss of the signs. As shown in fig. 1, a specific German sign language translation example illustrates the relevant concepts in sign language translation.
In the field of sign language translation, researchers have generally considered the effect of recognition on translation to be the most important, so most research has focused on the recognition process and little on the translation process. In the field of neural machine translation, a large body of work has shown that integrating a syntax tree into the translation model can improve its translation performance, but no work has studied how to integrate a syntax tree into sign language translation. In the last two years, multi-task learning has become one of the popular research directions: without adding extra data, several related tasks are designed so that the information implicit in the data set is fully exploited to improve the generalization ability of each task. In the field of sign language translation, although there is work that performs recognition and translation simultaneously, there is no work that attempts multi-task learning in the translation process.
Disclosure of Invention
The invention aims to provide a multi-task learning sign language translation method based on a syntax tree (MLSLTBPT) with stronger robustness by introducing multi-task learning into the translation process. In the model decoding process, not only the spoken sentence but also the corresponding syntax tree is predicted; through hard parameter sharing, the deep information hidden in the training data set is mined more fully, making the prediction results of the translation model more accurate.
The invention comprises the following steps:
1) obtaining the syntax tree of the spoken sentence and constructing a data set;
2) building a neural network;
3) predicting the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
In step 1), the syntax tree of the spoken sentence is obtained with a Berkeley parser, which yields the preorder traversal sequence of the syntax tree; because one preorder traversal sequence corresponds to a large number of specific tree structures, depth information is used so that the specific structure of the syntax tree can be restored. The constructed data set is the data set required for multi-task learning: the input is a sign language vocabulary sequence, and the output comprises three parts, namely the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence;

the input sign language vocabulary sequence is GLS = {g_1, g_2, g_3, ..., g_w}, where g_i represents the i-th word.
In step 2), the neural network is based on the open-source Transformer model and mainly comprises an encoder and a decoder; the sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder; the encoder is a stack of N identical layers, each layer consisting of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added to each sub-layer; assuming the input is x, the output of each sub-layer is expressed as:

sub_layer_out = LayerNorm(x + SubLayer(x)) (1)
the multi-head attention mechanism is evolved from the attention mechanism, which can be represented in the following form:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (2)

where d_k is the dimension of K. Q, K and V are the three vectors involved in the attention mechanism; Q represents a query, the similarity between the query and each K is computed as a weight, the weights are then normalized with a softmax function, and finally the weights and the corresponding V are combined in a weighted sum. The multi-head attention mechanism projects Q, K and V through h different linear transformations and then splices the results of the different attention heads together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (3)

where head_i can be expressed as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (4)
the feed-forward neural network provides a nonlinear transformation and converts the output of the previous layer into the appropriate dimension;

after the encoder obtains the abstract feature representation of the input, it is fed into the decoder for decoding; the decoder is similar in structure to the encoder, except that it further comprises a masked multi-head attention mechanism: the ordinary attention mechanism can see all words before and after each position, but decoding is a sequential process, so when the decoder predicts a specific word at time step k it may only see the first k-1 prediction results, and part of the content must therefore be masked; also, in the multi-head attention mechanism of the decoder, K and V come from the output of the encoder and Q comes from the previous output of the decoder.
In step 3), the specific steps of predicting the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence can be as follows:

firstly, the output of the decoder is projected by a linear layer to the dimension of the vocabulary size, and the probability distribution over words is then computed with a softmax function; for the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence, three branches are needed for prediction, and the final loss function is as follows:
L = -α Σ_{i=1}^{l_p} ( ⟨y_p^{(i)}, log ŷ_p^{(i)}⟩ + ⟨y_l^{(i)}, log ŷ_l^{(i)}⟩ ) - β Σ_{j=1}^{l_t} ⟨y_t^{(j)}, log ŷ_t^{(j)}⟩ (5)

where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of those two branches are set to be the same. l_p and l_t represent the lengths of the predicted syntax tree and the spoken sentence, respectively; ŷ_p^{(i)}, ŷ_l^{(i)} and ŷ_t^{(j)} are the predicted probability distributions of the three outputs at a given time step; y_p^{(i)}, y_l^{(i)} and y_t^{(j)} represent the corresponding true one-hot vectors; ⟨·,·⟩ represents the inner product. The final training goal is to minimize L.
Compared with the prior art, the invention has the following outstanding advantages and technical effects:
the invention learns sign language translation in multiple tasks based on the grammar tree, namely, the multiple tasks are learned in the translation process. Because the sign language vocabulary sequence is short description of the sign language video, no obvious grammar structure exists, so that grammar tree information cannot be added at the source end of the model, but the target port language sentence has a complete grammar structure, so that the target end not only can predict the spoken language sentence, but also can predict the corresponding grammar tree.
Because the sign language vocabulary sequence has no obvious grammar structure, after the grammar tree of the spoken sentence is obtained through prediction, the grammar tree can be added to the source end of the model and iterative training is carried out, so that the grammar tree information is better utilized. Because the difficulty is higher as the predicted syntax tree is deeper, the invention can further study the effect of the syntax trees predicting different depths on the model performance.
Because the neural machine translation task is basically consistent with the translation process of the sign language translation task, the method is not only suitable for the translation process of sign language translation, but also can be used for the neural machine translation task. Because the source end and the target end have complete grammar structures in the neural machine translation task, grammar tree information can be added at the source end of the model on the basis of the method, and the translation performance of the model is further improved. Experimental results show that the robustness of translation of the method is better than that of a basic Transformer model.
Drawings
FIG. 1 is a diagram of a specific example of the German sign language translation.
FIG. 2 is a schematic diagram of the overall architecture of the present invention.
Detailed Description
The following examples will further illustrate the present invention with reference to the accompanying drawings.
The embodiment of the invention comprises the following steps:
1) Obtain the syntax tree of the spoken sentence and construct the data set. Since a syntax tree is used, a Berkeley parser is employed to obtain the syntax tree of the spoken sentence. Because the tool yields the preorder traversal sequence of the syntax tree, which can correspond to a very large number of specific tree structures, depth information is also needed to restore the specific structure of the syntax tree. The data set required for multi-task learning must also be constructed: the input is a sign language vocabulary sequence, and the output has three parts, namely the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
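Why the depths are needed can be illustrated with a short sketch that rebuilds a tree from its preorder label sequence plus per-node depths; the function name and the (label, children) tuple representation are illustrative assumptions, not part of the patent.

```python
# Rebuild a syntax tree from its preorder label sequence and per-node
# depths. Each node is a (label, children) pair; names are illustrative.
def build_tree(labels, depths):
    root = (labels[0], [])
    stack = [(root, depths[0])]  # ancestors of the most recent node
    for label, depth in zip(labels[1:], depths[1:]):
        node = (label, [])
        # pop until the top of the stack is this node's parent,
        # i.e. the nearest preceding node with strictly smaller depth
        while stack and stack[-1][1] >= depth:
            stack.pop()
        stack[-1][0][1].append(node)
        stack.append((node, depth))
    return root

# Without the depths, the preorder sequence S NP VP V alone could
# describe several different trees; with them it is unambiguous.
tree = build_tree(["S", "NP", "VP", "V"], [0, 1, 1, 2])
```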
2) Build the neural network, which is based on the open-source Transformer model and mainly comprises an encoder and a decoder. The input sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder. The encoder is a stack of N identical layers, each layer consisting of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added to each sub-layer.

The multi-head attention mechanism is evolved from the attention mechanism: it projects the three vectors Q, K and V involved in the attention mechanism through h different linear transformations and then splices the results of the different attention heads together; the feed-forward neural network provides a nonlinear transformation and converts the output of the previous layer into the appropriate dimension.

After the encoder obtains the abstract feature representation of the input, it is fed into the decoder for decoding. The decoder is essentially identical in structure to the encoder, but has an additional masked multi-head attention mechanism; in the multi-head attention of the decoder, K and V come from the encoder output and Q comes from the previous output of the decoder.
3) Predict the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence. The output of the decoder is first projected by a linear layer to the dimension of the vocabulary size, and the probability distribution over words is then computed with a softmax function. Since the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence all need to be predicted, three branches are set up for prediction.

As shown in fig. 2, the Transformer-based prediction method of the invention mainly comprises three modules, namely an encoder module, a decoder module and a prediction module.
1. Data format
The input is a sign language vocabulary sequence GLS = {g_1, g_2, g_3, ..., g_w}, where g_i represents the i-th word;

the output includes three parts: the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
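The data format above can be made concrete with one hypothetical training sample; the glosses, node labels and words below are invented for illustration and are not taken from the patent's data set.

```python
# One illustrative multi-task training sample: the gloss sequence is the
# input; the three outputs are the targets of the three decoder branches.
sample = {
    "gls": ["g1", "g2", "g3"],                     # sign language vocabulary sequence
    "tree_preorder": ["S", "NP", "N", "VP", "V"],  # preorder node labels
    "tree_depths": [0, 1, 2, 1, 2],                # depth of each node
    "sentence": ["the", "spoken", "sentence"],     # target spoken words
}
# the first two outputs must align node for node
assert len(sample["tree_preorder"]) == len(sample["tree_depths"])
```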
2. Encoder module
The input sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder. The encoder is a stack of N identical layers, each layer consisting of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added to each sub-layer; the output of each sub-layer is expressed as

sub_layer_out = LayerNorm(x + SubLayer(x))
The multi-head attention mechanism is evolved from the attention mechanism, which is expressed in the following form:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

where d_k is the dimension of K. Q, K and V are the three vectors involved in the attention mechanism; Q represents a query, the similarity of the query with each K is computed as a weight, the weights are then normalized with a softmax function, and finally the weights and the corresponding V are combined in a weighted sum. The multi-head attention mechanism projects Q, K and V through h different linear transformations and then splices the results of the different attention heads together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where head_i can be expressed as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
the feed forward neural network provides a non-linear transformation that converts the output of the previous layer network into the appropriate dimensions.
3. Decoder module
After the encoder obtains the abstract feature representation of the input, it needs to be fed into the decoder for decoding. The decoder is essentially identical in structure to the encoder, with an additional masked multi-head attention mechanism: the ordinary attention mechanism can see all words before and after each position, but decoding is a sequential process, so when the decoder predicts a specific word at time step k it may only see the first k-1 prediction results, and part of the content must therefore be masked. Also, in the multi-head attention mechanism of the decoder, K and V come from the output of the encoder and Q comes from the previous output of the decoder.
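The masking described here can be sketched as a causal mask over the attention scores: setting the "future" positions to minus infinity before the softmax gives them zero weight. This is a minimal sketch of the idea, not the patent's implementation.

```python
import numpy as np

def masked_softmax(scores):
    # hide position j from position i whenever j > i (the future),
    # so that time step k only attends to earlier predictions
    scores = scores.copy()
    scores[np.triu(np.ones(scores.shape, dtype=bool), k=1)] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

w = masked_softmax(np.zeros((3, 3)))
# row i attends uniformly over positions 0..i and not beyond
```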
4. Prediction module
The output of the decoder is first projected by a linear layer to the dimension of the vocabulary size, and the probability distribution over words is then computed with a softmax function. Since the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence all need to be predicted, three branches are required, so the final loss function is:

L = -α Σ_{i=1}^{l_p} ( ⟨y_p^{(i)}, log ŷ_p^{(i)}⟩ + ⟨y_l^{(i)}, log ŷ_l^{(i)}⟩ ) - β Σ_{j=1}^{l_t} ⟨y_t^{(j)}, log ŷ_t^{(j)}⟩

where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of those two branches are set to be the same. l_p and l_t represent the lengths of the predicted syntax tree and the spoken sentence, respectively; ŷ_p^{(i)}, ŷ_l^{(i)} and ŷ_t^{(j)} are the predicted probability distributions of the three outputs at a given time step; y_p^{(i)}, y_l^{(i)} and y_t^{(j)} represent the corresponding true one-hot vectors; ⟨·,·⟩ represents the inner product. The final training goal is to minimize L.
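The three-branch loss can be sketched as follows: each term is a cross-entropy written as the inner product of the true one-hot vector with the log of the predicted distribution, and the tree-node and depth branches share the weight alpha. The function names are illustrative assumptions.

```python
import numpy as np

def ce(one_hot, probs):
    # -<y, log y_hat>: cross-entropy via the inner product in the text
    return -float(one_hot @ np.log(probs))

def multitask_loss(yp, yl, yt, pp, pl, pt, alpha, beta):
    # yp/yl: one-hot node labels and depths over the l_p tree positions;
    # yt: one-hot words over the l_t sentence positions; pp/pl/pt: the
    # corresponding predicted probability distributions
    tree = sum(ce(yp[i], pp[i]) + ce(yl[i], pl[i]) for i in range(len(yp)))
    sent = sum(ce(yt[j], pt[j]) for j in range(len(yt)))
    return alpha * tree + beta * sent
```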
Table 1 shows part of a sign language translation case study.

TABLE 1

As shown in table 1, three sign language translation examples in Chinese, English and German are listed. Since the Transformer is currently a very widely used baseline model, the invention is compared against it. In the first and second examples, the predictions of the invention are closer to the target text in expression and semantics; in the third example, the Transformer's prediction clearly omits part of the content, a failure often seen in neural machine translation, whereas the invention predicts the result correctly, further showing that the method improves the robustness of the translation model.
Claims (7)
1. A multitask learning sign language translation method based on a syntax tree is characterized by comprising the following steps:
1) obtaining the syntax tree of the spoken sentence and constructing a data set;
2) building a neural network;
3) predicting the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
2. The method as claimed in claim 1, wherein in step 1), the syntax tree of the spoken sentence is obtained with a Berkeley parser, which yields the preorder traversal sequence of the syntax tree; because one preorder traversal sequence corresponds to many specific tree structures, depth information is used to restore the specific structure of the syntax tree.
3. The method as claimed in claim 1, wherein in step 1), the constructed data set is the data set required for multi-task learning; its input is a sign language vocabulary sequence, and its output comprises three parts: the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
4. The method as claimed in claim 3, wherein the sign language vocabulary sequence is GLS = {g_1, g_2, g_3, ..., g_w}, where g_i represents the i-th word.
5. The method as claimed in claim 1, wherein in step 2), the neural network is based on the open-source Transformer model and is divided into two parts, an encoder and a decoder; the sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder; the encoder is a stack of N identical layers, each layer consisting of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added to each sub-layer; assuming the input is x, the output of each sub-layer is expressed as:

sub_layer_out = LayerNorm(x + SubLayer(x)) (1)
the multi-head attention mechanism is evolved from the attention mechanism, which can be represented in the following form:

Attention(Q, K, V) = softmax(QK^T / √d_k) V (2)

where d_k is the dimension of K. Q, K and V are the three vectors involved in the attention mechanism; Q represents a query, the similarity between the query and each K is computed as a weight, the weights are then normalized with a softmax function, and finally the weights and the corresponding V are combined in a weighted sum. The multi-head attention mechanism projects Q, K and V through h different linear transformations and then splices the results of the different attention heads together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O (3)

where head_i can be expressed as:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) (4)
the feedforward neural network provides nonlinear transformation and converts the output of the previous layer of network into a proper dimensionality;
after the encoder obtains the input abstract feature representation, the abstract feature representation is input into a decoder for decoding.
6. The method of claim 5, wherein the decoder is structurally similar to the encoder except that it further comprises a masked multi-head attention mechanism; in the multi-head attention mechanism of the decoder, K and V come from the output of the encoder and Q comes from the previous output of the decoder.
7. The method as claimed in claim 1, wherein in step 3), the steps of predicting the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence are as follows:

firstly, the output of the decoder is projected by a linear layer to the dimension of the vocabulary size, and the probability distribution over words is then computed with a softmax function; for the preorder traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence, three branches are needed for prediction, and the final loss function is as follows:

L = -α Σ_{i=1}^{l_p} ( ⟨y_p^{(i)}, log ŷ_p^{(i)}⟩ + ⟨y_l^{(i)}, log ŷ_l^{(i)}⟩ ) - β Σ_{j=1}^{l_t} ⟨y_t^{(j)}, log ŷ_t^{(j)}⟩ (5)

where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of those two branches are set to be the same; l_p and l_t represent the lengths of the predicted syntax tree and the spoken sentence, respectively; ŷ_p^{(i)}, ŷ_l^{(i)} and ŷ_t^{(j)} are the predicted probability distributions of the three outputs at a given time step; y_p^{(i)}, y_l^{(i)} and y_t^{(j)} represent the corresponding true one-hot vectors; ⟨·,·⟩ represents the inner product; the final training goal is to minimize L.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210122504.5A CN114492796A (en) | 2022-02-09 | 2022-02-09 | Multitask learning sign language translation method based on syntax tree |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114492796A true CN114492796A (en) | 2022-05-13 |
Family
ID=81478861
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210122504.5A Pending CN114492796A (en) | 2022-02-09 | 2022-02-09 | Multitask learning sign language translation method based on syntax tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114492796A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115392360A (en) * | 2022-08-11 | 2022-11-25 | 哈尔滨工业大学 | Transformer-based large bridge temperature-response related pattern recognition and health diagnosis method |
CN115392360B (en) * | 2022-08-11 | 2023-04-07 | 哈尔滨工业大学 | Transformer-based large bridge temperature-response related pattern recognition and health diagnosis method |
CN117275461A (en) * | 2023-11-23 | 2023-12-22 | 上海蜜度科技股份有限公司 | Multitasking audio processing method, system, storage medium and electronic equipment |
CN117275461B (en) * | 2023-11-23 | 2024-03-15 | 上海蜜度科技股份有限公司 | Multitasking audio processing method, system, storage medium and electronic equipment |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |