CN114492796A - Multitask learning sign language translation method based on syntax tree - Google Patents

Multitask learning sign language translation method based on syntax tree

Info

Publication number
CN114492796A
Authority
CN
China
Prior art keywords
syntax tree
sign language
translation
encoder
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210122504.5A
Other languages
Chinese (zh)
Inventor
陈毅东
张国成
史晓东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen University
Original Assignee
Xiamen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202210122504.5A
Publication of CN114492796A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G09 EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B21/00 Teaching, or communicating with, the blind, deaf or mute

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Machine Translation (AREA)

Abstract

A multi-task learning sign language translation method based on a syntax tree relates to sign language translation. The method comprises the following steps: 1) obtaining the syntax trees of the spoken sentences and constructing a data set; 2) building a neural network consisting mainly of an encoder and a decoder, where the abstract feature representation produced by the encoder is fed to the decoder for decoding; 3) predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence. The translation performance of the model is improved through multi-task learning. The method is applicable not only to the translation process of sign language translation but also to neural machine translation tasks. The robustness of the translation is better than that of the baseline Transformer model. During decoding, the model predicts not only the spoken sentence but also the corresponding syntax tree; through hard parameter sharing, the deep information hidden in the training data set is mined more fully, so the prediction result of the translation model is more accurate.

Description

Multitask learning sign language translation method based on syntax tree
Technical Field
The invention relates to sign language translation, in particular to a multi-task learning sign language translation method based on a syntax tree.
Background
Sign language is a special visual language that expresses meaning through multiple channels of information. Its features can be divided into manual and non-manual features: manual features include hand shape, position, orientation and movement; non-manual features are mainly changes in body posture, including facial expressions and movements of the eyes, mouth, elbows, torso and so on. Although sign language differs greatly in form from spoken languages such as Chinese and English, it does have its own set of linguistic rules. In translation between English and Chinese, translation can largely proceed word by word from front to back, but sign language translation requires understanding and translation in both the temporal and spatial dimensions. For most people, sign language is very difficult to understand without systematic professional training, so sign language translation, which allows deaf people to communicate normally with hearing people, is very meaningful.
Sign language translation mainly comprises two processes, recognition and translation: recognition converts the sign language video into a sign language vocabulary (gloss) sequence; translation converts the sign language vocabulary sequence into a spoken sentence. The sign language vocabulary sequence is an annotation of the sign language video that labels the actual meaning of the different gestures in the video, and is essentially a sequence of the most basic spoken words. As shown in fig. 1, a specific German sign language translation example is given to distinguish the relevant concepts in sign language translation.
In the field of sign language translation, researchers generally believe that the recognition step has the greater influence on translation quality, so most research has focused on the recognition process and little on the translation process. In the field of neural machine translation, a large body of work has shown that incorporating a syntax tree into the translation model can improve translation performance, but no work has studied how to incorporate syntax trees into sign language translation. In the past two years, multi-task learning has become one of the popular research directions: without adding extra data, a set of related tasks is designed so that the information implicit in the data set is fully exploited to improve the generalization ability of each task. In the field of sign language translation, although there is work on joint recognition and translation, no work has attempted multi-task learning within the translation process.
Disclosure of Invention
The invention aims to provide a more robust multi-task learning sign language translation method based on a syntax tree (MLSLTBPT) by introducing multi-task learning into the translation process. During decoding, the model predicts not only the spoken sentence but also the corresponding syntax tree; through hard parameter sharing, the deep information hidden in the training data set is mined more fully, so the prediction result of the translation model is more accurate.
The invention comprises the following steps:
1) obtaining the syntax tree of the spoken sentence and constructing a data set;
2) building a neural network;
3) predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
In step 1), the syntax tree of the spoken sentence is obtained with the Berkeley parser, which yields the pre-order traversal sequence of the syntax tree; because a pre-order traversal sequence alone corresponds to a large number of possible tree structures, depth information is required to recover the specific structure of the syntax tree. The constructed data set is the data set required for multi-task learning: the input is a sign language vocabulary sequence, and the output contains three parts, namely the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence, as sketched below.
The input sign language vocabulary sequence is GLS = {g1, g2, g3, ..., gw}, where gi represents the i-th word.
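The following is a minimal sketch of how one such training example could be assembled, assuming the parser output is available as a bracketed constituency tree readable with nltk.Tree; the choice to keep only non-terminal labels in the traversal, and all names and the toy parse, are illustrative assumptions rather than the exact procedure of the invention.

```python
# Sketch: build one multi-task training example from a gloss sequence,
# a spoken sentence, and its bracketed parse (e.g. from a Berkeley-style parser).
from nltk import Tree

def preorder_with_depth(tree, depth=1):
    """Pre-order traversal of non-terminal labels with the depth of each node."""
    labels, depths = [tree.label()], [depth]
    for child in tree:
        if isinstance(child, Tree):              # skip leaf tokens (words)
            sub_labels, sub_depths = preorder_with_depth(child, depth + 1)
            labels.extend(sub_labels)
            depths.extend(sub_depths)
    return labels, depths

def build_example(gloss_sequence, spoken_sentence, bracketed_parse):
    tree = Tree.fromstring(bracketed_parse)
    tree_tokens, node_depths = preorder_with_depth(tree)
    return {
        "input_glosses": gloss_sequence,          # GLS = {g_1, ..., g_w}
        "target_tree": tree_tokens,               # pre-order traversal sequence
        "target_depths": node_depths,             # depth of each tree node
        "target_sentence": spoken_sentence.split(),
    }

# Hypothetical usage with a toy German parse:
example = build_example(
    gloss_sequence=["ICH", "APFEL", "MOEGEN"],
    spoken_sentence="ich mag Äpfel",
    bracketed_parse="(S (NP (PPER ich)) (VP (VVFIN mag) (NP (NN Äpfel))))",
)
print(example["target_tree"])    # ['S', 'NP', 'PPER', 'VP', 'VVFIN', 'NP', 'NN']
print(example["target_depths"])  # [1, 2, 3, 2, 3, 3, 4]
```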
In step 2), the neural network is based on the open-source Transformer model and consists mainly of an encoder and a decoder. The sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder. The encoder is a stack of N identical layers, each composed of two sub-layers, a multi-head attention mechanism and a feed-forward neural network; a residual connection and layer normalization are added around each sub-layer. Assuming the input is x, the output of each sub-layer is:
sub_layer_out = LayerNorm(x + SubLayer(x))    (1)
The multi-head attention mechanism evolves from the attention mechanism, which can be expressed in the following form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (2)
q, K and V are three vectors related in an attention mechanism, Q represents a query, the similarity between the query and each K is calculated to serve as a weight, then a softmax function is used for normalizing the weight, and finally the weight and the corresponding V are subjected to weighted summation; the multi-head attention mechanism projects Q, K and V through h different linear transformations, and then different attention mechanism results are spliced together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)
where head_i can be expressed as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (4)
The feed-forward neural network provides a nonlinear transformation and maps the output of the preceding layer to a suitable dimensionality (a simplified sketch of the attention computation follows).
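A minimal PyTorch sketch of equations (2)-(4) is given below; it is a simplified illustration (no dropout, no masking) rather than the exact configuration used by the invention, which builds on the open-source Transformer.

```python
# Sketch of scaled dot-product attention and multi-head attention (Eqs. (2)-(4)).
import math
import torch
import torch.nn as nn

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity of the query with each K
    weights = torch.softmax(scores, dim=-1)             # normalize the weights
    return weights @ V                                   # weighted sum over V

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # linear projections of Q for the h heads
        self.w_k = nn.Linear(d_model, d_model)   # ... of K
        self.w_v = nn.Linear(d_model, d_model)   # ... of V
        self.w_o = nn.Linear(d_model, d_model)   # output projection W^O

    def forward(self, q, k, v):
        B = q.size(0)
        def split(x, proj):
            return proj(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        heads = attention(split(q, self.w_q), split(k, self.w_k), split(v, self.w_v))
        concat = heads.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(concat)                  # Concat(head_1, ..., head_h) W^O
```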
After the encoder produces the abstract feature representation of the input, this representation is fed into the decoder for decoding. The decoder is similar in structure to the encoder, except that it additionally contains a masked multi-head attention mechanism: the ordinary attention mechanism can see all words before and after each position, but decoding is a sequential process, so when the decoder predicts a word at the k-th time step it may only see the first k-1 predictions, and part of the content must therefore be masked. In addition, in the multi-head attention mechanism of the decoder, K and V come from the output of the encoder and Q comes from the previous output of the decoder.
In step 3), the specific steps of predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence may be as follows:
First, the output of the decoder is converted to the dimension of the vocabulary size through a linear layer, and the probability distribution over words is then computed with a softmax function. Because the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence must all be predicted, three branches are needed for prediction (a sketch of these branches follows the loss definition), and the final loss function is:
L = -α Σ_{i=1..lp} <y_p^(i), log ŷ_p^(i)> - α Σ_{i=1..lp} <y_l^(i), log ŷ_l^(i)> - β Σ_{j=1..lt} <y_t^(j), log ŷ_t^(j)>    (5)
where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of these two terms are set to the same value; lp and lt denote the lengths of the predicted syntax tree and of the spoken sentence, respectively;
ŷ_p^(i), ŷ_l^(i) and ŷ_t^(j) denote the predicted probability distributions of the three outputs at a given time step; y_p^(i), y_l^(i) and y_t^(j) denote the corresponding ground-truth one-hot vectors; <·,·> denotes the inner product. The final training goal is to minimize L.
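The following is a minimal sketch of the three prediction branches sharing the decoder representation (hard parameter sharing); the class and argument names, and the assumption that the tree-label and depth branches read the same decoder states, are illustrative rather than the exact design of the invention.

```python
# Sketch of the prediction module: the shared decoder output feeds three branches,
# each a linear layer followed by a (log-)softmax over its own vocabulary.
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    def __init__(self, d_model, tree_vocab, depth_vocab, word_vocab):
        super().__init__()
        self.tree_branch = nn.Linear(d_model, tree_vocab)    # pre-order tree labels
        self.depth_branch = nn.Linear(d_model, depth_vocab)  # node depths
        self.word_branch = nn.Linear(d_model, word_vocab)    # spoken-sentence words

    def forward(self, decoder_out_tree, decoder_out_sent):
        # decoder_out_*: [batch, length, d_model] decoder states for each target stream
        p_tree = torch.log_softmax(self.tree_branch(decoder_out_tree), dim=-1)
        p_depth = torch.log_softmax(self.depth_branch(decoder_out_tree), dim=-1)
        p_word = torch.log_softmax(self.word_branch(decoder_out_sent), dim=-1)
        return p_tree, p_depth, p_word
```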
Compared with the prior art, the invention has the following outstanding advantages and technical effects:
the invention learns sign language translation in multiple tasks based on the grammar tree, namely, the multiple tasks are learned in the translation process. Because the sign language vocabulary sequence is short description of the sign language video, no obvious grammar structure exists, so that grammar tree information cannot be added at the source end of the model, but the target port language sentence has a complete grammar structure, so that the target end not only can predict the spoken language sentence, but also can predict the corresponding grammar tree.
Because the sign language vocabulary sequence has no obvious grammatical structure, once the syntax tree of the spoken sentence has been predicted it can be added to the source side of the model for iterative training, so that the syntax tree information is exploited better. Because prediction becomes harder the deeper the syntax tree is, the invention can further study the effect of predicting syntax trees of different depths on model performance.
Because the neural machine translation task is essentially consistent with the translation process of the sign language translation task, the method is applicable not only to the translation process of sign language translation but also to neural machine translation tasks. Because both the source and target sides have complete grammatical structures in neural machine translation, syntax tree information can additionally be added at the source side of the model on the basis of this method, further improving translation performance. Experimental results show that the translation robustness of the method is better than that of the baseline Transformer model.
Drawings
FIG. 1 is a diagram of a specific example of German sign language translation.
FIG. 2 is a schematic diagram of the overall architecture of the present invention.
Detailed Description
The following examples will further illustrate the present invention with reference to the accompanying drawings.
The embodiment of the invention comprises the following steps:
1) Obtaining the syntax tree of the spoken sentence and constructing a data set. Because a syntax tree is used, the Berkeley parser is employed to obtain the syntax tree of the spoken sentence. The tool yields the pre-order traversal sequence of the syntax tree, and since such a sequence can correspond to a very large number of specific tree structures, depth information is also needed to recover the specific structure of the syntax tree, as the short example below illustrates. A data set required for multi-task learning is also constructed: the input is a sign language vocabulary sequence, and the output contains three parts, namely the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
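The short example below illustrates why depth information is needed: two different trees can share the same pre-order label sequence and are distinguished only by their depth sequences (the rebuild helper is an illustrative sketch, not part of the invention).

```python
# Sketch: recover the tree structure from a pre-order label sequence plus depths.
# Two trees with the same label sequence differ only in their depth sequences.
def rebuild(labels, depths):
    """Rebuild nested lists [label, children...] from pre-order labels and depths."""
    root = [labels[0]]
    stack = [(root, depths[0])]                 # (node, depth) path to the current node
    for label, depth in zip(labels[1:], depths[1:]):
        while stack and stack[-1][1] >= depth:  # pop back up to the parent level
            stack.pop()
        node = [label]
        stack[-1][0].append(node)
        stack.append((node, depth))
    return root

labels = ["S", "NP", "VP"]
print(rebuild(labels, [1, 2, 2]))  # ['S', ['NP'], ['VP']]   NP and VP are siblings
print(rebuild(labels, [1, 2, 3]))  # ['S', ['NP', ['VP']]]   VP is nested inside NP
```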
2) Building the neural network. The neural network is based on the open-source Transformer model and consists mainly of an encoder and a decoder. The input sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder. The encoder is a stack of N identical layers, each composed of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added around each sub-layer.
The multi-head attention mechanism evolves from the attention mechanism: it projects the three vectors Q, K and V involved in the attention mechanism through h different linear transformations and then concatenates the results of the different attention heads. The feed-forward neural network provides a nonlinear transformation and maps the output of the preceding layer to a suitable dimensionality.
After the encoder produces the abstract feature representation of the input, this representation is fed into the decoder for decoding. The decoder is essentially identical in structure to the encoder, but contains an additional masked multi-head attention mechanism, in which K and V come from the encoder output and Q comes from the previous output of the decoder.
3) Predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence. The output of the decoder is first converted to the dimension of the vocabulary size through a linear layer, and the probability distribution over words is then computed with a softmax function. Since the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence all need to be predicted, three branches are set up for prediction.
As shown in fig. 2, the Transformer-based prediction method of the present invention mainly comprises three modules: an encoder module, a decoder module and a prediction module.
1. Data format
The input is a sign language vocabulary sequence GLS = {g1, g2, g3, ..., gw}, where gi represents the i-th word;
the output includes three parts: a sequence of prior traversals of the syntax tree, a depth of each node of the syntax tree, and a spoken sentence.
2. Encoder module
The input sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder. The encoder is a stack of N identical layers, each composed of two sub-layers, a multi-head attention mechanism and a feed-forward neural network; a residual connection and layer normalization are added around each sub-layer, and the output of each sub-layer is
sub_layer_out = LayerNorm(x + SubLayer(x))
The multi-head attention mechanism evolves from the attention mechanism, which is expressed in the following form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
q, K and V are three vectors involved in the attention mechanism, Q represents a query, the similarity of the query and each K is calculated as a weight, then the weight is normalized by using a softmax function, and finally the weight and the corresponding V are subjected to weighted summation. The multi-head attention mechanism projects Q, K and V through h different linear transformations, and then different attention mechanism results are spliced together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
where head_i can be expressed as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
The feed-forward neural network provides a nonlinear transformation and maps the output of the preceding layer to a suitable dimensionality; a simplified sketch of the sub-layer wrapper follows.
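Below is a minimal sketch of the residual-plus-LayerNorm wrapper of equation (1) and the position-wise feed-forward sub-layer; the post-norm formulation, dimensions and class names are simplifying assumptions rather than the exact implementation.

```python
# Sketch of one encoder sub-layer wrapper: residual connection + LayerNorm (Eq. (1)).
import torch.nn as nn

class SubLayerConnection(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, sublayer):
        # sublayer is either the multi-head attention or the feed-forward network
        return self.norm(x + sublayer(x))

class FeedForward(nn.Module):
    """Position-wise feed-forward network: nonlinear transform to a suitable dimension."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        return self.net(x)
```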
3. Decoder module
After the encoder produces the abstract feature representation of the input, this representation needs to be fed into the decoder for decoding. The decoder is essentially consistent in structure with the encoder, with an additional masked multi-head attention mechanism: the ordinary attention mechanism can see all words before and after each position, but decoding is a sequential process, so when the decoder predicts a word at the k-th time step it may only see the first k-1 predictions, and part of the content must therefore be masked (a sketch of this mask follows). In addition, in the multi-head attention mechanism of the decoder, K and V come from the output of the encoder and Q comes from the previous output of the decoder.
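A minimal sketch of the causal mask behind the masked multi-head attention follows; the boolean-mask convention and tensor shapes are implementation assumptions, not the exact code of the invention.

```python
# Sketch of the causal (subsequent-position) mask used by the decoder's
# masked multi-head self-attention: position k may not attend to positions > k.
import torch

def subsequent_mask(size):
    # Lower-triangular True values; upper-triangular (future) positions are masked out.
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

mask = subsequent_mask(4)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
scores = torch.randn(4, 4)
scores = scores.masked_fill(~mask, float("-inf"))  # future scores -> -inf before softmax
weights = torch.softmax(scores, dim=-1)            # attention ignores masked positions
```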
4. Prediction module
The output of the decoder is first converted to the dimension of the vocabulary size through a linear layer, and the probability distribution over words is then computed with a softmax function. Since the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence all need to be predicted, three branches are needed for prediction, so the final loss function is:
L = -α Σ_{i=1..lp} <y_p^(i), log ŷ_p^(i)> - α Σ_{i=1..lp} <y_l^(i), log ŷ_l^(i)> - β Σ_{j=1..lt} <y_t^(j), log ŷ_t^(j)>
where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of these two terms are set to the same value; lp and lt denote the lengths of the predicted syntax tree and of the spoken sentence, respectively;
ŷ_p^(i), ŷ_l^(i) and ŷ_t^(j) denote the predicted probability distributions of the three outputs at a given time step; y_p^(i), y_l^(i) and y_t^(j) denote the corresponding ground-truth one-hot vectors; <·,·> denotes the inner product. The final training goal is to minimize L.
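A minimal sketch of the weighted multi-task loss is given below, assuming the three branches are trained with cross-entropy (the inner product of a one-hot target with the log-probabilities), with the tree-label and depth terms sharing the weight α and the spoken-sentence term weighted by β; the default values of alpha and beta are placeholders, not values disclosed by the invention.

```python
# Sketch of the multi-task training loss: two syntax-tree terms share weight alpha,
# the spoken-sentence term is weighted by beta; the training goal is to minimize L.
import torch
import torch.nn.functional as F

def multitask_loss(logits_tree, logits_depth, logits_word,
                   gold_tree, gold_depth, gold_word,
                   alpha=0.5, beta=1.0):
    # logits_*: [length, vocab]; gold_*: [length] integer indices (one-hot equivalents)
    loss_tree = F.cross_entropy(logits_tree, gold_tree, reduction="sum")
    loss_depth = F.cross_entropy(logits_depth, gold_depth, reduction="sum")
    loss_word = F.cross_entropy(logits_word, gold_word, reduction="sum")
    return alpha * (loss_tree + loss_depth) + beta * loss_word
```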
Table 1 gives a partial case study of sign language translation examples.
TABLE 1
(Table 1 is reproduced as an image in the original publication; it lists Chinese, English and German sign language translation examples, comparing the predictions of the baseline Transformer and of the proposed method with the target spoken sentences.)
As shown in Table 1, three sign language translation examples are listed, in Chinese, English and German respectively. Since the Transformer is currently a very widely used baseline model, the invention is compared with it. In the first and second examples, the predictions of the invention are closer to the target text in both expression and semantics; in the third example, the prediction of the Transformer clearly omits part of the content, a problem often seen in neural machine translation, while the method predicts this content correctly, further showing that the method improves the robustness of the translation model.

Claims (7)

1. A multitask learning sign language translation method based on a syntax tree is characterized by comprising the following steps:
1) obtaining the syntax tree of the spoken sentence and constructing a data set;
2) building a neural network;
3) predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
2. The method as claimed in claim 1, wherein in step 1), the syntax tree of the spoken sentence is obtained with the Berkeley parser, which yields the pre-order traversal sequence of the syntax tree; since the pre-order traversal sequence corresponds to many possible specific structures of the syntax tree, depth information is used to recover the specific structure of the syntax tree.
3. The method as claimed in claim 1, wherein in step 1), the constructed data set is the data set required for multi-task learning; the input is a sign language vocabulary sequence, and the output comprises three parts: the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree, and the spoken sentence.
4. The method as claimed in claim 3, wherein the sign language vocabulary sequence is GLS = {g1, g2, g3, ..., gw}, where gi represents the i-th word.
5. The method as claimed in claim 1, wherein in step 2), the neural network is based on the open-source Transformer model and is divided into two parts, an encoder and a decoder; the sign language vocabulary sequence first undergoes word embedding and positional encoding and is then fed into the encoder; the encoder is a stack of N identical layers, each composed of two sub-layers, a multi-head attention mechanism and a feed-forward neural network, with a residual connection and layer normalization added around each sub-layer; assuming the input is x, the output of each sub-layer is:
sub_layer_out = LayerNorm(x + SubLayer(x))    (1)
the multi-head attention mechanism evolves from the attention mechanism, which can be expressed in the following form:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V    (2)
q, K and V are three vectors related in an attention mechanism, Q represents a query, the similarity between the query and each K is calculated to serve as a weight, then a softmax function is used for normalizing the weight, and finally the weight and the corresponding V are subjected to weighted summation; the multi-head attention mechanism projects Q, K and V through h different linear transformations, and then different attention mechanism results are spliced together:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (3)
where head_i can be expressed as:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (4)
the feed-forward neural network provides a nonlinear transformation and maps the output of the preceding layer to a suitable dimensionality;
after the encoder produces the abstract feature representation of the input, this representation is fed into the decoder for decoding.
6. The method of claim 5, wherein the decoder is structurally similar to the encoder except that it further comprises a masked multi-head attention mechanism, in which K and V come from the output of the encoder and Q comes from the previous output of the decoder.
7. The method as claimed in claim 1, wherein in step 3), the steps of predicting the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence are as follows:
first, the output of the decoder is converted to the dimension of the vocabulary size through a linear layer, and the probability distribution over words is then computed with a softmax function; because the pre-order traversal sequence of the syntax tree, the depth of each node of the syntax tree and the spoken sentence must all be predicted, three branches are needed for prediction, and the final loss function is:
L = -α Σ_{i=1..lp} <y_p^(i), log ŷ_p^(i)> - α Σ_{i=1..lp} <y_l^(i), log ŷ_l^(i)> - β Σ_{j=1..lt} <y_t^(j), log ŷ_t^(j)>    (5)
where α and β are two hyper-parameters; because the nodes of the syntax tree and their depths are inseparable, the loss weights of these two terms are set to the same value; lp and lt denote the lengths of the predicted syntax tree and of the spoken sentence, respectively;
ŷ_p^(i), ŷ_l^(i) and ŷ_t^(j) denote the predicted probability distributions of the three outputs at a given time step; y_p^(i), y_l^(i) and y_t^(j) denote the corresponding ground-truth one-hot vectors; <·,·> denotes the inner product. The final training goal is to minimize L.
CN202210122504.5A 2022-02-09 2022-02-09 Multitask learning sign language translation method based on syntax tree Pending CN114492796A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210122504.5A CN114492796A (en) 2022-02-09 2022-02-09 Multitask learning sign language translation method based on syntax tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210122504.5A CN114492796A (en) 2022-02-09 2022-02-09 Multitask learning sign language translation method based on syntax tree

Publications (1)

Publication Number Publication Date
CN114492796A true CN114492796A (en) 2022-05-13

Family

ID=81478861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210122504.5A Pending CN114492796A (en) 2022-02-09 2022-02-09 Multitask learning sign language translation method based on syntax tree

Country Status (1)

Country Link
CN (1) CN114492796A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392360A (en) * 2022-08-11 2022-11-25 哈尔滨工业大学 Transformer-based large bridge temperature-response related pattern recognition and health diagnosis method
CN115392360B (en) * 2022-08-11 2023-04-07 哈尔滨工业大学 Transformer-based large bridge temperature-response related pattern recognition and health diagnosis method
CN117275461A (en) * 2023-11-23 2023-12-22 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment
CN117275461B (en) * 2023-11-23 2024-03-15 上海蜜度科技股份有限公司 Multitasking audio processing method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN108733792B (en) Entity relation extraction method
Zhang et al. Deep Neural Networks in Machine Translation: An Overview.
CN111737496A (en) Power equipment fault knowledge map construction method
Zhang et al. SG-Net: Syntax guided transformer for language representation
CN113642330A (en) Rail transit standard entity identification method based on catalog topic classification
CN112989834A (en) Named entity identification method and system based on flat grid enhanced linear converter
CN113535957B (en) Conversation emotion recognition network model system based on dual knowledge interaction and multitask learning, construction method, equipment and storage medium
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN114492796A (en) Multitask learning sign language translation method based on syntax tree
CN110442880B (en) Translation method, device and storage medium for machine translation
CN110852089B (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN116910086B (en) Database query method and system based on self-attention syntax sensing
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
Zhu et al. Robust spoken language understanding with unsupervised asr-error adaptation
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN115272908A (en) Multi-modal emotion recognition method and system based on improved Transformer
Yao Attention-based BiLSTM neural networks for sentiment classification of short texts
CN113076421A (en) Social noise text entity relation extraction optimization method and system
CN117033423A (en) SQL generating method for injecting optimal mode item and historical interaction information
Cai et al. Multi-view and attention-based bi-lstm for weibo emotion recognition
CN116611436A (en) Threat information-based network security named entity identification method
CN113536741B (en) Method and device for converting Chinese natural language into database language
CN114925695A (en) Named entity identification method, system, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination