CN115393948A - Sign language video generation method based on improved Transformer model - Google Patents

Sign language video generation method based on improved Transformer model

Info

Publication number
CN115393948A
Authority
CN
China
Prior art keywords
sign language
layer
model
sequence
information
Prior art date
Legal status
Pending
Application number
CN202210821012.5A
Other languages
Chinese (zh)
Inventor
崔振超
陈子昂
齐静
Current Assignee
Hebei University
Original Assignee
Hebei University
Priority date
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202210821012.5A priority Critical patent/CN115393948A/en
Publication of CN115393948A publication Critical patent/CN115393948A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a sign language video generation method and device based on an improved Transformer model. The method first extracts the skeleton pose sequence from the sign language video and removes redundant information to reduce the amount of computation. In addition, considering the importance of spatio-temporal information to the precision of the generated sign language video, a rich semantic embedding module is designed to encode position and velocity information into the same high-dimensional space as the model input, which improves the coordination of joint motion and the precision of the feature representation. Finally, an encoder-decoder model with a pyramid structure is constructed: the encoder accepts a spoken sentence as input and encodes the information in the sequence into an intermediate representation, and the decoder then decodes the intermediate representation into the target sign language pose sequence in a semi-autoregressive manner. The method and device effectively improve the utilization of semantic information and the overall expressiveness of the actions, thereby significantly improving the accuracy and speed of sign language video generation.

Description

Sign language video generation method based on improved Transformer model
Technical Field
The invention relates to a man-machine interaction method, in particular to a sign language video generation method based on an improved Transformer model.
Background
Sign language is a visual gesture language that conveys information through gestures and the spatial movement of the limbs, and, through human-computer interaction, it is the most natural way for hearing-impaired people to communicate with the outside world. Real-time sign language communication has become an important topic in computer vision and natural language processing. Accurate automatic sign language generation can markedly improve communication between hearing-impaired and hearing people and help people with disabilities integrate better into society. The goal of sign language video generation is to translate spoken sentences into personalized sign language video that humans can understand.
Currently, sign language video generation falls into two categories: animation-synthesis-based and deep-learning-based. Traditional sign language animation synthesis mostly builds a database of sign word motions based on motion capture and human motion editing, retrieves the sign motion segments corresponding to the input text, splices and synthesizes them, and renders the video using virtual reality modeling language methods. Overall, sign language video generated by animation synthesis is convenient and efficient, but the approach depends on building a large-scale sign language animation database, the animation lacks vivid details of the executed actions, and the intelligibility of the synthesized animation is still limited by the manually designed appearance and motions. Therefore, more and more research has begun to explore more flexible and natural sign language generation schemes.
In recent years, as traditional methods cannot keep up with the growing scale of data, deep learning has achieved excellent results in sign language video generation. In 2020 the first deep-learning-based sign language synthesis method was proposed: an encoder-decoder structure based on a recurrent neural network encodes sign language video features into latent tokens and is combined with a generative adversarial network; the human pose skeleton and appearance are jointly fed into the generator as conditions, and a discriminator evaluates the realism of the generated video frames. However, models based on recurrent neural networks suffer from insufficient feature extraction due to the limitation of local receptive fields. Saunders et al. proposed a progressive Transformer-based sign language generation model that, for the first time, converts discrete spoken sentences into continuous sign language pose sequences end to end, in which a tokenization Transformer and a progressive Transformer respectively translate the spoken sentence into a sign word (gloss) sequence and convert the word sequence into a human skeleton video. Because the decoder generates the current frame depending on the previously generated pose sequence, error accumulation and high inference latency arise, the network converges slowly, and the accuracy of the generated sign language is not high.
At present, many deep-learning-based sign language video generation models use an autoregressive decoding mode: every frame generated so far must be considered when generating (decoding) each target sign language pose, so the translation process is serial, and error accumulation and high inference latency become an important bottleneck. A non-autoregressive model, by contrast, is a machine translation model that decodes in parallel. Unlike the left-to-right, word-by-word output of an autoregressive model, a non-autoregressive model outputs all time steps in parallel and decodes much faster, but at a cost in precision. Although some studies have used non-autoregressive decoding to generate sign language video, they do not consider the duplicated or missing skeleton frames caused by discarding the dependencies within the target sign language pose sequence.
Disclosure of Invention
The invention aims to provide a sign language video generation method and device based on an improved Transformer model, so as to solve the insufficient exploitation of pose information in existing methods.
The invention is realized by the following steps: a sign language video generation method based on an improved Transformer model comprises the following steps:
a. extracting a two-dimensional skeleton sequence of the target sign language poses in the target sign language video using OpenPose, keeping 8 upper-body joints and 21 joints of each of the left and right hands for model training; lifting the two-dimensional data representing the sign language poses to three-dimensional data, and cleaning the skeleton information of abnormal and erroneous joints by observing the distribution of the three-dimensional data, to form the target sign language pose sequence.
b. Inputting the spoken sentence and the target sign language gesture sequence into an encoder-decoder model, and training the encoder-decoder model to establish a mapping relation between the spoken sentence and the target sign language gesture sequence; and after the mapping relation is established, a trained sign language video generation network model is formed.
c. The trained sign language video generation network model is used for processing the input spoken sentences, the output of the model is the probability distribution of the sign language corresponding to each moment, and finally the spoken sentences are translated into the personalized sign language video expressed in human skeleton and graphic format from end to end.
In the step a, data cleaning comprises data processing modes such as data discarding and weighted linear interpolation.
The coder-decoder model includes a text feature coder with sign language length prediction and a pyramidal semi-autoregressive decoder incorporating a rich semantic embedding layer.
In step b, the encoder-decoder model is trained as follows: the spoken sentence is input into the text feature encoder to learn semantic features, which are passed to the pyramid semi-autoregressive decoder, and a convolutional neural network and a softmax classifier are added to the last layer of the encoder to predict the sign language length; the target sign language pose sequence is input into the pyramid semi-autoregressive decoder to extract spatio-temporal features, and the target sign language sequence is decoded in a semi-autoregressive manner by introducing a RelaxedMasked-Attention mechanism; the mapping relation between the spoken sentence and the sign language action is established through model training.
In step b, the extraction of spatio-temporal features encodes the information in the time dimension and the spatial displacement into the same space as the model input, to address the insufficient exploitation of semantic information; the pyramid semi-autoregressive decoder groups the target sign language pose sequence from coarse to fine, a cascaded dependency is kept between groups, and the target frames within each group are generated in parallel.
A spoken sentence containing N words is represented as $S = (s_1, \ldots, s_N)$, and the target sign language pose sequence is represented as $T = (t_1, \ldots, t_M)$, where $s_i$ is the $i$-th word in the spoken sentence, $N$ is the number of words in the spoken sentence, $t_i$ is the sign language pose of the $i$-th frame, and $M$ is the number of video frames. The goal is to fit a parametric model that maximizes the conditional probability $P(T \mid S)$ for translating text into a sign language pose sequence.

The joint set that fuses the joint velocity information between adjacent frames into the skeleton sequence is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\},$$
where $p_u^t$ is the three-dimensional coordinate information of joint $u$ at the $t$-th frame and $v_u^t = p_u^t - p_u^{t-1}$ is the velocity information of joint $u$ at the $t$-th frame, obtained by subtracting the three-dimensional coordinates of the $(t-1)$-th frame from those of the $t$-th frame.

The length of the target sign language pose sequence is $L$; $P(L \mid S)$ is modeled separately, and the maximum value of $L$ is set to 100.
The text feature encoder comprises a plurality of layers with the same structure but different training parameters, where each layer comprises two sub-layers, namely a multi-head attention mechanism and a position-wise fully connected feed-forward network. Each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, so the output of each sub-layer is represented as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
where $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence $S$ through the word embedding layer.

The word embedding layer uses a two-layer fully connected network (FC) and a ReLU activation function, where $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer and $w_2$ and $b_2$ are those of the second layer; the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
where $S_n$ represents the $n$-th word in the sentence.

The multi-head attention mechanism projects the query (Q), key (K) and value (V) through $h$ different linear transformations and finally concatenates the different attention results, where Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix. The multi-head attention mechanism is expressed as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
where $\mathrm{head}_i$ denotes the $i$-th head and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters.

The $i$-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
where $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V. Attention is the scaled dot-product attention: the higher the similarity of two vectors, the larger the dot product and the more attention the model pays. It is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
The rich semantic embedding layer maps the position and velocity information into the same vector space using a two-layer fully connected network (FC) and a ReLU activation function:
$$\hat{p}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 p_u^t + b_1) + b_2\big),$$
$$\hat{v}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 v_u^t + b_1) + b_2\big),$$
where $\hat{p}_u^t$ denotes the encoded position information of joint $u$ at the $t$-th frame and $\hat{v}_u^t$ the encoded velocity information of joint $u$ at the $t$-th frame; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, $w_2$ and $b_2$ are those of the second layer, the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the pose-sequence order information.

After the rich semantic embedding layer, the sign language pose sequence T is represented as
$$\hat{T} = \{\, (\hat{p}_u^t, \hat{v}_u^t) \,\},$$
the set of encoded position information and velocity information.

The semi-autoregressive decoder with a pyramid structure keeps the global autoregressive characteristic and the local non-autoregressive characteristic through the RelaxedMasked-Attention mechanism. $\hat{T}$ is divided into $\lceil M/d \rceil$ groups, where each group contains d frames, so the conditional probability can be expressed as
$$P(\hat{T} \mid \hat{S}) = \prod_{k=1}^{\lceil M/d \rceil} P\big(\mathrm{Group}_k \mid \mathrm{Group}_{<k}, \hat{S}\big),$$
where P is the conditional probability of generating the target sign language poses $\hat{T}$ from the spoken sentence $\hat{S}$, and G denotes the grouping of the target sign language pose sequence, specifically
$$\mathrm{Group}_k = \big(t_{(k-1)d+1}, \ldots, t_{kd}\big), \qquad k = 1, \ldots, \lceil M/d \rceil,$$
where M is the total number of frames of the video and d is the number of frames contained in each group.
When the RelaxedMasked-Attention mechanism model predicts the pose frames in $\mathrm{Group}_k$, it can see all the frame information of $\mathrm{Group}_1$ to $\mathrm{Group}_k$. Given the target number of frames M and the division level d, the relaxed mask is defined as
$$\mathrm{relaxed\_mask}[i][j] = \begin{cases} 0, & \lceil (j+1)/d \rceil \le \lceil (i+1)/d \rceil \\ -\infty, & \text{otherwise} \end{cases}$$
where relaxed_mask is a two-dimensional matrix with side length M and d is the number of frames contained in each group.

The RelaxedMasked-Attention with residual connection is defined as
$$\mathrm{RelaxedMaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top} + \mathrm{relaxed\_mask}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ again keeps the variance controlled. The output of the sign language video generation network model first passes through a Linear layer and then a Softmax layer; the Linear layer yields logits, the Softmax layer converts them into probability values, and the maximum softmax value gives the sign language pose at the current moment.
In order to better mine semantic information from sign language pose videos and to reduce computation and resource consumption, the invention, compared with the traditional Transformer model, adopts a coarse-to-fine pyramid semi-autoregressive model enriched with semantic information. This relieves the gradient explosion caused by error accumulation when training the decoder, so that the sign language video generation model keeps the autoregressive characteristic globally while generating the sign language pose sequence locally in parallel.
The invention codes the position and speed information into the same high-dimensional space as the input of the model decoder, improves the coordination of joint movement and leads the generated hand action to be more natural. Experiments prove that the speed information contained in the gesture language gesture sequence is an important characteristic which is ignored in the previous research.
Because the complexity of the hand means its joints rotate at many different angles, finger joints can overlap or become confused. To address this, the method adds the rich semantic embedding layer to encode position and velocity information into the same high-dimensional space, striking a good balance between decoding speed and precision; the Transformer decoder is improved to combine the sign language action information in the time dimension and the spatial displacement, filtering out redundant information in the video sequence and improving the precision of sign language synthesis.
The invention uses a coarse-to-fine pyramid semi-autoregressive decoder to generate the target sign language pose sequence. Unlike the left-to-right sequential output of an autoregressive Transformer, the pyramid semi-autoregressive Transformer decodes group by group and outputs the frames of each group in parallel, so decoding is faster. It strikes a better balance between the autoregressive and non-autoregressive modes, generating sign language poses locally in parallel while keeping the autoregressive characteristic globally. The method improves inference speed by 3.6 times without affecting precision.

The overall performance of the invention is better than mainstream algorithms, making it more suitable for human-computer interaction. The beneficial effects of the improved Transformer model are: compared with the original network, the method effectively improves the accuracy of the generated sign language video while achieving a high generation speed, which facilitates real-time deployment of the device in practice.
Drawings
Fig. 1 is a flow diagram of sign language video generation.
FIG. 2 is a diagram of the overall structural framework of the improved Transformer model.
FIG. 3 is a diagram of a dynamic information mining framework in accordance with the present invention.
FIG. 4 is a framework diagram of the unconstrained (relaxed) mask attention of the present invention.
FIG. 5 is a graph of inferred delay versus predicted sign language length for different models; wherein, (a) is a comparison graph of the configuration of the autoregressive model, the non-autoregressive model and the semi-autoregressive model, and (b) is a comparison graph of the grouping sizes of the semi-autoregressive models of 1, 2, 4 and 8 respectively.
Fig. 6 shows the sign language video results generated by different models on the Chinese sign language data set CSL.
FIG. 7 shows the sign language video results generated by different models on the German sign language data set RWTH-PHOENIX-Weather 2014T.
Detailed Description
As shown in FIG. 1, the flow of sign language video generation is realized by a sign language synthesis model and a sign language skeleton extraction module together to predict sign language actions from texts.
As shown in FIG. 2, the overall structure of the improved Transformer model is composed of a text feature encoder with sign language length prediction and a pyramid semi-autoregressive decoder combined with a rich semantic embedding layer.
The sign language video generation method comprises the following three steps:
step 1: firstly, data preprocessing is carried out, openposition is adopted to extract a two-dimensional skeleton sequence of a target sign language posture in a target sign language video, in order to remove redundant information and reduce the calculated amount, 8 joint points of the upper body and 21 joint points of the left hand and the right hand are intercepted, and 50 joints are used for model training. And secondly, promoting the two-dimensional information of the gesture language gesture sequence to be three-dimensional by using the 2D to 3D Inverse Kinematics tools. And by observing the distribution of the three-dimensional data and the problem of data imbalance, carrying out data cleaning on the skeleton information at abnormal and wrong joints, and carrying out processing such as discarding or weighted linear interpolation.
Step 2: the method comprises two processes of encoding of a spoken sentence and decoding of a target sign language gesture sequence.
The encoding process is to input the spoken sentence into a text feature encoder of the improved Transformer network, learn semantic features and transmit the semantic features to a decoder, and add an additional convolutional neural network and a softmax classifier at the last layer to perform sign language length prediction.
The decoding process inputs the sign language pose sequence extracted in step 1 into the pyramid semi-autoregressive decoder of the improved Transformer network, uses the rich semantic embedding layer to extract spatio-temporal features from the sign language pose sequence, and encodes the information in the time dimension and the spatial displacement into the same space as the model input to address the insufficient exploitation of semantic information; the semi-autoregressive decoder with a pyramid structure groups the target sign language pose sequence from coarse to fine, keeping a cascaded dependency between groups while generating the target frames within each group in parallel.
And step 3: the model output is the probability distribution of the sign language corresponding to each moment, and finally, the spoken sentence is translated into the personalized sign language video expressed by human skeleton and graphic format from end to end.
The text feature encoder in step 2 mainly comprises four parts, which are: word Embedding layer (Word Embedding), multi-head self-attention layer (self-attention), feed-forward layer (feed-forward), and sign language length prediction layer.
With reference to FIG. 2, the text feature encoder in step 2 learns semantic features from the input spoken sentence and passes them to the decoder. The word embedding layer encodes the source input S into the feature vector $\hat{S}$ using a two-layer fully connected network (FC) and a ReLU activation function; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, and $w_2$ and $b_2$ are those of the second layer. The weight matrices are multiplied by the input vector, the biases are added, and finally a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
where $S_n$ represents the $n$-th word in the sentence.
The encoder consists of a plurality of layers with the same structure but different training parameters, where each layer consists of two sub-layers, namely a multi-head attention mechanism and a position-wise feed-forward network. Each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, so the output of a sub-layer can be expressed as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
where $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence S through the word embedding layer.
Assuming the number of attention heads is h, the multiple heads are equivalent to projecting the original information into different spaces, ensuring that the Transformer can capture feature information from different subspaces. The query Q, key K and value V are projected through h different linear transformations, and finally the h head outputs are concatenated and passed into the feed-forward network layer:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
where Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix, $\mathrm{head}_i$ denotes the $i$-th head, and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters.

The $i$-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
where $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V. Attention is the scaled dot-product attention: the higher the similarity of two vectors, the larger the dot product and the more attention the model pays. It is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
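A compact sketch of equations (4) and (5) and the scaled dot-product attention, assuming the usual fused per-head projection matrices and a model dimension divisible by the number of heads:

```python
# Multi-head attention sketch: project Q, K, V, attend per head, concatenate, apply W^O.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # similarity, variance kept near 1
    if mask is not None:
        scores = scores + mask                           # masked positions carry -inf
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W_i^Q for all heads, fused
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        def split(x, w):                          # (B, L, d_model) -> (B, h, L, d_k)
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(split(q, self.w_q), split(k, self.w_k),
                                           split(v, self.w_v), mask)
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)                      # Concat(head_1..head_h) W^O
```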
Let x be the output of the multi-head attention layer. The position-wise feed-forward network provides a non-linear transformation; it is called position-wise because the transformation parameters are the same at every position i when passing through the linear layers:
$$\mathrm{FFN}(x) = w_2\,\mathrm{ReLU}(w_1 x + b_1) + b_2.$$
Sign language length prediction is performed on the output of the last layer of the encoder using a single-layer neural network and a softmax classifier. Assuming that the length of the target gesture language pose sequence is L, the translated text is divided into L/d segments at different decoder layers, the segments are generated in an autoregressive mode from left to right, and d frame target pose sequences are generated in parallel in a non-autoregressive mode in each segment. Therefore, the length of the target sign language sequence is an important latent variable, and the final translation quality is influenced. P (L | S) was modeled separately and set to a maximum value of L of 100 in the experiment. It is worth noting that the present invention uses predicted length-assisted reasoning only in testing sign language synthesis models, which use the length of a reference target sequence during training.
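A possible sketch of the length-prediction head is shown below; mean-pooling the encoder states and using a single linear layer before the softmax over at most 100 candidate lengths are assumptions, as the text only specifies a small network plus a softmax classifier on the last encoder layer.

```python
# Length-prediction sketch: summarize the encoder output and classify the target length L.
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    def __init__(self, d_model=512, max_len=100):
        super().__init__()
        self.proj = nn.Linear(d_model, max_len)

    def forward(self, encoder_out):               # (batch, N, d_model)
        pooled = encoder_out.mean(dim=1)          # sentence-level summary (assumed pooling)
        logits = self.proj(pooled)                # one class per candidate length
        return torch.softmax(logits, dim=-1)      # P(L | S), L in {1..100}

# At test time the predicted length assists inference; during training the
# reference target length is used instead, as stated above.
```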
The pyramid semi-autoregressive decoder in step 2 mainly comprises four parts, in order: a rich semantic embedding layer, an unconstrained mask attention (RelaxedMasked-Attention), an unconstrained mask encoder-decoder attention (RelaxedEncoder-Decoder attention), and a feed-forward layer (feed-forward).
With reference to FIG. 3, the invention mines dynamic information: the dark arrows represent thumb-joint movement information and the light arrows represent elbow-joint movement information. The rich semantic embedding layer learns features from the sign language pose sequence extracted in step 1. In the three-dimensional coordinate system, the position $p_u^t$ and the velocity $v_u^t$ represent the dynamic information of joint u in the t-th frame, and the joint velocity and position information between adjacent frames are merged into the joint set of the skeleton sequence, so the dynamic information is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\}, \qquad v_u^t = p_u^t - p_u^{t-1}.$$
At the rich semantic embedding layer, the position and velocity information are mapped into the same vector space using a two-layer fully connected network (FC) and a ReLU activation function:
$$\hat{p}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 p_u^t + b_1) + b_2\big),$$
$$\hat{v}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 v_u^t + b_1) + b_2\big),$$
where $\hat{p}_u^t$ denotes the encoded position information of joint u at the t-th frame and $\hat{v}_u^t$ the encoded velocity information of joint u at the t-th frame; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, $w_2$ and $b_2$ are those of the second layer, the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the pose-sequence order information.

After the rich semantic embedding layer, the sign language pose sequence T is represented as
$$\hat{T} = \{\, (\hat{p}_u^t, \hat{v}_u^t) \,\},$$
the set of encoded position information and velocity information.
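The sketch below illustrates one way to realize the rich semantic embedding: velocities are differences of consecutive 3D joint positions, and position and velocity are mapped by two-layer FC networks into the same model space. Flattening the joints of a frame into one vector and summing the two encodings (rather than concatenating them) are assumptions.

```python
# Rich semantic embedding sketch: encode per-frame joint positions and velocities
# into the same d_model space before the decoder (positional encoding added later).
import torch
import torch.nn as nn

class RichSemanticEmbedding(nn.Module):
    def __init__(self, num_joints=50, d_model=512):
        super().__init__()
        in_dim = num_joints * 3
        self.pos_fc = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))
        self.vel_fc = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))

    def forward(self, poses):                      # (batch, M, num_joints, 3)
        B, M = poses.shape[:2]
        vel = poses.clone()
        vel[:, 1:] = poses[:, 1:] - poses[:, :-1]  # v_u^t = p_u^t - p_u^{t-1}
        vel[:, 0] = 0.0                            # no previous frame at t = 0
        p = self.pos_fc(poses.reshape(B, M, -1))
        v = self.vel_fc(vel.reshape(B, M, -1))
        return p + v                               # fused frame embedding (assumed sum)
```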
in the pyramid semi-autoregressive decoder in the step 2, the grade d is divided into pyramid structures gradually and finely, and the model generates the target sign language sequence in an autoregressive mode of global series and local parallel. And taking the sign language attitude and the spoken sentence of the t-d frame as input through the output of the encoder, and outputting the predicted sign language attitude probability. The structure is similar to the encoder and is also composed of multiple layers. In addition to the two sublayers in each encoder layer, the decoder also adds a third sublayer, relaxedEncoder-decoding. In RelaxedEncoder-decoder, the output (K, V) of each encoder interacts with the input Q of all the decoders, causing the decoder to focus on the appropriate location in the encoder output. Thus, the middle attribute is different from self-attribute, with K, V from the encoder and Q from the output of the last position decoder.
By setting the unconstrained (relaxed) mask attention, the invention realizes a pyramid semi-autoregressive mechanism that avoids the high latency of the autoregressive model while also avoiding the drop in generation quality that a non-autoregressive model suffers from discarding the dependencies within the target sequence. To balance inference speed and generation quality, the pyramid semi-autoregressive model is built in a coarse-to-fine manner, setting four different group sizes in the layers of the decoder: d = 8, d = 4, d = 2, d = 1. Note that when d = 1 the layer is effectively equivalent to an autoregressive model. Fusing the long-range semantic information attended to by the lower layers with the short-range global semantic information captured at the top layer allows the decoder to maintain long-term dependencies while decoding more efficiently.
The semi-autoregressive decoder with a pyramid structure maintains the global autoregressive characteristic and the local non-autoregressive characteristic through the unconstrained mask attention (RelaxedMasked-Attention) mechanism. $\hat{T}$ is divided into $\lceil M/d \rceil$ groups, where each group contains d frames, so the conditional probability can be expressed as
$$P(\hat{T} \mid \hat{S}) = \prod_{k=1}^{\lceil M/d \rceil} P\big(\mathrm{Group}_k \mid \mathrm{Group}_{<k}, \hat{S}\big),$$
where P is the conditional probability of generating the target sign language poses $\hat{T}$ from the spoken sentence $\hat{S}$, and G denotes the grouping of the target sign language pose sequence, specifically
$$\mathrm{Group}_k = \big(t_{(k-1)d+1}, \ldots, t_{kd}\big), \qquad k = 1, \ldots, \lceil M/d \rceil,$$
where M is the total number of frames of the video and d is the number of frames contained in each group.
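The following sketch shows group-by-group (semi-autoregressive) inference consistent with the factorization above; the `decoder` and `pose_head` callables and the use of copies of the last frame as placeholders for the new group are assumptions, not the patent's exact implementation.

```python
# Semi-autoregressive decoding sketch: groups are produced sequentially, and the
# d frames inside each group are predicted by a single decoder call in parallel.
import torch

@torch.no_grad()
def semi_autoregressive_decode(decoder, pose_head, encoder_out, pred_len, d, start_pose):
    generated = start_pose                          # (batch, 1, pose_dim) start frame
    num_groups = (pred_len + d - 1) // d            # ceil(M / d) groups
    for _ in range(num_groups):
        # append d placeholder positions for the next group (copies of the last frame)
        placeholders = generated[:, -1:].repeat(1, d, 1)
        inp = torch.cat([generated, placeholders], dim=1)
        hidden = decoder(inp, encoder_out)          # relaxed mask lets the group see itself
        new_group = pose_head(hidden[:, -d:])       # d frames predicted in parallel
        generated = torch.cat([generated, new_group], dim=1)
    return generated[:, 1:1 + pred_len]             # drop start token, trim to length
```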
With reference to FIG. 4, the mask mechanism in the unconstrained mask attention prevents the leakage of context information and keeps the model from peeking at the sequence to be predicted in advance. Unlike standard self-attention, the invention builds a self-attention mechanism with a relaxed mask. The autoregressive property is maintained globally while the pyramid semi-autoregressive decoder generates in parallel within a group, attending to the pose frames in the previous groups and within the current group. The tokens of future frames to be masked are set to the invalid feature $-\infty$, so the positions to be masked have $-\infty$ added and their output after softmax is 0. When the RelaxedMasked-Attention model predicts the pose frames in $\mathrm{Group}_k$, it can see all the frame information of $\mathrm{Group}_1$ to $\mathrm{Group}_k$. Given the target number of frames M and the division level d, the relaxed mask is defined as
$$\mathrm{relaxed\_mask}[i][j] = \begin{cases} 0, & \lceil (j+1)/d \rceil \le \lceil (i+1)/d \rceil \\ -\infty, & \text{otherwise} \end{cases}$$
where relaxed_mask is a two-dimensional matrix with side length M and d is the number of frames contained in each group. The RelaxedMasked-Attention with residual connection is therefore defined as
$$\mathrm{RelaxedMaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top} + \mathrm{relaxed\_mask}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ again keeps the variance controlled. The output of the sign language video generation network model first passes through a Linear layer and then a Softmax layer; the Linear layer yields logits, the Softmax layer converts them into probability values, and the maximum softmax value gives the sign language pose at the current moment.
The darker color in the matrix of FIG. 4 is the- ∞ portion representing masked information, the lighter color represents unmasked information, the ordinate represents the position of the target vocabulary and the abscissa represents viewable position. This matrix is applied to each sequence to achieve a semi-autoregressive effect.
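A small sketch of how such a relaxed mask can be built, matching the definition above (0 within the current and earlier groups, minus infinity for later groups):

```python
# Relaxed mask sketch: position i may attend to every frame in its own group and
# in all earlier groups; later groups get -inf so their softmax weight becomes 0.
import torch

def relaxed_mask(M, d):
    group = torch.arange(M) // d                         # group index of each frame
    visible = group.unsqueeze(1) >= group.unsqueeze(0)   # row i sees column j iff group_j <= group_i
    mask = torch.zeros(M, M)
    mask[~visible] = float("-inf")
    return mask                                          # added to QK^T before softmax

# Example: M = 6 frames, groups of d = 2 -> frames 0-1 see only their own group,
# frames 2-3 additionally see frames 0-1, and so on.
print(relaxed_mask(6, 2))
```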
To obtain the optimal pyramid semi-autoregressive model, the invention sets grouping levels of different sizes and compares them with a light two-layer (N = 2) autoregressive Transformer. d = {2,2} denotes two decoder layers that both use the relaxed masked attention mechanism with d = 2. The relationship between the coarse-to-fine grouping level and the inference delay is shown in FIG. 5(b): the grouping level and the inference time are inversely proportional, and the inference delay shortens as d gradually increases. Table 1 summarizes the speed and accuracy of the pyramid semi-autoregressive model when generating the target sign language under different configurations. The results show that the BLEU score of the predicted sign language sequence gradually decreases as d increases. With d = {2,2}, the decoding speed of the pyramid semi-autoregressive model is 1.90 times faster than the Transformer while the BLEU score drops by only 0.09; with d = {8,8}, the pyramid semi-autoregressive model achieves a 7.32x speed-up while the BLEU score drops by 4.84.
TABLE 1 Ablation experiments of the pyramid semi-autoregressive model on the RWTH-PHOENIX-Weather 2014T data set
Table 1 shows the results of the ablation experiments of the pyramid semi-autoregressive model on the sign language synthesis task on the RWTH-PHOENIX-Weather 2014T data set; a coarse-to-fine pyramid of grouping levels further improves over any single configuration. d = {1,1,1} indicates that the division level of all layers is 1, which is in fact equivalent to an autoregressive Transformer. d = {8,4,2,1} means that the bottom layer uses relaxed masked attention with d = 8, the second layer uses d = 4, the third layer uses d = 2, and so on until the top layer, which outputs in an autoregressive manner. d = {2,2,2} and d = {8,6,4,1} are comparable in BLEU score, but the acceleration ratio of the latter is 3.60x, far exceeding the former's 1.84x. The invention also experimented with d = {8,6,4,2}: its acceleration ratio is 4.75x, but it sacrifices too much BLEU, falling 2.48 BLEU below the d = {8,4,2,1} configuration. Therefore, the invention sets d = {8,4,2,1} as the grouping level of the optimal pyramid structure.
In step 3, the model is trained and the mapping relation between the spoken sentences and the sign language actions is learned. The invention uses the following hyper-parameter configuration during training: embedding dimension = 512, hidden dimension = 512, feed-forward dimension = 2048, number of heads = 8, number of layers = 4, batch size = 16, dropout = 0.1. Adam is used as the optimizer, the total number of iterations is 40k, and in particular a cosine annealing schedule with warm-up is used for the learning rate lr:
$$lr = lr_{\min} + \frac{1}{2}\,(lr_{\max} - lr_{\min})\left(1 + \cos\!\left(\frac{step - step_{warmup}}{step_{total} - step_{warmup}}\,\pi\right)\right)$$
The maximum initial learning rate $lr_{\max}$ is set to 1e-3 and the minimum learning rate $lr_{\min}$ to 1e-4. $step_{warmup}$ denotes the total number of warm-up steps, and a progressive warm-up strategy is applied for the first 100 steps. $step_{total}$ equals the total number of epochs times the total number of training samples, divided by the set batch size.
In order to accelerate the gradient-descent convergence of the pyramid semi-autoregressive model and obtain a model with low generalization error, the knowledge learned by a pre-trained autoregressive model is transferred into the pyramid semi-autoregressive model for initialization: all parameters of the text feature encoder and the rich semantic embedding layer and part of the parameters of the pyramid semi-autoregressive decoder are copied, while the remaining parameters are randomly initialized. This accelerates convergence, avoids the gradient vanishing or explosion caused by missing or improper initialization, and slightly improves the accuracy of sign language synthesis.
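A minimal sketch of such an initialization, copying every parameter whose name and shape match between the pre-trained autoregressive checkpoint and the semi-autoregressive model (the module naming is an assumption):

```python
# Transfer-initialization sketch: copy matching parameters, keep the rest random.
import torch

def init_from_autoregressive(semi_ar_model, ar_checkpoint_path):
    # assumes the checkpoint stores a plain state_dict of the autoregressive model
    ar_state = torch.load(ar_checkpoint_path, map_location="cpu")
    own_state = semi_ar_model.state_dict()
    transferred = {k: v for k, v in ar_state.items()
                   if k in own_state and own_state[k].shape == v.shape}
    own_state.update(transferred)
    semi_ar_model.load_state_dict(own_state)
    return sorted(transferred)          # names of the copied parameters, for logging
```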
After the training of the model in step 3 is finished, the model outputs the probability distribution of the sign language pose at each moment, and finally the spoken sentence is translated end to end into a personalized sign language video expressed in human skeleton and graphical form.
The invention verifies the effectiveness of the model for sign language synthesis on two public sign language data sets. RWTH-PHOENIX-Weather 2014T records sign language video commentary of daily news and weather forecasts from the German public television station PHOENIX; it contains 8257 video samples, with a vocabulary of 2887 words combined into 5356 continuous sentences related to weather forecasts. The invention also trains and evaluates the model on the Chinese sign language data set (CSL), which has 100 sentences and 5000 continuous sign language videos in total, with each sentence containing 4-8 words on average.
The invention uses a Transformer-based sign language translation model as a back-translation evaluation method to evaluate the accuracy of the sign language synthesis results: the input of the sign language translation model is changed from sign language video frames to the sign language pose sequence, and a back-translation evaluation model is trained. Model scores are reported with standard indicators, including BLEU-1/4 and ROUGE, which measure translation quality in terms of precision and recall respectively, and DTW, which measures the similarity between the ground truth and the predicted sequence.
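For reference, a minimal DTW sketch that measures the similarity between a predicted pose sequence and the ground truth; the Euclidean frame distance and the length normalization are assumptions, as the text does not specify the exact DTW variant used.

```python
# Dynamic-time-warping sketch over flattened pose frames.
import numpy as np

def dtw_distance(pred, gt):
    """pred: (Tp, D), gt: (Tg, D) flattened pose sequences."""
    Tp, Tg = len(pred), len(gt)
    cost = np.full((Tp + 1, Tg + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tp + 1):
        for j in range(1, Tg + 1):
            d = np.linalg.norm(pred[i - 1] - gt[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Tp, Tg] / max(Tp, Tg)                     # length-normalized alignment cost
```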
In order to further demonstrate the effectiveness of the sign language video generation method and device based on the improved Transformer model, they are compared with other deep-learning-based sign language video generation algorithms, including an autoregressive Transformer and a non-autoregressive Transformer, on the RWTH-PHOENIX-Weather 2014T data set and the Chinese sign language data set (CSL). Table 2 summarizes the sign language generation results on the RWTH-PHOENIX-Weather 2014T data set.
As shown in Table 2, combining the autoregressive Transformer with the rich semantic embedding layer (SR) of the invention improves precision overall: the BLEU-1 and ROUGE scores increase by 0.15 and 0.38 respectively, illustrating the importance of combining position and motion features. A non-autoregressive Transformer is also set up as a comparison to explore the influence of the decoder's regression mode on precision, and Table 2 shows that the non-autoregressive Transformer greatly sacrifices the precision of the generated sign language video. In addition, comparing the autoregressive Transformer with the pyramid semi-autoregressive Transformer, although the accuracy score on the validation set drops slightly, the BLEU-1 and ROUGE scores on the test set improve by 0.68 and 0.79 respectively, verifying that the pyramid semi-autoregressive model effectively relieves error accumulation during decoding. Finally, the results show that the pyramid semi-autoregressive Transformer with rich semantic embedding achieves the best precision: compared with the autoregressive Transformer, it improves the BLEU score by 0.91 and the ROUGE score by 1.42 on the test set.
TABLE 2 Comparison of accuracy on the RWTH-PHOENIX-Weather 2014T data set
As shown in Table 3, two factors affect the inference efficiency of the model: the time complexity of the decoder's regression mode and the length of the target sequence. In the Transformer model, inference advances only one time step at a time; the time complexity of this step is denoted O(ts). For the probability distribution obtained in the inference phase, the invention uses greedy search to take the maximum-probability pose from the probability distribution of each sign language frame, with time complexity denoted O(gs). The time complexity of each model is shown in Table 3, where L represents the target sign language sequence length, N represents the number of decoder layers, and Σd* represents the sum of the group sizes in the pyramid structure. As shown in FIG. 5(a), the inference delay of the autoregressive model increases linearly with the predicted length, while the parallel nature of the non-autoregressive Transformer makes its inference delay independent of the target length, increasing the speed by 18.4 times. Without affecting precision, the pyramid semi-autoregressive Transformer improves the inference speed by 3.6 times.
TABLE 3 Inference delay, model acceleration ratio and time complexity on the RWTH-PHOENIX-Weather 2014T data set
FIG. 5 is a relationship between the inference delay and the predicted sign language length under different model configurations. Randomly selecting 20 sign language sequences with the length within the range of [31,163] for prediction, and obtaining experimental results of autoregressive transformers, non-autoregressive transformers and pyramid semi-autoregressive transformers in the step (a) in FIG. 5; fig. 5 (b) on the right shows the experimental results in the case of the pyramidal semi-autoregressive transform in the single configurations d =1, 2, 4, 8, respectively. In order to simplify the calculation amount, two layers of decoders are selected for experiment.
In order to show the performance of the method more intuitively, the sign language pose sequences generated by different models are visualized on RWTH-PHOENIX-Weather-2014T and on the Chinese sign language data set CSL. As shown in FIG. 6, 10 frames of the sign language sequence for "our national folk richness" are sampled from left to right for comparison, where each column shows the pose frame generated by a different model at a given moment. The first row is the result generated by the autoregressive Transformer: at frames $t_8$ and $t_9$ it deviates significantly from the ground truth (GT) and even distorts, owing to the error accumulation caused by the excessive dependency between frames. The second row is generated by the non-autoregressive Transformer, which loses frame $t_5$ and produces repeated frames at $t_7$, $t_8$ and $t_9$, because it discards the correlation between sign language pose frames entirely and outputs every moment in parallel. The third row is the more reasonable and natural sign language pose sequence generated by the pyramid semi-autoregressive Transformer of the present invention.
FIG. 7 shows the visualization on the RWTH-PHOENIX-Weather-2014T data set. In frames $t_8$ and $t_9$, the autoregressive Transformer produces unrealistic poses and the non-autoregressive Transformer loses frames. In contrast, the improved Transformer model of the present invention yields a more stable sign language pose sequence. This shows that the invention exploits the properties of semi-autoregressive decoding, avoiding the disadvantages of both the autoregressive and the non-autoregressive Transformer.

Claims (9)

1. A sign language video generation method based on an improved Transformer model is characterized by comprising the following steps:
a. extracting a two-dimensional skeleton sequence of the target sign language poses in the target sign language video using OpenPose, keeping 8 upper-body joints and 21 joints of each of the left and right hands for model training; lifting the two-dimensional data representing the sign language poses to three-dimensional data, and cleaning the skeleton information of abnormal and erroneous joints by observing the distribution of the three-dimensional data, to form the target sign language pose sequence;
b. inputting the spoken sentence and the target sign language gesture sequence into an encoder-decoder model, and training the encoder-decoder model to establish a mapping relation between the spoken sentence and the target sign language gesture sequence; after the mapping relation is established, a trained sign language video generation network model is formed;
c. the trained sign language video generation network model is used for processing the input spoken sentences, the output of the model is the probability distribution of the sign language corresponding to each moment, and finally the spoken sentences are translated into the personalized sign language video expressed in human skeleton and graphic format from end to end.
2. The method of claim 1, wherein the coder-decoder model comprises a text feature coder with sign language length prediction and a pyramidal semi-autoregressive decoder with a rich semantic embedding layer.
3. The sign language video generating method of claim 2, wherein in the step b, the coder-decoder model is trained by inputting the spoken sentence into a text feature coder to learn semantic features and transmitting the learned semantic features to a pyramid semi-autoregressive decoder, and adding a convolutional neural network and a softmax classifier to the last layer of the coder to predict the sign language length; inputting the target sign language attitude sequence into a pyramid semi-autoregressive decoder for extracting space-time characteristics, and decoding the target sign language sequence in a semi-autoregressive mode by introducing a Relaxedmasked-attention mechanism; and establishing a mapping relation between the spoken sentence and the sign language action through model training.
4. A sign language video generating method as claimed in claim 3, wherein in step b, the extraction of the spatio-temporal features is to encode the information on the time dimension and the spatial displacement into the same space as the input of the model; grouping the target sign language attitude sequence by a pyramid semi-autoregressive decoder, keeping the cascade characteristic among groups, and generating a target frame in parallel in each group;
a spoken sentence containing N words is represented as $S = (s_1, \ldots, s_N)$;
the target sign language pose sequence is represented as $T = (t_1, \ldots, t_M)$;
wherein $s_i$ is the $i$-th word in the spoken sentence, N is the number of words in the spoken sentence, $t_i$ is the sign language pose of the $i$-th frame, and M is the number of video frames;
the goal is to fit a parametric model that maximizes the conditional probability $P(T \mid S)$ for translating text into a sign language pose sequence;
the joint set that incorporates the joint velocity information between adjacent frames into the skeleton sequence is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\},$$
wherein $p_u^t$ is the three-dimensional coordinate information of joint u at the t-th frame, and $v_u^t = p_u^t - p_u^{t-1}$ is the velocity information of joint u at the t-th frame, obtained by subtracting the three-dimensional coordinates of the (t-1)-th frame from those of the t-th frame;
the length of the target sign language pose sequence is L, $P(L \mid S)$ is modeled separately, and the maximum value of L is set to 100.
5. A method as claimed in claim 4, wherein the text feature encoder comprises a plurality of layers having the same structure but different training parameters, each layer comprising two sub-layers, namely a multi-head attention mechanism and a position-wise fully connected feed-forward network; each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, and the output of each sub-layer is represented as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
wherein $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence S through the word embedding layer;
the word embedding layer uses a two-layer fully connected network (FC) and a ReLU activation function, where $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer and $w_2$ and $b_2$ are those of the second layer; the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
wherein $S_n$ represents the n-th word in the sentence.
6. The method of claim 5, wherein the multi-head attention mechanism projects the query (Q), key (K) and value (V) through h different linear transformations and finally concatenates the different attention results; the multi-head attention mechanism is expressed as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
wherein Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix, $\mathrm{head}_i$ denotes the i-th head, and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters;
the i-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
wherein $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V; Attention is the scaled dot-product attention, where the higher the similarity of two vectors, the larger the dot product and the more attention the model pays, computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
wherein the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
7. The sign language video generation method of claim 6, wherein the rich semantic embedding layer maps position and velocity information to the same vector space using two layers of fully connected networks (FC) and ReLU activation functions:
p̂_u^t = w_2 ReLU(w_1 p_u^t + b_1) + b_2    (7)

v̂_u^t = w_2 ReLU(w_1 v_u^t + b_1) + b_2    (8)

wherein p_u^t and v_u^t are the position (three-dimensional coordinate) information and the velocity information of the joint u at the t-th frame; p̂_u^t denotes the encoded position information of the joint u at the t-th frame; v̂_u^t denotes the encoded velocity information of the joint u at the t-th frame; w_1 is the weight matrix of the first fully connected layer, b_1 is the bias term of the first fully connected layer, w_2 is the weight matrix of the second fully connected layer, and b_2 is the bias term of the second fully connected layer; the weight matrix is multiplied by the input vector and the bias is added, and a position encoding module is introduced to keep the pose sequence order information.
After the rich semantic embedding layer, the sign language pose sequence T is represented as T̂, expressed as:

T̂ = {ĥ^1, ĥ^2, ..., ĥ^M}    (9)

wherein ĥ^t is the collection of the encoded position information and the encoded velocity information at the t-th frame.
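For illustration only, a sketch of a rich-semantic-embedding module mapping per-frame position and velocity into a shared space; flattening the joints into one vector per frame and fusing the two encodings by addition are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class RichSemanticEmbedding(nn.Module):
    """Encode per-frame position and velocity into one d_model-dimensional space.

    Sketch only: joints are flattened into a single vector per frame, and the
    encoded position and velocity are fused by addition - both are assumptions.
    """
    def __init__(self, in_dim: int, d_model: int = 512):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.vel_mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (batch, M, in_dim); velocity is the frame-to-frame difference,
        # with the first frame's velocity set to zero.
        vel = positions[:, 1:] - positions[:, :-1]
        vel = torch.cat([torch.zeros_like(positions[:, :1]), vel], dim=1)
        return self.pos_mlp(positions) + self.vel_mlp(vel)

# 2 clips, 100 frames, 50 joints x 3 coordinates flattened per frame (illustrative sizes).
rse = RichSemanticEmbedding(in_dim=150, d_model=512)
print(rse(torch.randn(2, 100, 150)).shape)  # torch.Size([2, 100, 512])
```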
The semi-autoregressive decoder with a pyramid structure preserves globally autoregressive and locally non-autoregressive characteristics through a Relaxed Masked-Attention mechanism: the target sign language pose sequence T̂ is divided into ⌈M/d⌉ groups, where each group contains d frames, and the conditional probability can be expressed as:
P(T̂|S) = ∏_{k=1}^{⌈M/d⌉} P(Group_k | Group_1, ..., Group_{k−1}, S)    (10)

wherein P(T̂|S) is the conditional probability of generating the target sign language pose sequence T̂ from the spoken sentence S; G denotes the grouping operation that divides the target sign language pose sequence into these groups, and the specific division is as follows:

Group_k = {ĥ^{(k−1)d+1}, ĥ^{(k−1)d+2}, ..., ĥ^{min(kd, M)}},  k = 1, ..., ⌈M/d⌉    (11)

wherein M is the total number of frames in the video and d is the number of frames contained in each group.
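For illustration only, a small helper showing the grouping of equation (11): M frames are divided into ⌈M/d⌉ consecutive groups of d frames (the last group may be shorter); decoding then proceeds group by group (autoregressive across groups) while the frames inside a group are produced together (non-autoregressive within a group). Zero-based indices are used here as a convenience of the sketch:

```python
import math

def split_into_groups(num_frames: int, d: int):
    """Divide frame indices (0-based here) into ceil(M/d) consecutive groups of d frames;
    the last group may be shorter when M is not a multiple of d."""
    num_groups = math.ceil(num_frames / d)
    return [list(range(k * d, min((k + 1) * d, num_frames))) for k in range(num_groups)]

# M = 10 frames, d = 4 frames per group -> 3 groups.
# Decoding would visit these groups in order (autoregressive across groups)
# while generating the frames inside each group together.
print(split_into_groups(10, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```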
8. The sign language video generating method as claimed in claim 7, wherein, when the model predicts the pose frames within Group_k through the Relaxed Masked-Attention mechanism, it can see all of the frame information of Group_1 to Group_k; given the target number of frames M and the division level d, the relaxed_mask is defined as follows:

relaxed_mask[i][j] = 0 if ⌈j/d⌉ ≤ ⌈i/d⌉, and −∞ otherwise    (12)

wherein relaxed_mask is a two-dimensional matrix with side length M, and d is the number of frames contained in each group;
the Relaxedmasked-attention with residual concatenation is defined as follows:
Figure FDA00037444419800000316
wherein is divided by
Figure FDA00037444419800000317
Keeping variance control at 1; the Softmax function is a normalized exponential function.
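For illustration only, a sketch of building the relaxed mask of equation (12) as an additive attention mask: positions allowed by the grouping rule get 0 and blocked positions get −∞, so that Softmax assigns them zero weight; the exact numerical encoding of the mask is an assumption of this sketch:

```python
import torch

def build_relaxed_mask(M: int, d: int) -> torch.Tensor:
    """Additive M x M attention mask for the grouped (semi-autoregressive) decoder.

    Frame i (row) may attend to frame j (column) when j belongs to the same
    group as i or to an earlier group; blocked positions are set to -inf so
    that Softmax gives them zero weight.
    """
    groups = torch.arange(M) // d                          # group index of each frame
    allowed = groups.unsqueeze(1) >= groups.unsqueeze(0)   # row group >= column group
    mask = torch.zeros(M, M)
    mask[~allowed] = float("-inf")
    return mask

print(build_relaxed_mask(6, 2))
# Rows 0-1 see frames 0-1, rows 2-3 see frames 0-3, rows 4-5 see all six frames.
```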
9. The method for generating a sign language video according to claim 8, wherein the output of the sign language video generation network passes first through a Linear layer and then through a Softmax layer; the Linear layer produces logits, the Softmax layer converts the logits into probability values, and the pose with the maximum Softmax value is taken as the pose corresponding to the current moment.
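For illustration only, a sketch of the output head of claim 9: a Linear layer produces logits, Softmax turns them into probabilities, and the argmax is taken as the pose for the current moment; the dimensions and the discrete pose vocabulary are assumptions of this sketch:

```python
import torch
import torch.nn as nn

# Output head sketch: the decoder feature of the current moment goes through a
# Linear layer to produce logits, Softmax turns the logits into probabilities,
# and the index with the maximum probability is taken as the current pose.
# d_model and num_poses are illustrative values, not taken from the claims.
d_model, num_poses = 512, 256
head = nn.Linear(d_model, num_poses)

decoder_output = torch.randn(1, d_model)   # one decoder time step
logits = head(decoder_output)              # logits from the Linear layer
probs = torch.softmax(logits, dim=-1)      # probabilities from the Softmax layer
pose_id = probs.argmax(dim=-1)             # pose corresponding to the current moment
print(pose_id.shape)                       # torch.Size([1])
```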
CN202210821012.5A 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model Pending CN115393948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210821012.5A CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210821012.5A CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Publications (1)

Publication Number Publication Date
CN115393948A true CN115393948A (en) 2022-11-25

Family

ID=84117415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210821012.5A Pending CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Country Status (1)

Country Link
CN (1) CN115393948A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796144A (en) * 2023-02-07 2023-03-14 中国科学技术大学 Controlled text generation method based on fixed format
CN115796144B (en) * 2023-02-07 2023-04-28 中国科学技术大学 Controlled text generation method based on fixed format
CN116152299A (en) * 2023-04-21 2023-05-23 之江实验室 Motion state detection method and device, storage medium and electronic equipment
CN116778576A (en) * 2023-06-05 2023-09-19 吉林农业科技学院 Time-space diagram transformation network based on time sequence action segmentation of skeleton

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN115393948A (en) Sign language video generation method based on improved Transformer model
CN111325323B (en) Automatic power transmission and transformation scene description generation method integrating global information and local information
CN110288665A (en) Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
CN113158875A (en) Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN110929092A (en) Multi-event video description method based on dynamic attention mechanism
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN101187990A (en) A session robotic system
CN111985205A (en) Aspect level emotion classification model
CN110347831A (en) Based on the sensibility classification method from attention mechanism
WO2020177214A1 (en) Double-stream video generation method based on different feature spaces of text
CN113780059B (en) Continuous sign language identification method based on multiple feature points
CN113609922B (en) Continuous sign language sentence recognition method based on mode matching
Tuyen et al. Conditional generative adversarial network for generating communicative robot gestures
CN114266905A (en) Image description generation model method and device based on Transformer structure and computer equipment
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
CN111079661B (en) Sign language recognition system
CN112883167A (en) Text emotion classification model based on hierarchical self-power-generation capsule network
CN117473561A (en) Privacy information identification system, method, equipment and medium based on artificial intelligence
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN111243060A (en) Hand drawing-based story text generation method
Balayn et al. Data-driven development of virtual sign language communication agents
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
Xu et al. Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination