CN115393948A - Sign language video generation method based on improved Transformer model - Google Patents

Sign language video generation method based on improved Transformer model

Info

Publication number
CN115393948A
Authority
CN
China
Prior art keywords
sign language
layer
model
sequence
information
Prior art date
Legal status
Pending
Application number
CN202210821012.5A
Other languages
Chinese (zh)
Inventor
崔振超
陈子昂
齐静
Current Assignee
Hebei University
Original Assignee
Hebei University
Priority date
Filing date
Publication date
Application filed by Hebei University filed Critical Hebei University
Priority to CN202210821012.5A priority Critical patent/CN115393948A/en
Publication of CN115393948A publication Critical patent/CN115393948A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a sign language video generation method and device based on an improved Transformer model. The method first extracts the skeleton pose sequence from the sign language video and removes redundant information to reduce the amount of computation. In addition, considering the importance of spatio-temporal information to the precision of the generated sign language video, a rich semantic embedding module is designed to encode position and velocity information into the same high-dimensional space as the model input, which improves the coordination of joint motion and the precision of the feature representation. Finally, an encoder-decoder model with a pyramid structure is constructed: the encoder accepts a spoken sentence as input and encodes the information in the sequence into an intermediate representation, and the decoder then decodes the intermediate representation into the target sign language pose sequence in a semi-autoregressive manner. The method and device effectively improve the utilization of semantic information and the overall expressiveness of the actions, thereby significantly improving the accuracy and speed of sign language video generation.

Description

Sign language video generation method based on improved Transformer model
Technical Field
The invention relates to a man-machine interaction method, in particular to a sign language video generation method based on an improved Transformer model.
Background
Sign language is a visual gesture language that conveys information through gestures and the spatial movement of the limbs, and, through human-computer interaction, it is the most natural way for hearing-impaired people to communicate with the outside world. Real-time sign language communication has become an important topic in computer vision and natural language processing. Accurate automatic sign language generation can markedly improve communication between hearing-impaired and hearing people and help people with disabilities integrate better into society. The goal of sign language video generation is to translate spoken sentences into personalized sign language video that humans can understand.
Currently, sign language video generation falls into two categories: animation-synthesis-based and deep-learning-based. Traditional sign language animation synthesis mostly builds a database of sign word motions based on motion capture and human motion editing, retrieves the sign motion segments corresponding to the input text, splices and synthesizes them, and renders the video using virtual reality modeling language methods. Overall, sign language video generated by animation synthesis is convenient and efficient, but the approach depends on building a large-scale sign language animation database, the animation lacks vivid details of the executed actions, and the intelligibility of the synthesized animation is still limited by the manually designed appearance and motions. Therefore, more and more research has begun to explore more flexible and natural sign language generation schemes.
In recent years, as traditional methods cannot keep up with the growing scale of data, deep learning has achieved excellent results in sign language video generation. In 2020 the first deep-learning-based sign language synthesis method was proposed: an encoder-decoder structure based on a recurrent neural network encodes sign language video features into latent tokens and is combined with a generative adversarial network; the human pose skeleton and appearance are jointly fed into the generator as conditions, and a discriminator evaluates the realism of the generated video frames. However, models based on recurrent neural networks suffer from insufficient feature extraction due to the limitation of local receptive fields. Saunders et al. proposed a progressive Transformer-based sign language generation model that, for the first time, converts discrete spoken sentences into continuous sign language pose sequences end to end, in which a tokenization Transformer and a progressive Transformer respectively translate the spoken sentence into a sign word (gloss) sequence and convert the word sequence into a human skeleton video. Because the decoder generates the current frame depending on the previously generated pose sequence, error accumulation and high inference latency arise, the network converges slowly, and the accuracy of the generated sign language is not high.
At present, many deep-learning-based sign language video generation models use an autoregressive decoding mode: every frame generated so far must be considered when generating (decoding) each target sign language pose, so the translation process is serial, and error accumulation and high inference latency become an important bottleneck. A non-autoregressive model, by contrast, is a machine translation model that decodes in parallel. Unlike the left-to-right, word-by-word output of an autoregressive model, a non-autoregressive model outputs all time steps in parallel and decodes much faster, but at a cost in precision. Although some studies have used non-autoregressive decoding to generate sign language video, they do not consider the duplicated or missing skeleton frames caused by discarding the dependencies within the target sign language pose sequence.
Disclosure of Invention
The invention aims to provide a sign language video generation method and device based on an improved Transformer model, so as to solve the insufficient exploitation of pose information in existing methods.
The invention is realized by the following steps: a sign language video generation method based on an improved Transformer model comprises the following steps:
a. extracting a two-dimensional skeleton sequence of the target sign language poses in the target sign language video using OpenPose, keeping 8 upper-body joints and 21 joints of each of the left and right hands for model training; lifting the two-dimensional data representing the sign language poses to three-dimensional data, and cleaning the skeleton information of abnormal and erroneous joints by observing the distribution of the three-dimensional data, to form the target sign language pose sequence.
b. Inputting the spoken sentence and the target sign language gesture sequence into an encoder-decoder model, and training the encoder-decoder model to establish a mapping relation between the spoken sentence and the target sign language gesture sequence; and after the mapping relation is established, a trained sign language video generation network model is formed.
c. The trained sign language video generation network model is used for processing the input spoken sentences, the output of the model is the probability distribution of the sign language corresponding to each moment, and finally the spoken sentences are translated into the personalized sign language video expressed in human skeleton and graphic format from end to end.
In the step a, data cleaning comprises data processing modes such as data discarding and weighted linear interpolation.
The coder-decoder model includes a text feature coder with sign language length prediction and a pyramidal semi-autoregressive decoder incorporating a rich semantic embedding layer.
In step b, the encoder-decoder model is trained as follows: the spoken sentence is input into the text feature encoder to learn semantic features, which are passed to the pyramid semi-autoregressive decoder, and a convolutional neural network and a softmax classifier are added to the last layer of the encoder to predict the sign language length; the target sign language pose sequence is input into the pyramid semi-autoregressive decoder to extract spatio-temporal features, and the target sign language sequence is decoded in a semi-autoregressive manner by introducing a RelaxedMasked-Attention mechanism; the mapping relation between the spoken sentence and the sign language action is established through model training.
In step b, the extraction of spatio-temporal features encodes the information in the time dimension and the spatial displacement into the same space as the model input, to address the insufficient exploitation of semantic information; the pyramid semi-autoregressive decoder groups the target sign language pose sequence from coarse to fine, a cascaded dependency is kept between groups, and the target frames within each group are generated in parallel.
A spoken sentence containing N words is represented as $S = (s_1, \ldots, s_N)$, and the target sign language pose sequence is represented as $T = (t_1, \ldots, t_M)$, where $s_i$ is the $i$-th word in the spoken sentence, $N$ is the number of words in the spoken sentence, $t_i$ is the sign language pose of the $i$-th frame, and $M$ is the number of video frames. The goal is to fit a parametric model that maximizes the conditional probability $P(T \mid S)$ for translating text into a sign language pose sequence.

The joint set that fuses the joint velocity information between adjacent frames into the skeleton sequence is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\},$$
where $p_u^t$ is the three-dimensional coordinate information of joint $u$ at the $t$-th frame and $v_u^t = p_u^t - p_u^{t-1}$ is the velocity information of joint $u$ at the $t$-th frame, obtained by subtracting the three-dimensional coordinates of the $(t-1)$-th frame from those of the $t$-th frame.

The length of the target sign language pose sequence is $L$; $P(L \mid S)$ is modeled separately, and the maximum value of $L$ is set to 100.
The text feature encoder comprises a plurality of layers with the same structure but different training parameters, where each layer comprises two sub-layers, namely a multi-head attention mechanism and a position-wise fully connected feed-forward network. Each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, so the output of each sub-layer is represented as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
where $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence $S$ through the word embedding layer.

The word embedding layer uses a two-layer fully connected network (FC) and a ReLU activation function, where $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer and $w_2$ and $b_2$ are those of the second layer; the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
where $S_n$ represents the $n$-th word in the sentence.

The multi-head attention mechanism projects the query (Q), key (K) and value (V) through $h$ different linear transformations and finally concatenates the different attention results, where Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix. The multi-head attention mechanism is expressed as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
where $\mathrm{head}_i$ denotes the $i$-th head and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters.

The $i$-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
where $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V. Attention is the scaled dot-product attention: the higher the similarity of two vectors, the larger the dot product and the more attention the model pays. It is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
The rich semantic embedding layer maps the position and velocity information into the same vector space using a two-layer fully connected network (FC) and a ReLU activation function:
$$\hat{p}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 p_u^t + b_1) + b_2\big),$$
$$\hat{v}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 v_u^t + b_1) + b_2\big),$$
where $\hat{p}_u^t$ denotes the encoded position information of joint $u$ at the $t$-th frame and $\hat{v}_u^t$ the encoded velocity information of joint $u$ at the $t$-th frame; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, $w_2$ and $b_2$ are those of the second layer, the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the pose-sequence order information.

After the rich semantic embedding layer, the sign language pose sequence T is represented as
$$\hat{T} = \{\, (\hat{p}_u^t, \hat{v}_u^t) \,\},$$
the set of encoded position information and velocity information.

The semi-autoregressive decoder with a pyramid structure keeps the global autoregressive characteristic and the local non-autoregressive characteristic through the RelaxedMasked-Attention mechanism. $\hat{T}$ is divided into $\lceil M/d \rceil$ groups, where each group contains d frames, so the conditional probability can be expressed as
$$P(\hat{T} \mid \hat{S}) = \prod_{k=1}^{\lceil M/d \rceil} P\big(\mathrm{Group}_k \mid \mathrm{Group}_{<k}, \hat{S}\big),$$
where P is the conditional probability of generating the target sign language poses $\hat{T}$ from the spoken sentence $\hat{S}$, and G denotes the grouping of the target sign language pose sequence, specifically
$$\mathrm{Group}_k = \big(t_{(k-1)d+1}, \ldots, t_{kd}\big), \qquad k = 1, \ldots, \lceil M/d \rceil,$$
where M is the total number of frames of the video and d is the number of frames contained in each group.
When the RelaxedMasked-Attention mechanism model predicts the pose frames in $\mathrm{Group}_k$, it can see all the frame information of $\mathrm{Group}_1$ to $\mathrm{Group}_k$. Given the target number of frames M and the division level d, the relaxed mask is defined as
$$\mathrm{relaxed\_mask}[i][j] = \begin{cases} 0, & \lceil (j+1)/d \rceil \le \lceil (i+1)/d \rceil \\ -\infty, & \text{otherwise} \end{cases}$$
where relaxed_mask is a two-dimensional matrix with side length M and d is the number of frames contained in each group.

The RelaxedMasked-Attention with residual connection is defined as
$$\mathrm{RelaxedMaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top} + \mathrm{relaxed\_mask}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ again keeps the variance controlled. The output of the sign language video generation network model first passes through a Linear layer and then a Softmax layer; the Linear layer yields logits, the Softmax layer converts them into probability values, and the maximum softmax value gives the sign language pose at the current moment.
In order to better mine semantic information from sign language pose videos and to reduce computation and resource consumption, the invention, compared with the traditional Transformer model, adopts a coarse-to-fine pyramid semi-autoregressive model enriched with semantic information. This relieves the gradient explosion caused by error accumulation when training the decoder, so that the sign language video generation model keeps the autoregressive characteristic globally while generating the sign language pose sequence locally in parallel.
The invention codes the position and speed information into the same high-dimensional space as the input of the model decoder, improves the coordination of joint movement and leads the generated hand action to be more natural. Experiments prove that the speed information contained in the gesture language gesture sequence is an important characteristic which is ignored in the previous research.
Because the complexity of the hand means its joints rotate at many different angles, finger joints can overlap or become confused. To address this, the method adds the rich semantic embedding layer to encode position and velocity information into the same high-dimensional space, striking a good balance between decoding speed and precision; the Transformer decoder is improved to combine the sign language action information in the time dimension and the spatial displacement, filtering out redundant information in the video sequence and improving the precision of sign language synthesis.
The invention uses a coarse-to-fine pyramid semi-autoregressive decoder to generate the target sign language pose sequence. Unlike the left-to-right sequential output of an autoregressive Transformer, the pyramid semi-autoregressive Transformer decodes group by group and outputs the frames of each group in parallel, so decoding is faster. It strikes a better balance between the autoregressive and non-autoregressive modes, generating sign language poses locally in parallel while keeping the autoregressive characteristic globally. The method improves inference speed by 3.6 times without affecting precision.

The overall performance of the invention is better than mainstream algorithms, making it more suitable for human-computer interaction. The beneficial effects of the improved Transformer model are: compared with the original network, the method effectively improves the accuracy of the generated sign language video while achieving a high generation speed, which facilitates real-time deployment of the device in practice.
Drawings
Fig. 1 is a flow diagram of sign language video generation.
FIG. 2 is a diagram of the overall structural framework of the improved Transformer model.
FIG. 3 is a diagram of a dynamic information mining framework in accordance with the present invention.
FIG. 4 is a framework diagram of the unconstrained (relaxed) mask attention of the present invention.
FIG. 5 is a graph of inferred delay versus predicted sign language length for different models; wherein, (a) is a comparison graph of the configuration of the autoregressive model, the non-autoregressive model and the semi-autoregressive model, and (b) is a comparison graph of the grouping sizes of the semi-autoregressive models of 1, 2, 4 and 8 respectively.
Fig. 6 shows the sign language video results generated by different models on the Chinese sign language data set CSL.
FIG. 7 shows the sign language video results generated by different models on the German sign language data set RWTH-PHOENIX-Weather 2014T.
Detailed Description
As shown in FIG. 1, the flow of sign language video generation is realized by a sign language synthesis model and a sign language skeleton extraction module together to predict sign language actions from texts.
As shown in FIG. 2, the overall structure of the improved Transformer model is composed of a text feature encoder with sign language length prediction and a pyramid semi-autoregressive decoder combined with a rich semantic embedding layer.
The sign language video generation method comprises the following three steps:
step 1: firstly, data preprocessing is carried out, openposition is adopted to extract a two-dimensional skeleton sequence of a target sign language posture in a target sign language video, in order to remove redundant information and reduce the calculated amount, 8 joint points of the upper body and 21 joint points of the left hand and the right hand are intercepted, and 50 joints are used for model training. And secondly, promoting the two-dimensional information of the gesture language gesture sequence to be three-dimensional by using the 2D to 3D Inverse Kinematics tools. And by observing the distribution of the three-dimensional data and the problem of data imbalance, carrying out data cleaning on the skeleton information at abnormal and wrong joints, and carrying out processing such as discarding or weighted linear interpolation.
Step 2: the method comprises two processes of encoding of a spoken sentence and decoding of a target sign language gesture sequence.
The encoding process is to input the spoken sentence into a text feature encoder of the improved Transformer network, learn semantic features and transmit the semantic features to a decoder, and add an additional convolutional neural network and a softmax classifier at the last layer to perform sign language length prediction.
The decoding process inputs the sign language pose sequence extracted in step 1 into the pyramid semi-autoregressive decoder of the improved Transformer network, uses the rich semantic embedding layer to extract spatio-temporal features from the sign language pose sequence, and encodes the information in the time dimension and the spatial displacement into the same space as the model input to address the insufficient exploitation of semantic information; the semi-autoregressive decoder with a pyramid structure groups the target sign language pose sequence from coarse to fine, keeping a cascaded dependency between groups while generating the target frames within each group in parallel.
And step 3: the model output is the probability distribution of the sign language corresponding to each moment, and finally, the spoken sentence is translated into the personalized sign language video expressed by human skeleton and graphic format from end to end.
The text feature encoder in step 2 mainly comprises four parts, which are: word Embedding layer (Word Embedding), multi-head self-attention layer (self-attention), feed-forward layer (feed-forward), and sign language length prediction layer.
With reference to FIG. 2, the text feature encoder in step 2 learns semantic features from the input spoken sentence and passes them to the decoder. The word embedding layer encodes the source input S into the feature vector $\hat{S}$ using a two-layer fully connected network (FC) and a ReLU activation function; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, and $w_2$ and $b_2$ are those of the second layer. The weight matrices are multiplied by the input vector, the biases are added, and finally a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
where $S_n$ represents the $n$-th word in the sentence.
The encoder consists of a plurality of layers with the same structure but different training parameters, where each layer consists of two sub-layers, namely a multi-head attention mechanism and a position-wise feed-forward network. Each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, so the output of a sub-layer can be expressed as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
where $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence S through the word embedding layer.
Assuming the number of attention heads is h, the multiple heads are equivalent to projecting the original information into different spaces, ensuring that the Transformer can capture feature information from different subspaces. The query Q, key K and value V are projected through h different linear transformations, and finally the h head outputs are concatenated and passed into the feed-forward network layer:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
where Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix, $\mathrm{head}_i$ denotes the $i$-th head, and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters.

The $i$-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
where $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V. Attention is the scaled dot-product attention: the higher the similarity of two vectors, the larger the dot product and the more attention the model pays. It is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
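A compact sketch of equations (4) and (5) and the scaled dot-product attention, assuming the usual fused per-head projection matrices and a model dimension divisible by the number of heads:

```python
# Multi-head attention sketch: project Q, K, V, attend per head, concatenate, apply W^O.
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v, mask=None):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # similarity, variance kept near 1
    if mask is not None:
        scores = scores + mask                           # masked positions carry -inf
    return torch.softmax(scores, dim=-1) @ v

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)   # W_i^Q for all heads, fused
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O

    def forward(self, q, k, v, mask=None):
        B = q.size(0)
        def split(x, w):                          # (B, L, d_model) -> (B, h, L, d_k)
            return w(x).view(B, -1, self.h, self.d_k).transpose(1, 2)
        out = scaled_dot_product_attention(split(q, self.w_q), split(k, self.w_k),
                                           split(v, self.w_v), mask)
        out = out.transpose(1, 2).contiguous().view(B, -1, self.h * self.d_k)
        return self.w_o(out)                      # Concat(head_1..head_h) W^O
```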
Let x be the output of the multi-head attention layer. The position-wise feed-forward network provides a non-linear transformation; it is called position-wise because the transformation parameters are the same at every position i when passing through the linear layers:
$$\mathrm{FFN}(x) = w_2\,\mathrm{ReLU}(w_1 x + b_1) + b_2.$$
Sign language length prediction is performed on the output of the last layer of the encoder using a single-layer neural network and a softmax classifier. Assuming that the length of the target gesture language pose sequence is L, the translated text is divided into L/d segments at different decoder layers, the segments are generated in an autoregressive mode from left to right, and d frame target pose sequences are generated in parallel in a non-autoregressive mode in each segment. Therefore, the length of the target sign language sequence is an important latent variable, and the final translation quality is influenced. P (L | S) was modeled separately and set to a maximum value of L of 100 in the experiment. It is worth noting that the present invention uses predicted length-assisted reasoning only in testing sign language synthesis models, which use the length of a reference target sequence during training.
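A possible sketch of the length-prediction head is shown below; mean-pooling the encoder states and using a single linear layer before the softmax over at most 100 candidate lengths are assumptions, as the text only specifies a small network plus a softmax classifier on the last encoder layer.

```python
# Length-prediction sketch: summarize the encoder output and classify the target length L.
import torch
import torch.nn as nn

class LengthPredictor(nn.Module):
    def __init__(self, d_model=512, max_len=100):
        super().__init__()
        self.proj = nn.Linear(d_model, max_len)

    def forward(self, encoder_out):               # (batch, N, d_model)
        pooled = encoder_out.mean(dim=1)          # sentence-level summary (assumed pooling)
        logits = self.proj(pooled)                # one class per candidate length
        return torch.softmax(logits, dim=-1)      # P(L | S), L in {1..100}

# At test time the predicted length assists inference; during training the
# reference target length is used instead, as stated above.
```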
The pyramid semi-autoregressive decoder in step 2 mainly comprises four parts, in order: a rich semantic embedding layer, an unconstrained mask attention (RelaxedMasked-Attention), an unconstrained mask encoder-decoder attention (RelaxedEncoder-Decoder attention), and a feed-forward layer (feed-forward).
With reference to FIG. 3, the invention mines dynamic information: the dark arrows represent thumb-joint movement information and the light arrows represent elbow-joint movement information. The rich semantic embedding layer learns features from the sign language pose sequence extracted in step 1. In the three-dimensional coordinate system, the position $p_u^t$ and the velocity $v_u^t$ represent the dynamic information of joint u in the t-th frame, and the joint velocity and position information between adjacent frames are merged into the joint set of the skeleton sequence, so the dynamic information is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\}, \qquad v_u^t = p_u^t - p_u^{t-1}.$$
At the rich semantic embedding layer, the position and velocity information are mapped into the same vector space using a two-layer fully connected network (FC) and a ReLU activation function:
$$\hat{p}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 p_u^t + b_1) + b_2\big),$$
$$\hat{v}_u^t = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 v_u^t + b_1) + b_2\big),$$
where $\hat{p}_u^t$ denotes the encoded position information of joint u at the t-th frame and $\hat{v}_u^t$ the encoded velocity information of joint u at the t-th frame; $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer, $w_2$ and $b_2$ are those of the second layer, the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the pose-sequence order information.

After the rich semantic embedding layer, the sign language pose sequence T is represented as
$$\hat{T} = \{\, (\hat{p}_u^t, \hat{v}_u^t) \,\},$$
the set of encoded position information and velocity information.
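The sketch below illustrates one way to realize the rich semantic embedding: velocities are differences of consecutive 3D joint positions, and position and velocity are mapped by two-layer FC networks into the same model space. Flattening the joints of a frame into one vector and summing the two encodings (rather than concatenating them) are assumptions.

```python
# Rich semantic embedding sketch: encode per-frame joint positions and velocities
# into the same d_model space before the decoder (positional encoding added later).
import torch
import torch.nn as nn

class RichSemanticEmbedding(nn.Module):
    def __init__(self, num_joints=50, d_model=512):
        super().__init__()
        in_dim = num_joints * 3
        self.pos_fc = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))
        self.vel_fc = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                    nn.Linear(d_model, d_model))

    def forward(self, poses):                      # (batch, M, num_joints, 3)
        B, M = poses.shape[:2]
        vel = poses.clone()
        vel[:, 1:] = poses[:, 1:] - poses[:, :-1]  # v_u^t = p_u^t - p_u^{t-1}
        vel[:, 0] = 0.0                            # no previous frame at t = 0
        p = self.pos_fc(poses.reshape(B, M, -1))
        v = self.vel_fc(vel.reshape(B, M, -1))
        return p + v                               # fused frame embedding (assumed sum)
```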
in the pyramid semi-autoregressive decoder in the step 2, the grade d is divided into pyramid structures gradually and finely, and the model generates the target sign language sequence in an autoregressive mode of global series and local parallel. And taking the sign language attitude and the spoken sentence of the t-d frame as input through the output of the encoder, and outputting the predicted sign language attitude probability. The structure is similar to the encoder and is also composed of multiple layers. In addition to the two sublayers in each encoder layer, the decoder also adds a third sublayer, relaxedEncoder-decoding. In RelaxedEncoder-decoder, the output (K, V) of each encoder interacts with the input Q of all the decoders, causing the decoder to focus on the appropriate location in the encoder output. Thus, the middle attribute is different from self-attribute, with K, V from the encoder and Q from the output of the last position decoder.
By setting the unconstrained (relaxed) mask attention, the invention realizes a pyramid semi-autoregressive mechanism that avoids the high latency of the autoregressive model while also avoiding the drop in generation quality that a non-autoregressive model suffers from discarding the dependencies within the target sequence. To balance inference speed and generation quality, the pyramid semi-autoregressive model is built in a coarse-to-fine manner, setting four different group sizes in the layers of the decoder: d = 8, d = 4, d = 2, d = 1. Note that when d = 1 the layer is effectively equivalent to an autoregressive model. Fusing the long-range semantic information attended to by the lower layers with the short-range global semantic information captured at the top layer allows the decoder to maintain long-term dependencies while decoding more efficiently.
The semi-autoregressive decoder with a pyramid structure maintains the global autoregressive characteristic and the local non-autoregressive characteristic through the unconstrained mask attention (RelaxedMasked-Attention) mechanism. $\hat{T}$ is divided into $\lceil M/d \rceil$ groups, where each group contains d frames, so the conditional probability can be expressed as
$$P(\hat{T} \mid \hat{S}) = \prod_{k=1}^{\lceil M/d \rceil} P\big(\mathrm{Group}_k \mid \mathrm{Group}_{<k}, \hat{S}\big),$$
where P is the conditional probability of generating the target sign language poses $\hat{T}$ from the spoken sentence $\hat{S}$, and G denotes the grouping of the target sign language pose sequence, specifically
$$\mathrm{Group}_k = \big(t_{(k-1)d+1}, \ldots, t_{kd}\big), \qquad k = 1, \ldots, \lceil M/d \rceil,$$
where M is the total number of frames of the video and d is the number of frames contained in each group.
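The following sketch shows group-by-group (semi-autoregressive) inference consistent with the factorization above; the `decoder` and `pose_head` callables and the use of copies of the last frame as placeholders for the new group are assumptions, not the patent's exact implementation.

```python
# Semi-autoregressive decoding sketch: groups are produced sequentially, and the
# d frames inside each group are predicted by a single decoder call in parallel.
import torch

@torch.no_grad()
def semi_autoregressive_decode(decoder, pose_head, encoder_out, pred_len, d, start_pose):
    generated = start_pose                          # (batch, 1, pose_dim) start frame
    num_groups = (pred_len + d - 1) // d            # ceil(M / d) groups
    for _ in range(num_groups):
        # append d placeholder positions for the next group (copies of the last frame)
        placeholders = generated[:, -1:].repeat(1, d, 1)
        inp = torch.cat([generated, placeholders], dim=1)
        hidden = decoder(inp, encoder_out)          # relaxed mask lets the group see itself
        new_group = pose_head(hidden[:, -d:])       # d frames predicted in parallel
        generated = torch.cat([generated, new_group], dim=1)
    return generated[:, 1:1 + pred_len]             # drop start token, trim to length
```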
With reference to FIG. 4, the mask mechanism in the unconstrained mask attention prevents the leakage of context information and keeps the model from peeking at the sequence to be predicted in advance. Unlike standard self-attention, the invention builds a self-attention mechanism with a relaxed mask. The autoregressive property is maintained globally while the pyramid semi-autoregressive decoder generates in parallel within a group, attending to the pose frames in the previous groups and within the current group. The tokens of future frames to be masked are set to the invalid feature $-\infty$, so the positions to be masked have $-\infty$ added and their output after softmax is 0. When the RelaxedMasked-Attention model predicts the pose frames in $\mathrm{Group}_k$, it can see all the frame information of $\mathrm{Group}_1$ to $\mathrm{Group}_k$. Given the target number of frames M and the division level d, the relaxed mask is defined as
$$\mathrm{relaxed\_mask}[i][j] = \begin{cases} 0, & \lceil (j+1)/d \rceil \le \lceil (i+1)/d \rceil \\ -\infty, & \text{otherwise} \end{cases}$$
where relaxed_mask is a two-dimensional matrix with side length M and d is the number of frames contained in each group. The RelaxedMasked-Attention with residual connection is therefore defined as
$$\mathrm{RelaxedMaskedAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top} + \mathrm{relaxed\_mask}}{\sqrt{d_k}}\right)V,$$
where the division by $\sqrt{d_k}$ again keeps the variance controlled. The output of the sign language video generation network model first passes through a Linear layer and then a Softmax layer; the Linear layer yields logits, the Softmax layer converts them into probability values, and the maximum softmax value gives the sign language pose at the current moment.
The darker color in the matrix of FIG. 4 is the- ∞ portion representing masked information, the lighter color represents unmasked information, the ordinate represents the position of the target vocabulary and the abscissa represents viewable position. This matrix is applied to each sequence to achieve a semi-autoregressive effect.
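A small sketch of how such a relaxed mask can be built, matching the definition above (0 within the current and earlier groups, minus infinity for later groups):

```python
# Relaxed mask sketch: position i may attend to every frame in its own group and
# in all earlier groups; later groups get -inf so their softmax weight becomes 0.
import torch

def relaxed_mask(M, d):
    group = torch.arange(M) // d                         # group index of each frame
    visible = group.unsqueeze(1) >= group.unsqueeze(0)   # row i sees column j iff group_j <= group_i
    mask = torch.zeros(M, M)
    mask[~visible] = float("-inf")
    return mask                                          # added to QK^T before softmax

# Example: M = 6 frames, groups of d = 2 -> frames 0-1 see only their own group,
# frames 2-3 additionally see frames 0-1, and so on.
print(relaxed_mask(6, 2))
```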
To obtain the optimal pyramid semi-autoregressive model, the invention sets grouping levels of different sizes and compares them with a light two-layer (N = 2) autoregressive Transformer. d = {2,2} denotes two decoder layers that both use the relaxed masked attention mechanism with d = 2. The relationship between the coarse-to-fine grouping level and the inference delay is shown in FIG. 5(b): the grouping level and the inference time are inversely proportional, and the inference delay shortens as d gradually increases. Table 1 summarizes the speed and accuracy of the pyramid semi-autoregressive model when generating the target sign language under different configurations. The results show that the BLEU score of the predicted sign language sequence gradually decreases as d increases. With d = {2,2}, the decoding speed of the pyramid semi-autoregressive model is 1.90 times faster than the Transformer while the BLEU score drops by only 0.09; with d = {8,8}, the pyramid semi-autoregressive model achieves a 7.32x speed-up while the BLEU score drops by 4.84.
TABLE 1 Ablation experiments of the pyramid semi-autoregressive model on the RWTH-PHOENIX-Weather 2014T data set
Table 1 shows the results of the ablation experiments of the pyramid semi-autoregressive model on the sign language synthesis task on the RWTH-PHOENIX-Weather 2014T data set; a coarse-to-fine pyramid of grouping levels further improves over any single configuration. d = {1,1,1} indicates that the division level of all layers is 1, which is in fact equivalent to an autoregressive Transformer. d = {8,4,2,1} means that the bottom layer uses relaxed masked attention with d = 8, the second layer uses d = 4, the third layer uses d = 2, and so on until the top layer, which outputs in an autoregressive manner. d = {2,2,2} and d = {8,6,4,1} are comparable in BLEU score, but the acceleration ratio of the latter is 3.60x, far exceeding the former's 1.84x. The invention also experimented with d = {8,6,4,2}: its acceleration ratio is 4.75x, but it sacrifices too much BLEU, falling 2.48 BLEU below the d = {8,4,2,1} configuration. Therefore, the invention sets d = {8,4,2,1} as the grouping level of the optimal pyramid structure.
In step 3, the model is trained and the mapping relation between the spoken sentences and the sign language actions is learned. The invention uses the following hyper-parameter configuration during training: embedding dimension = 512, hidden dimension = 512, feed-forward dimension = 2048, number of heads = 8, number of layers = 4, batch size = 16, dropout = 0.1. Adam is used as the optimizer, the total number of iterations is 40k, and in particular a cosine annealing schedule with warm-up is used for the learning rate lr:
$$lr = lr_{\min} + \frac{1}{2}\,(lr_{\max} - lr_{\min})\left(1 + \cos\!\left(\frac{step - step_{warmup}}{step_{total} - step_{warmup}}\,\pi\right)\right)$$
The maximum initial learning rate $lr_{\max}$ is set to 1e-3 and the minimum learning rate $lr_{\min}$ to 1e-4. $step_{warmup}$ denotes the total number of warm-up steps, and a progressive warm-up strategy is applied for the first 100 steps. $step_{total}$ equals the total number of epochs times the total number of training samples, divided by the set batch size.
In order to accelerate the gradient-descent convergence of the pyramid semi-autoregressive model and obtain a model with low generalization error, the knowledge learned by a pre-trained autoregressive model is transferred into the pyramid semi-autoregressive model for initialization: all parameters of the text feature encoder and the rich semantic embedding layer and part of the parameters of the pyramid semi-autoregressive decoder are copied, while the remaining parameters are randomly initialized. This accelerates convergence, avoids the gradient vanishing or explosion caused by missing or improper initialization, and slightly improves the accuracy of sign language synthesis.
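A minimal sketch of such an initialization, copying every parameter whose name and shape match between the pre-trained autoregressive checkpoint and the semi-autoregressive model (the module naming is an assumption):

```python
# Transfer-initialization sketch: copy matching parameters, keep the rest random.
import torch

def init_from_autoregressive(semi_ar_model, ar_checkpoint_path):
    # assumes the checkpoint stores a plain state_dict of the autoregressive model
    ar_state = torch.load(ar_checkpoint_path, map_location="cpu")
    own_state = semi_ar_model.state_dict()
    transferred = {k: v for k, v in ar_state.items()
                   if k in own_state and own_state[k].shape == v.shape}
    own_state.update(transferred)
    semi_ar_model.load_state_dict(own_state)
    return sorted(transferred)          # names of the copied parameters, for logging
```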
After the training of the model in step 3 is finished, the model outputs the probability distribution of the sign language pose at each moment, and finally the spoken sentence is translated end to end into a personalized sign language video expressed in human skeleton and graphical form.
The invention verifies the effectiveness of the model for sign language synthesis on two public sign language data sets. RWTH-PHOENIX-Weather 2014T records sign language video commentary of daily news and weather forecasts from the German public television station PHOENIX; it contains 8257 video samples, with a vocabulary of 2887 words combined into 5356 continuous sentences related to weather forecasts. The invention also trains and evaluates the model on the Chinese sign language data set (CSL), which has 100 sentences and 5000 continuous sign language videos in total, with each sentence containing 4-8 words on average.
The invention uses a Transformer-based sign language translation model as a back-translation evaluation method to evaluate the accuracy of the sign language synthesis results: the input of the sign language translation model is changed from sign language video frames to the sign language pose sequence, and a back-translation evaluation model is trained. Model scores are reported with standard indicators, including BLEU-1/4 and ROUGE, which measure translation quality in terms of precision and recall respectively, and DTW, which measures the similarity between the ground truth and the predicted sequence.
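For reference, a minimal DTW sketch that measures the similarity between a predicted pose sequence and the ground truth; the Euclidean frame distance and the length normalization are assumptions, as the text does not specify the exact DTW variant used.

```python
# Dynamic-time-warping sketch over flattened pose frames.
import numpy as np

def dtw_distance(pred, gt):
    """pred: (Tp, D), gt: (Tg, D) flattened pose sequences."""
    Tp, Tg = len(pred), len(gt)
    cost = np.full((Tp + 1, Tg + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tp + 1):
        for j in range(1, Tg + 1):
            d = np.linalg.norm(pred[i - 1] - gt[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[Tp, Tg] / max(Tp, Tg)                     # length-normalized alignment cost
```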
In order to further demonstrate the effectiveness of the sign language video generation method and device based on the improved Transformer model, they are compared with other deep-learning-based sign language video generation algorithms, including an autoregressive Transformer and a non-autoregressive Transformer, on the RWTH-PHOENIX-Weather 2014T data set and the Chinese sign language data set (CSL). Table 2 summarizes the sign language generation results on the RWTH-PHOENIX-Weather 2014T data set.
As shown in Table 2, combining the autoregressive Transformer with the rich semantic embedding layer (SR) of the invention improves precision overall: the BLEU-1 and ROUGE scores increase by 0.15 and 0.38 respectively, illustrating the importance of combining position and motion features. A non-autoregressive Transformer is also set up as a comparison to explore the influence of the decoder's regression mode on precision, and Table 2 shows that the non-autoregressive Transformer greatly sacrifices the precision of the generated sign language video. In addition, comparing the autoregressive Transformer with the pyramid semi-autoregressive Transformer, although the accuracy score on the validation set drops slightly, the BLEU-1 and ROUGE scores on the test set improve by 0.68 and 0.79 respectively, verifying that the pyramid semi-autoregressive model effectively relieves error accumulation during decoding. Finally, the results show that the pyramid semi-autoregressive Transformer with rich semantic embedding achieves the best precision: compared with the autoregressive Transformer, it improves the BLEU score by 0.91 and the ROUGE score by 1.42 on the test set.
TABLE 2 Comparison of accuracy on the RWTH-PHOENIX-Weather 2014T data set
As shown in Table 3, two factors affect the inference efficiency of the model: the time complexity of the decoder's regression mode and the length of the target sequence. In the Transformer model, inference advances only one time step at a time; the time complexity of this step is denoted O(ts). For the probability distribution obtained in the inference phase, the invention uses greedy search to take the maximum-probability pose from the probability distribution of each sign language frame, with time complexity denoted O(gs). The time complexity of each model is shown in Table 3, where L represents the target sign language sequence length, N represents the number of decoder layers, and Σd* represents the sum of the group sizes in the pyramid structure. As shown in FIG. 5(a), the inference delay of the autoregressive model increases linearly with the predicted length, while the parallel nature of the non-autoregressive Transformer makes its inference delay independent of the target length, increasing the speed by 18.4 times. Without affecting precision, the pyramid semi-autoregressive Transformer improves the inference speed by 3.6 times.
TABLE 3 Inference delay, model acceleration ratio and time complexity on the RWTH-PHOENIX-Weather 2014T data set
FIG. 5 is a relationship between the inference delay and the predicted sign language length under different model configurations. Randomly selecting 20 sign language sequences with the length within the range of [31,163] for prediction, and obtaining experimental results of autoregressive transformers, non-autoregressive transformers and pyramid semi-autoregressive transformers in the step (a) in FIG. 5; fig. 5 (b) on the right shows the experimental results in the case of the pyramidal semi-autoregressive transform in the single configurations d =1, 2, 4, 8, respectively. In order to simplify the calculation amount, two layers of decoders are selected for experiment.
In order to show the performance of the method more intuitively, the sign language pose sequences generated by different models are visualized on RWTH-PHOENIX-Weather-2014T and on the Chinese sign language data set CSL. As shown in FIG. 6, 10 frames of the sign language sequence for "our national folk richness" are sampled from left to right for comparison, where each column shows the pose frame generated by a different model at a given moment. The first row is the result generated by the autoregressive Transformer: at frames $t_8$ and $t_9$ it deviates significantly from the ground truth (GT) and even distorts, owing to the error accumulation caused by the excessive dependency between frames. The second row is generated by the non-autoregressive Transformer, which loses frame $t_5$ and produces repeated frames at $t_7$, $t_8$ and $t_9$, because it discards the correlation between sign language pose frames entirely and outputs every moment in parallel. The third row is the more reasonable and natural sign language pose sequence generated by the pyramid semi-autoregressive Transformer of the present invention.
FIG. 7 shows the visualization on the RWTH-PHOENIX-Weather-2014T data set. In frames $t_8$ and $t_9$, the autoregressive Transformer produces unrealistic poses and the non-autoregressive Transformer loses frames. In contrast, the improved Transformer model of the present invention yields a more stable sign language pose sequence. This shows that the invention exploits the properties of semi-autoregressive decoding, avoiding the disadvantages of both the autoregressive and the non-autoregressive Transformer.

Claims (9)

1. A sign language video generation method based on an improved Transformer model is characterized by comprising the following steps:
a. extracting a two-dimensional skeleton sequence of the target sign language poses in the target sign language video using OpenPose, keeping 8 upper-body joints and 21 joints of each of the left and right hands for model training; lifting the two-dimensional data representing the sign language poses to three-dimensional data, and cleaning the skeleton information of abnormal and erroneous joints by observing the distribution of the three-dimensional data, to form the target sign language pose sequence;
b. inputting the spoken sentence and the target sign language gesture sequence into an encoder-decoder model, and training the encoder-decoder model to establish a mapping relation between the spoken sentence and the target sign language gesture sequence; after the mapping relation is established, a trained sign language video generation network model is formed;
c. the trained sign language video generation network model is used for processing the input spoken sentences, the output of the model is the probability distribution of the sign language corresponding to each moment, and finally the spoken sentences are translated into the personalized sign language video expressed in human skeleton and graphic format from end to end.
2. The method of claim 1, wherein the coder-decoder model comprises a text feature coder with sign language length prediction and a pyramidal semi-autoregressive decoder with a rich semantic embedding layer.
3. The sign language video generating method of claim 2, wherein in the step b, the coder-decoder model is trained by inputting the spoken sentence into a text feature coder to learn semantic features and transmitting the learned semantic features to a pyramid semi-autoregressive decoder, and adding a convolutional neural network and a softmax classifier to the last layer of the coder to predict the sign language length; inputting the target sign language attitude sequence into a pyramid semi-autoregressive decoder for extracting space-time characteristics, and decoding the target sign language sequence in a semi-autoregressive mode by introducing a Relaxedmasked-attention mechanism; and establishing a mapping relation between the spoken sentence and the sign language action through model training.
4. A sign language video generating method as claimed in claim 3, wherein in step b, the extraction of the spatio-temporal features is to encode the information on the time dimension and the spatial displacement into the same space as the input of the model; grouping the target sign language attitude sequence by a pyramid semi-autoregressive decoder, keeping the cascade characteristic among groups, and generating a target frame in parallel in each group;
a spoken sentence containing N words is represented as $S = (s_1, \ldots, s_N)$;
the target sign language pose sequence is represented as $T = (t_1, \ldots, t_M)$;
wherein $s_i$ is the $i$-th word in the spoken sentence, N is the number of words in the spoken sentence, $t_i$ is the sign language pose of the $i$-th frame, and M is the number of video frames;
the goal is to fit a parametric model that maximizes the conditional probability $P(T \mid S)$ for translating text into a sign language pose sequence;
the joint set that incorporates the joint velocity information between adjacent frames into the skeleton sequence is represented as
$$J = \{\, (p_u^t, v_u^t) \mid u = 1, \ldots, U;\ t = 1, \ldots, M \,\},$$
wherein $p_u^t$ is the three-dimensional coordinate information of joint u at the t-th frame, and $v_u^t = p_u^t - p_u^{t-1}$ is the velocity information of joint u at the t-th frame, obtained by subtracting the three-dimensional coordinates of the (t-1)-th frame from those of the t-th frame;
the length of the target sign language pose sequence is L, $P(L \mid S)$ is modeled separately, and the maximum value of L is set to 100.
5. A method as claimed in claim 4, wherein the text feature encoder comprises a plurality of layers having the same structure but different training parameters, each layer comprising two sub-layers, namely a multi-head attention mechanism and a position-wise fully connected feed-forward network; each sub-layer uses a residual connection and layer normalization to ensure that the gradient is not 0, mitigating gradient vanishing, and the output of each sub-layer is represented as
$$\mathrm{LayerNorm}\big(\hat{S} + \mathrm{Sublayer}(\hat{S})\big),$$
wherein $\hat{S}$ is the feature vector obtained by encoding the source spoken sentence S through the word embedding layer;
the word embedding layer uses a two-layer fully connected network (FC) and a ReLU activation function, where $w_1$ and $b_1$ are the weight matrix and bias term of the first fully connected layer and $w_2$ and $b_2$ are those of the second layer; the weight matrices are multiplied by the input vector, the biases are added, and a positional encoding module is introduced to keep the word-order information:
$$\hat{S}_n = \mathrm{PositionalEncoding}\big(w_2\,\mathrm{ReLU}(w_1 S_n + b_1) + b_2\big),$$
wherein $S_n$ represents the n-th word in the sentence.
6. The method of claim 5, wherein the multi-head attention mechanism projects the query (Q), key (K) and value (V) through h different linear transformations and finally concatenates the different attention results; the multi-head attention mechanism is expressed as
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O \qquad (4)$$
wherein Q, K and V are each obtained by multiplying $\hat{S}$ by a trainable parameter matrix, $\mathrm{head}_i$ denotes the i-th head, and $W^O$ is a weight matrix obtained from a fully connected layer that participates in model training together with the other parameters;
the i-th head is computed as
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V) \qquad (5)$$
wherein $W_i^Q$, $W_i^K$ and $W_i^V$ are the three trainable parameter matrices corresponding to Q, K and V; Attention is the scaled dot-product attention, where the higher the similarity of two vectors, the larger the dot product and the more attention the model pays, computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right)V,$$
wherein the division by $\sqrt{d_k}$ keeps the variance controlled at 1; the Softmax function is a normalized exponential function such that each element lies in (0, 1) and the sum of all elements is 1.
7. The sign language video generation method of claim 6, wherein the rich semantic embedding layer maps position and velocity information to the same vector space using two layers of fully connected networks (FC) and ReLU activation functions:
p̂_u^t = w_2 ReLU(w_1 p_u^t + b_1) + b_2    (7)

v̂_u^t = w_2 ReLU(w_1 v_u^t + b_1) + b_2    (8)

wherein p_u^t and v_u^t are the position (three-dimensional coordinate) information and the velocity information of the joint u at the t-th frame; p̂_u^t denotes the encoded position information of the joint u at the t-th frame; v̂_u^t denotes the encoded velocity information of the joint u at the t-th frame; w_1 is the weight matrix of the first fully connected layer, b_1 is the bias term of the first fully connected layer, w_2 is the weight matrix of the second fully connected layer, and b_2 is the bias term of the second fully connected layer; the weight matrix is multiplied by the input vector and the bias is added, and a position encoding module is introduced to keep the pose sequence order information.
After the rich semantic embedding layer, the sign language pose sequence T is represented as T̂, expressed as:

T̂ = {ĥ^1, ĥ^2, ..., ĥ^M}    (9)

wherein ĥ^t is the collection of the encoded position information and the encoded velocity information at the t-th frame.
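For illustration only, a sketch of a rich-semantic-embedding module mapping per-frame position and velocity into a shared space; flattening the joints into one vector per frame and fusing the two encodings by addition are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class RichSemanticEmbedding(nn.Module):
    """Encode per-frame position and velocity into one d_model-dimensional space.

    Sketch only: joints are flattened into a single vector per frame, and the
    encoded position and velocity are fused by addition - both are assumptions.
    """
    def __init__(self, in_dim: int, d_model: int = 512):
        super().__init__()
        self.pos_mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))
        self.vel_mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model))

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (batch, M, in_dim); velocity is the frame-to-frame difference,
        # with the first frame's velocity set to zero.
        vel = positions[:, 1:] - positions[:, :-1]
        vel = torch.cat([torch.zeros_like(positions[:, :1]), vel], dim=1)
        return self.pos_mlp(positions) + self.vel_mlp(vel)

# 2 clips, 100 frames, 50 joints x 3 coordinates flattened per frame (illustrative sizes).
rse = RichSemanticEmbedding(in_dim=150, d_model=512)
print(rse(torch.randn(2, 100, 150)).shape)  # torch.Size([2, 100, 512])
```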
The semi-autoregressive decoder with a pyramid structure preserves globally autoregressive and locally non-autoregressive characteristics through a Relaxed Masked-Attention mechanism: the target sign language pose sequence T̂ is divided into ⌈M/d⌉ groups, where each group contains d frames, and the conditional probability can be expressed as:
P(T̂|S) = ∏_{k=1}^{⌈M/d⌉} P(Group_k | Group_1, ..., Group_{k−1}, S)    (10)

wherein P(T̂|S) is the conditional probability of generating the target sign language pose sequence T̂ from the spoken sentence S; G denotes the grouping operation that divides the target sign language pose sequence into these groups, and the specific division is as follows:

Group_k = {ĥ^{(k−1)d+1}, ĥ^{(k−1)d+2}, ..., ĥ^{min(kd, M)}},  k = 1, ..., ⌈M/d⌉    (11)

wherein M is the total number of frames in the video and d is the number of frames contained in each group.
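For illustration only, a small helper showing the grouping of equation (11): M frames are divided into ⌈M/d⌉ consecutive groups of d frames (the last group may be shorter); decoding then proceeds group by group (autoregressive across groups) while the frames inside a group are produced together (non-autoregressive within a group). Zero-based indices are used here as a convenience of the sketch:

```python
import math

def split_into_groups(num_frames: int, d: int):
    """Divide frame indices (0-based here) into ceil(M/d) consecutive groups of d frames;
    the last group may be shorter when M is not a multiple of d."""
    num_groups = math.ceil(num_frames / d)
    return [list(range(k * d, min((k + 1) * d, num_frames))) for k in range(num_groups)]

# M = 10 frames, d = 4 frames per group -> 3 groups.
# Decoding would visit these groups in order (autoregressive across groups)
# while generating the frames inside each group together.
print(split_into_groups(10, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```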
8. The sign language video generating method as claimed in claim 7, wherein, when the model predicts the pose frames within Group_k through the Relaxed Masked-Attention mechanism, it can see all of the frame information of Group_1 to Group_k; given the target number of frames M and the division level d, the relaxed_mask is defined as follows:

relaxed_mask[i][j] = 0 if ⌈j/d⌉ ≤ ⌈i/d⌉, and −∞ otherwise    (12)

wherein relaxed_mask is a two-dimensional matrix with side length M, and d is the number of frames contained in each group;
the Relaxedmasked-attention with residual concatenation is defined as follows:
Figure FDA00037444419800000316
wherein is divided by
Figure FDA00037444419800000317
Keeping variance control at 1; the Softmax function is a normalized exponential function.
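For illustration only, a sketch of building the relaxed mask of equation (12) as an additive attention mask: positions allowed by the grouping rule get 0 and blocked positions get −∞, so that Softmax assigns them zero weight; the exact numerical encoding of the mask is an assumption of this sketch:

```python
import torch

def build_relaxed_mask(M: int, d: int) -> torch.Tensor:
    """Additive M x M attention mask for the grouped (semi-autoregressive) decoder.

    Frame i (row) may attend to frame j (column) when j belongs to the same
    group as i or to an earlier group; blocked positions are set to -inf so
    that Softmax gives them zero weight.
    """
    groups = torch.arange(M) // d                          # group index of each frame
    allowed = groups.unsqueeze(1) >= groups.unsqueeze(0)   # row group >= column group
    mask = torch.zeros(M, M)
    mask[~allowed] = float("-inf")
    return mask

print(build_relaxed_mask(6, 2))
# Rows 0-1 see frames 0-1, rows 2-3 see frames 0-3, rows 4-5 see all six frames.
```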
9. The method for generating a sign language video according to claim 8, wherein the output of the sign language video generation network passes first through a Linear layer and then through a Softmax layer; the Linear layer produces logits, the Softmax layer converts the logits into probability values, and the pose with the maximum Softmax value is taken as the pose corresponding to the current moment.
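For illustration only, a sketch of the output head of claim 9: a Linear layer produces logits, Softmax turns them into probabilities, and the argmax is taken as the pose for the current moment; the dimensions and the discrete pose vocabulary are assumptions of this sketch:

```python
import torch
import torch.nn as nn

# Output head sketch: the decoder feature of the current moment goes through a
# Linear layer to produce logits, Softmax turns the logits into probabilities,
# and the index with the maximum probability is taken as the current pose.
# d_model and num_poses are illustrative values, not taken from the claims.
d_model, num_poses = 512, 256
head = nn.Linear(d_model, num_poses)

decoder_output = torch.randn(1, d_model)   # one decoder time step
logits = head(decoder_output)              # logits from the Linear layer
probs = torch.softmax(logits, dim=-1)      # probabilities from the Softmax layer
pose_id = probs.argmax(dim=-1)             # pose corresponding to the current moment
print(pose_id.shape)                       # torch.Size([1])
```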
CN202210821012.5A 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model Pending CN115393948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210821012.5A CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210821012.5A CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Publications (1)

Publication Number Publication Date
CN115393948A true CN115393948A (en) 2022-11-25

Family

ID=84117415

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210821012.5A Pending CN115393948A (en) 2022-07-13 2022-07-13 Sign language video generation method based on improved Transformer model

Country Status (1)

Country Link
CN (1) CN115393948A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115796144A (en) * 2023-02-07 2023-03-14 中国科学技术大学 Controlled text generation method based on fixed format
CN115796144B (en) * 2023-02-07 2023-04-28 中国科学技术大学 Controlled text generation method based on fixed format
CN116152299A (en) * 2023-04-21 2023-05-23 之江实验室 Motion state detection method and device, storage medium and electronic equipment
CN116778576A (en) * 2023-06-05 2023-09-19 吉林农业科技学院 Time-space diagram transformation network based on time sequence action segmentation of skeleton

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN112613303B (en) Knowledge distillation-based cross-modal image aesthetic quality evaluation method
CN115393948A (en) Sign language video generation method based on improved Transformer model
CN111325323B (en) Automatic power transmission and transformation scene description generation method integrating global information and local information
CN110288665A (en) Image Description Methods, computer readable storage medium based on convolutional neural networks, electronic equipment
CN113158875A (en) Image-text emotion analysis method and system based on multi-mode interactive fusion network
CN110929092A (en) Multi-event video description method based on dynamic attention mechanism
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN101187990A (en) A session robotic system
CN111985205A (en) Aspect level emotion classification model
CN110347831A (en) Based on the sensibility classification method from attention mechanism
WO2020177214A1 (en) Double-stream video generation method based on different feature spaces of text
CN113780059B (en) Continuous sign language identification method based on multiple feature points
CN113609922B (en) Continuous sign language sentence recognition method based on mode matching
Tuyen et al. Conditional generative adversarial network for generating communicative robot gestures
CN114266905A (en) Image description generation model method and device based on Transformer structure and computer equipment
CN116863920B (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
CN111079661B (en) Sign language recognition system
CN112883167A (en) Text emotion classification model based on hierarchical self-power-generation capsule network
CN117473561A (en) Privacy information identification system, method, equipment and medium based on artificial intelligence
CN117093692A (en) Multi-granularity image-text matching method and system based on depth fusion
CN111243060A (en) Hand drawing-based story text generation method
Balayn et al. Data-driven development of virtual sign language communication agents
CN116311493A (en) Two-stage human-object interaction detection method based on coding and decoding architecture
Xu et al. Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination