CN114048757A - Sign language synthesis method and device, computer equipment and storage medium

Info

Publication number
CN114048757A
CN114048757A (application CN202111432719.9A)
Authority
CN
China
Prior art keywords
semantic
sign language
attribute
user
semantic analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111432719.9A
Other languages
Chinese (zh)
Inventor
罗弋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111432719.9A priority Critical patent/CN114048757A/en
Publication of CN114048757A publication Critical patent/CN114048757A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G06F 40/35 Discourse or dialogue representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/55 Rule-based translation
    • G06F 40/56 Natural language generation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/205 3D [Three Dimensional] animation driven by audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application belongs to the technical field of speech processing in artificial intelligence, and relates to a sign language synthesis method, a sign language synthesis apparatus, a computer device, and a storage medium. The application also relates to blockchain technology: the user voice audio and the target sign language animation of a user can be stored in a blockchain. When the sound made by a conversation partner (i.e., the user voice audio) is received, the method performs a voice recognition operation that converts the audio into text to obtain the spoken content of the conversation partner, then performs a semantic analysis operation on the spoken content according to a semantic analysis model to obtain the real semantics the conversation partner intends to express, performs sign language serialization according to the real semantics, and finally performs sign language motion synthesis and outputs a target sign language animation through a preset three-dimensional authoring engine, so that hearing-impaired people can understand the semantic information conveyed by the conversation partner. Because sign language synthesis is performed according to the real semantics of the conversation partner, the accuracy of sign language synthesis can be effectively improved.

Description

Sign language synthesis method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech processing technology in artificial intelligence, and in particular, to a sign language synthesis method, apparatus, computer device, and storage medium.
Background
There are a large number of hearing-impaired people in China, and sign language is the language they use to communicate. Sign language expresses semantics through spatial motion and is a visual-spatial language. Computer-synthesized Chinese sign language video has a strong sense of realism and good acceptability, and its visual language expression interface is more vivid and lively; it can better serve hearing-impaired people, allowing hearing-impaired people and hearing people to jointly experience a rapidly developing civilization, and therefore has extremely broad social significance.
In the existing sign language synthesis method, a new sign language video is recomposed from a number of sign-language-word video clips according to text grammar rules and played to hearing-impaired people, thereby helping hearing-impaired people communicate smoothly.
However, the applicant finds that the traditional sign language synthesis method is generally not intelligent: the text grammar rules are fixed, while the language people use in communication is diverse and ambiguous, so the semantics a user actually expresses cannot be truly understood and the subsequent sign language synthesis is translated inaccurately. The traditional sign language synthesis method therefore has the problem of inaccurate synthesis.
Disclosure of Invention
An embodiment of the present application provides a sign language synthesis method, a sign language synthesis device, a computer device, and a storage medium, so as to solve the problem of inaccurate synthesis in the conventional sign language synthesis method.
In order to solve the above technical problem, an embodiment of the present application provides a sign language synthesis method, which adopts the following technical solutions:
acquiring a user voice audio to be translated;
performing voice recognition operation on the user voice audio to obtain a user voice text;
performing semantic analysis operation on the user voice text according to a semantic analysis model to obtain user semantic information;
performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information;
synthesizing the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation;
and outputting the target sign language animation.
In order to solve the above technical problem, an embodiment of the present application further provides a sign language synthesis apparatus, which adopts the following technical solutions:
the audio acquisition module is used for acquiring the voice audio of the user to be translated;
the voice recognition module is used for carrying out voice recognition operation on the user voice audio to obtain a user voice text;
the semantic analysis module is used for carrying out semantic analysis operation on the user voice text according to a semantic analysis model to obtain user semantic information;
the serialization module is used for performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information;
the motion synthesis module is used for carrying out sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation;
and the animation output module is used for outputting the target sign language animation.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
comprising a memory and a processor, the memory having computer readable instructions stored therein which, when executed by the processor, implement the steps of the sign language synthesis method as described above.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the sign language synthesis method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the application provides a sign language synthesis method, which comprises the following steps: acquiring a user voice audio to be translated; performing voice recognition operation on the user voice audio to obtain a user voice text; performing semantic analysis operation on the user voice text according to a semantic analysis model to obtain user semantic information; performing sign language serialization operation on the user semantic information according to the generated countermeasure network to obtain sign language sequence information; synthesizing the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation; and outputting the target sign language animation. When hearing-impaired people communicate with a conversation object, when receiving sound (namely user voice audio) sent by the conversation object, carrying out voice recognition operation of converting the audio into text on the sound to obtain the speaking content of the conversation object, carrying out semantic analysis operation on the speaking content according to a semantic analysis model to obtain the real semantic to be expressed by the conversation object, carrying out sign language serialization according to the real semantic, finally carrying out sign language action synthesis through a preset three-dimensional authoring engine and outputting a target sign language animation so that the hearing-impaired people can understand semantic information transmitted by the conversation object, carrying out sign language synthesis according to the real semantic of the conversation object, and effectively improving the accuracy of sign language synthesis.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flowchart of an implementation of a sign language synthesis method according to an embodiment of the present application;
FIG. 3 is a flowchart of a specific implementation of obtaining a semantic analysis model according to an embodiment of the present application;
FIG. 4 is a flowchart of one embodiment of step S302 of FIG. 3;
FIG. 5 is a flowchart of one embodiment of step S303 of FIG. 3;
FIG. 6 is a flowchart of one embodiment of step S304 of FIG. 3;
fig. 7 is a flowchart of a specific implementation of acquiring key text data according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a sign language synthesis apparatus according to a second embodiment of the present application;
fig. 9 is a schematic structural diagram of a specific implementation of obtaining a semantic analysis model according to the second embodiment of the present application;
FIG. 10 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III, mpeg compression standard Audio Layer 3), MP4 players (Moving Picture Experts Group Audio Layer IV, mpeg compression standard Audio Layer 4), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the sign language synthesis method provided in the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the sign language synthesis apparatus is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Continuing to refer to fig. 2, a flowchart of an implementation of a sign language synthesis method provided in an embodiment of the present application is shown, and for convenience of description, only the relevant portions of the present application are shown.
The sign language synthesis method comprises the following steps: step S201, step S202, step S203, step S204, step S205, and step S206.
Step S201: and acquiring the voice audio of the user to be translated.
In the embodiment of the application, the user voice audio to be translated refers to the audio content spoken by the conversation partner when a hearing-impaired person communicates with the conversation partner; because of the hearing-impaired person's hearing difficulty, this audio content cannot be recognized normally and therefore needs to be translated.
In the embodiment of the present application, the user voice audio to be translated may be collected in real time through an audio collection terminal, or may be obtained from data carrying the user voice audio content to be translated that is sent by a user terminal.
Step S202: and carrying out voice recognition operation on the voice audio of the user to obtain a voice text of the user.
In the embodiment of the present application, the voice recognition operation is mainly used to convert the collected user voice audio to be translated into text data. Specifically, the voice recognition operation may be implemented by a pattern matching method: in the training stage, a user speaks each word in the vocabulary in turn and the feature vector of each word is stored as a template in a template library; in the recognition stage, the feature vector of the input speech is compared with each template in the template library in turn, and the word with the highest similarity is output as the recognition result.
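As a minimal illustration of the template-matching idea described above, the following Python sketch compares the feature vector of the input speech against every stored word template and returns the best match; the feature extraction itself (e.g. MFCC computation) and the template library are assumed to already exist and are not described in the application.

    import numpy as np

    def recognize_word(features: np.ndarray, template_library: dict) -> str:
        """Return the vocabulary word whose stored template is most similar to
        the feature vector of the input speech (cosine similarity)."""
        best_word, best_score = None, -1.0
        for word, template in template_library.items():
            score = float(np.dot(features, template) /
                          (np.linalg.norm(features) * np.linalg.norm(template) + 1e-8))
            if score > best_score:
                best_word, best_score = word, score
        return best_word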
Step S203: and performing semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information.
In the embodiment of the present application, the semantic analysis model may be a model that performs semantic understanding analysis on the input Chinese text by using NLP (natural language processing) and Seq2Seq (Sequence to Sequence) models, so as to extract key semantic Chinese words; the semantic analysis model may also be a pre-trained deep recognition network model that obtains the real meaning of a target vocabulary by analyzing the associated text content. It should be understood that these examples of the semantic analysis model are given only for ease of understanding and are not intended to limit the application.
Step S204: and performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information.
In the embodiment of the present application, a Generative Adversarial Network (GAN) is a deep learning model that produces good output through the adversarial (game-like) learning of at least two modules in its framework: a generative model and a discriminative model. The original GAN theory does not require that G and D both be neural networks; it only requires functions that can respectively perform generation and discrimination.
In the embodiment of the application, the output result of the semantic understanding analysis module is serialized into 3D sign language action words. The core techniques of this step include the following (a minimal sketch follows the list):
1) converging Chinese synonyms and near-synonyms onto sign-language words;
2) converting Chinese word order and grammar into sign-language word order and grammar.
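A minimal sketch of the two techniques above, assuming a hand-written synonym table and a pluggable reordering function; in the application itself this mapping and reordering is what the generative adversarial network learns, so the names and tables below are purely illustrative.

    def to_sign_gloss_sequence(semantic_words, synonym_map, reorder):
        """Converge synonyms/near-synonyms onto sign-language words, then rewrite
        Chinese word order into sign-language word order."""
        glosses = [synonym_map.get(word, word) for word in semantic_words]
        return reorder(glosses)

    # Illustrative usage: a tiny synonym table and an identity reordering.
    synonym_map = {"爸爸": "父亲", "妈妈": "母亲"}
    sign_sequence = to_sign_gloss_sequence(["爸爸", "回家"], synonym_map, reorder=lambda g: g)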
Step S205: and synthesizing the sign language action of the sign language sequence information according to a preset three-dimensional authoring engine to obtain the target sign language animation.
In the embodiment of the application, the preset three-dimensional authoring engine is a real-time 3D interactive content authoring and operating platform serving creators in game development, art, architecture, automobile design, film and television, and other fields; specifically, the preset three-dimensional authoring engine turns creative ideas into reality by means of the Unity3D technology. The Unity platform provides a complete set of software solutions for authoring, operating and rendering any real-time interactive 2D and 3D content, and the supported platforms include mobile phones, tablets, PCs, game consoles, and augmented reality and virtual reality devices.
In the embodiment of the application, independent Chinese sign language motion 3D animation files are prefabricated, and each is mapped to a corresponding sign language motion ID. The sign language motion ID sequence is then converted into a continuous sign language 3D animation by real-time 3D motion synthesis in the Unity3D engine.
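The ID-mapping step can be pictured as a simple lookup from sign gloss to prefabricated motion-clip ID; the table below is hypothetical, and the actual stitching of clips into a continuous animation happens inside the Unity3D engine rather than in this sketch.

    # Hypothetical mapping from sign-language gloss to prefabricated 3D motion clip ID.
    GLOSS_TO_MOTION_ID = {"你": 101, "好": 102, "谢谢": 205}

    def to_motion_id_sequence(glosses, fallback_id=0):
        """Turn the sign gloss sequence into the motion-ID sequence that the
        three-dimensional authoring engine plays back as one continuous animation."""
        return [GLOSS_TO_MOTION_ID.get(g, fallback_id) for g in glosses]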
Step S206: and outputting the target sign language animation.
In an embodiment of the present application, a sign language synthesis method is provided, including: acquiring a user voice audio to be translated; performing a voice recognition operation on the user voice audio to obtain a user voice text; performing a semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information; performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information; performing sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation; and outputting the target sign language animation. When a hearing-impaired person communicates with a conversation partner and the sound made by the conversation partner (i.e., the user voice audio) is received, a voice recognition operation converting the audio into text is performed to obtain the spoken content of the conversation partner; a semantic analysis operation is performed on the spoken content according to the semantic analysis model to obtain the real semantics the conversation partner intends to express; sign language serialization is performed according to the real semantics; and finally sign language motion synthesis is performed by the preset three-dimensional authoring engine and the target sign language animation is output, so that the hearing-impaired person can understand the semantic information conveyed by the conversation partner. Because sign language synthesis is performed according to the real semantics of the conversation partner, the accuracy of sign language synthesis can be effectively improved.
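To make the flow of steps S201 to S206 concrete, the following sketch chains the stages end to end; all parameter and method names (speech_to_text, semantic_model, gan_serializer, unity_engine) are illustrative placeholders supplied by the caller, not identifiers from the application.

    def synthesize_sign_language(audio_bytes, speech_to_text, semantic_model,
                                 gan_serializer, unity_engine):
        """End-to-end flow of steps S201-S206 (names are illustrative placeholders)."""
        text = speech_to_text(audio_bytes)                  # S202: voice recognition
        semantics = semantic_model.analyze(text)            # S203: semantic analysis
        glosses = gan_serializer.serialize(semantics)       # S204: sign language serialization
        animation = unity_engine.compose(glosses)           # S205: sign language motion synthesis
        return animation                                    # S206: output the target sign language animation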
Continuing to refer to fig. 3, a flowchart of a specific implementation of obtaining a semantic analysis model according to an embodiment of the present application is shown, and for convenience of illustration, only the relevant portions of the present application are shown.
In some optional implementations of this embodiment, before step S203, the method further includes: step S301, step S302, step S303, step S304, step S305, and step S306.
Step S301: and reading the local database, obtaining the sample text from the local database, and determining each participle contained in the sample text.
In this embodiment of the present application, a plurality of texts may be obtained from the local database, and a training set formed by the obtained plurality of texts is determined, so that each text in the training set may be used as a sample text.
In this embodiment of the present application, when determining the participles included in the sample text, the sample text may first be subjected to word segmentation to obtain each participle it contains. Any word segmentation method may be adopted; of course, each character in the sample text may also be treated as a participle. It should be understood that this example of word segmentation is given only for ease of understanding and is not intended to limit the application.
Step S302: and determining a word vector corresponding to each participle based on the semantic analysis model to be trained.
In the embodiment of the present application, the semantic analysis model may include at least four layers: a semantic representation layer, an attribute representation layer, an attribute relevance representation layer, and a classification layer.
In the embodiment of the present application, the semantic representation layer includes at least a sub-model for outputting bidirectional semantic representation vectors, such as a BERT (Bidirectional Encoder Representations from Transformers) model. Each participle can be input into the semantic representation layer of the semantic analysis model, and the bidirectional semantic representation vector corresponding to each participle output by the semantic representation layer is obtained as the word vector corresponding to that participle. It should be understood that models for outputting bidirectional semantic representation vectors include models other than the BERT model described above; this example is given only for ease of understanding and is not intended to limit the application.
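One possible way to obtain such word vectors, assuming the Hugging Face transformers library and the public bert-base-chinese checkpoint (neither is named in the application), is sketched below.

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer("今天天气很好", return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # One bidirectional semantic representation vector per token, used as the word vectors.
    word_vectors = outputs.last_hidden_state[0]   # shape: (sequence_length, 768)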
Step S303: obtaining each semantic attribute from a local database, and determining a first feature expression vector of the sample text related to the semantic attributes according to an attention matrix corresponding to the semantic attributes and a word vector corresponding to each participle in a semantic analysis model to be trained.
In this embodiment of the present application, a word vector corresponding to each participle may be input to an attribute characterization layer in a semantic analysis model, the attention matrix corresponding to the semantic attribute included in the attribute characterization layer is used to perform attention weighting on the word vector corresponding to each participle, and a first feature expression vector of the sample text related to the semantic attribute is determined according to the word vector corresponding to each participle after the attention weighting.
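The attention weighting can be sketched as follows; the softmax-then-pool formulation is an assumption, since the application does not spell out the exact pooling, but it follows the description of weighting each word vector by the attribute's attention parameters.

    import torch

    def first_feature_vector(word_vectors: torch.Tensor,        # (n_words, d)
                             attribute_attention: torch.Tensor  # (d,), one vector per attribute
                             ) -> torch.Tensor:
        """Attention-weight the word vectors with the attribute's attention
        parameters and pool them into one first feature representation vector."""
        scores = torch.softmax(word_vectors @ attribute_attention, dim=0)  # (n_words,)
        return scores.unsqueeze(1).mul(word_vectors).sum(dim=0)            # (d,)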
Step S304: according to a self-attention matrix which is contained in the semantic analysis model to be trained and used for representing correlation among different semantic attributes and a first feature representation vector of the sample text related to each semantic attribute, a second feature representation vector of the sample text related to each semantic attribute is determined.
In the embodiment of the present application, the first feature representation vector of the sample text related to each semantic attribute may be input to the attribute relevance representation layer in the semantic analysis model, self-attention weighting is performed on these first feature representation vectors by the self-attention matrix included in the attribute relevance representation layer, and the second feature representation vector of the sample text related to each semantic attribute is determined according to the self-attention-weighted first feature representation vectors.
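A sketch of the self-attention weighting across attributes, under the same caveat that the softmax normalization is an assumption rather than a detail given in the application:

    import torch

    def second_feature_vectors(first_vectors: torch.Tensor,   # (n_attributes, d)
                               self_attention: torch.Tensor   # (n_attributes, n_attributes)
                               ) -> torch.Tensor:
        """Mix the per-attribute first feature vectors according to the learned
        attribute-correlation (self-attention) matrix."""
        weights = torch.softmax(self_attention, dim=-1)
        return weights @ first_vectors                         # (n_attributes, d)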
Step S305: and determining a classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector of each semantic attribute related to the sample text, wherein the classification result comprises the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute.
In the embodiment of the application, the classification layer at least comprises a hidden layer, a full connection layer and a softmax layer.
In the embodiment of the application, the second feature representation vectors of the sample texts related to each semantic attribute can be sequentially input into the hidden layer, the full-link layer and the softmax layer in the classification layer, and the sample texts are classified according to the classification parameters corresponding to each semantic attribute contained in each second feature representation vector and the hidden layer, the full-link layer and the softmax layer of the classification layer, so that the classification result output by the classification layer is obtained.
In the embodiment of the present application, the classification result at least includes the semantic attribute to which the sample text belongs and the emotion polarity corresponding to the semantic attribute to which the sample text belongs.
In the embodiment of the present application, the emotion polarity can be quantified by a numerical value: for example, the closer the value is to 1, the more positive the emotion polarity; the closer the value is to -1, the more negative the emotion polarity; and the closer the value is to 0, the more neutral the emotion polarity.
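The classification layer described above (hidden layer, fully connected layer, softmax) can be sketched as one small PyTorch module; the layer sizes and the three-way polarity output are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class AttributeClassificationLayer(nn.Module):
        """Hidden layer -> fully connected layer -> softmax over emotion polarity."""
        def __init__(self, d: int, n_polarities: int = 3):   # e.g. positive / neutral / negative
            super().__init__()
            self.hidden = nn.Linear(d, d)
            self.fc = nn.Linear(d, n_polarities)

        def forward(self, second_vectors):                    # (n_attributes, d)
            h = torch.relu(self.hidden(second_vectors))
            return torch.softmax(self.fc(h), dim=-1)          # (n_attributes, n_polarities)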
Step S306: and adjusting model parameters in the semantic analysis model according to the classification result and labels preset for the sample text so as to complete the training of the semantic analysis model.
In the embodiment of the present application, the model parameters to be adjusted at least include the classification parameters described above, and may further include the attention matrix and the self-attention matrix described above. The model parameters in the semantic analysis model can be adjusted by using a traditional training method. That is, the loss (hereinafter referred to as a first loss) corresponding to the classification result is determined directly from the obtained classification result and the label preset for the sample text, and the model parameters in the semantic analysis model are adjusted with the first loss minimized as a training target to complete the training of the semantic analysis model.
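A minimal training step under stated assumptions (cross-entropy as the concrete form of the first loss and PyTorch as the framework; neither is specified in the application):

    import torch

    def train_step(model, optimizer, sample_batch, labels):
        """Adjust the model parameters (classification parameters, attention and
        self-attention matrices) by minimising the first loss."""
        optimizer.zero_grad()
        logits = model(sample_batch)
        loss = torch.nn.functional.cross_entropy(logits, labels)   # the "first loss"
        loss.backward()
        optimizer.step()
        return loss.item()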
In the embodiment of the application, because the self-attention matrix for representing the correlation between different semantic attributes is added to the semantic analysis model, the semantic analysis model obtained by training by adopting the traditional training method can analyze the semantics of the text to be analyzed more accurately.
Continuing to refer to fig. 4, a flowchart of one embodiment of step S302 of fig. 3 is shown, and for ease of illustration, only the portions relevant to the present application are shown.
In some optional implementation manners of this embodiment, step S302 specifically includes: step S401.
Step S401: and inputting each participle into a semantic representation layer in a semantic analysis model to obtain a bidirectional semantic representation vector corresponding to each participle output by the semantic representation layer as a word vector corresponding to each participle.
In an embodiment of the application, the semantic representation layer comprises at least a sub-model for outputting the bi-directional semantic representation vector, the sub-model comprising a BERT model.
Continuing to refer to fig. 5, a flowchart of one embodiment of step S303 of fig. 3 is shown, and for convenience of illustration, only the portions relevant to the present application are shown.
In some optional implementation manners of this embodiment, step S303 specifically includes: step S501, step S502, and step S503.
Step S501: and inputting the word vector corresponding to each participle into an attribute representation layer in the semantic analysis model.
In the embodiment of the present application, at least the attribute characterization layer includes an attention matrix corresponding to each semantic attribute.
Step S502: and carrying out attention weighting on the word vector corresponding to each participle through an attention matrix corresponding to the semantic attribute contained in the attribute representation layer.
In this embodiment of the present application, a word vector corresponding to each participle may be input to an attribute characterization layer in a semantic analysis model, the attention matrix corresponding to the semantic attribute included in the attribute characterization layer is used to perform attention weighting on the word vector corresponding to each participle, and a first feature expression vector of the sample text related to the semantic attribute is determined according to the word vector corresponding to each participle after the attention weighting.
Step S503: and determining a first feature expression vector of the sample text related to the semantic attribute according to the word vector corresponding to each participle after attention weighting.
In this embodiment, the first feature expression vector may characterize the probability that the sample text relates to the semantic attribute and the emotion polarity on the semantic attribute.
Continuing to refer to FIG. 6, a flowchart of one embodiment of step S304 of FIG. 3 is shown, and for ease of illustration, only the portions relevant to the present application are shown.
In some optional implementation manners of this embodiment, step S304 specifically includes: step S601, step S602, and step S603.
Step S601: the first feature representation vector of the sample text related to each semantic attribute is input to the attribute relevance representation layer in the semantic analysis model.
In this embodiment of the present application, the attribute relevance representation layer in the semantic analysis model includes at least a self-attention matrix, where the self-attention matrix is used to represent the correlation between different semantic attributes. The self-attention matrix may take the following form: the element R_ij of the matrix represents the correlation between the i-th semantic attribute and the j-th semantic attribute; the stronger the correlation, the larger the value of R_ij, and conversely, the smaller the value.
Step S602: the first feature representation vector of the sample text relating to each semantic attribute is self-attention weighted by a self-attention matrix included in the attribute relevance representation layer for representing the relevance between different semantic attributes.
Step S603: a second feature representation vector of the sample text relating to each semantic attribute is determined from the respective first feature representation vectors weighted from attention.
In the embodiment of the present application, the second feature representation vector also represents the probability that the sample text relates to each semantic attribute and the emotion polarity on that attribute. Unlike the second feature representation vector, the first feature representation vector is obtained by weighting the word vectors with the attention matrices corresponding to the individual semantic attributes, which are independent of one another; therefore, the probability and emotion polarity characterized by the first feature representation vector do not take the correlation between different semantic attributes into account. The second feature representation vector, by contrast, is obtained by weighting the first feature representation vectors with the self-attention matrix that represents the correlation between different semantic attributes, which is equivalent to introducing the factor of inter-attribute correlation through the self-attention matrix; therefore, the probability that the sample text relates to each semantic attribute and the emotion polarity on that attribute, as represented by the second feature representation vector, do take the correlation between different semantic attributes into account.
With continued reference to fig. 7, a flowchart of a specific implementation of obtaining key text data according to an embodiment of the present application is shown;
in some optional implementations of this embodiment, after step S202, the method further includes: step S701, step S702, step S703 and step S704, wherein the step S202 includes: step S705.
Step S701: and carrying out preprocessing operation on the user voice text to obtain a preprocessed field.
In the embodiment of the application, the preprocessing operation is mainly used to split the user voice text into phrases and to delete text with weak relevance, obtaining the preprocessed fields.
Step S702: and carrying out similarity calculation operation on the preprocessed fields to obtain similarity scores.
In the embodiment of the present application, it is assumed that the series of sentences obtained after the above operation can be represented as [s_1, s_2, …, s_n], n sentences in total, and that the i-th sentence is [w_i1, w_i2, …, w_im], with m words in the sentence. Similarity is calculated for every pair of the n sentences, giving an n × n similarity score matrix P. The element P_ij in row i and column j of the score matrix represents the similarity score of the i-th sentence and the j-th sentence. The similarity can be computed with the traditional method of counting the overlapping words between two sentences, or a sentence vector can be obtained as a weighted average of word vectors (word2vec) and the cosine similarity of the sentence vectors used as the similarity score. All P_ii are then set to 0, because computing a sentence's similarity with itself is meaningless.
In the embodiment of the present application, the similarity calculation operation is performed based on cosine similarity. This yields an n × n symmetric matrix P; it is symmetric because the similarity between sentence A and sentence B equals the similarity between sentence B and sentence A. P_ij represents the similarity of the i-th sentence and the j-th sentence.
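A sketch of the similarity matrix computation, assuming each sentence has already been turned into a vector (e.g. a weighted average of word2vec word vectors, as mentioned above):

    import numpy as np

    def similarity_matrix(sentence_vectors: np.ndarray) -> np.ndarray:
        """n x n cosine-similarity matrix P with the diagonal zeroed, because a
        sentence's similarity with itself carries no information."""
        norms = np.linalg.norm(sentence_vectors, axis=1, keepdims=True) + 1e-8
        unit = sentence_vectors / norms
        P = unit @ unit.T
        np.fill_diagonal(P, 0.0)
        return P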
Step S703: and inputting the similarity score into the TextRank for iterative operation to obtain a field score.
In the embodiment of the present application, the scoring matrix of the above steps is used as an input of the TextRank to perform iteration, where the TextRank is specifically iterated as follows:
W(i) = (1 − d) + d · Σ_{j≠i} [ P_ij / Σ_k P_jk ] · W(j)
wherein W(i) represents the field score of field i; d is the damping factor (commonly set to 0.85 in TextRank); P_ij represents the similarity score between field i and field j; and P_jk represents the weight of the edge between field j and field k.
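The iteration can be implemented directly on the score matrix; the fixed iteration count and the damping factor value below are implementation choices, not values given in the application.

    import numpy as np

    def textrank_scores(P: np.ndarray, d: float = 0.85, iterations: int = 100) -> np.ndarray:
        """Iterate the TextRank update W(i) = (1 - d) + d * sum_j [P_ji / sum_k P_jk] * W(j)."""
        n = P.shape[0]
        w = np.ones(n)
        out_weights = P.sum(axis=1) + 1e-8        # sum_k P[j, k] for every field j
        for _ in range(iterations):
            w = (1 - d) + d * (P.T / out_weights) @ w
        return w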
Step S704: and splicing the preprocessed fields according to the field scores to obtain the key text data.
In the embodiment of the present application, the calculation result is [v_1, v_2, …, v_n], the importance scores corresponding to the n sentences respectively. v_1 (corresponding to the title) is removed, and the remaining scores are ranked from large to small. The corresponding sentences are taken from the highest score downward and spliced as the digest until the target length L is reached; as an example, L may be 200.
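The splicing step can then be sketched as selecting sentences by score until the target length is reached; counting the length in characters is an assumption made for illustration.

    def build_key_text(sentences, scores, target_length=200):
        """Drop the first sentence (the title), rank the rest by importance score,
        and concatenate the highest-scoring sentences up to the target length."""
        ranked = sorted(range(1, len(sentences)), key=lambda i: scores[i], reverse=True)
        pieces, total = [], 0
        for i in ranked:
            if total >= target_length:
                break
            pieces.append(sentences[i])
            total += len(sentences[i])
        return "".join(pieces)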
Step S705: and performing semantic analysis operation on the key text data according to the semantic analysis model to obtain user semantic information.
In the embodiment of the application, the user voice text is split, the similarity of each split field is calculated, the importance score of each field is obtained based on TextRank, and finally the fields are spliced based on the importance scores to obtain the key text data that is most relevant to the user voice text and of the best quality. This effectively improves the quality of the user voice text and thus the efficiency and accuracy of the subsequent semantic analysis operation.
In summary, the present application provides a sign language synthesis method, including: acquiring a user voice audio to be translated; performing a voice recognition operation on the user voice audio to obtain a user voice text; performing a semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information; performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information; performing sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation; and outputting the target sign language animation. When a hearing-impaired person communicates with a conversation partner and the sound made by the conversation partner (i.e., the user voice audio) is received, a voice recognition operation converting the audio into text is performed to obtain the spoken content of the conversation partner; a semantic analysis operation is performed on the spoken content according to the semantic analysis model to obtain the real semantics the conversation partner intends to express; sign language serialization is performed according to the real semantics; and finally sign language motion synthesis is performed by the preset three-dimensional authoring engine and the target sign language animation is output, so that the hearing-impaired person can understand the semantic information conveyed by the conversation partner. Because sign language synthesis is performed according to the real semantics of the conversation partner, the accuracy of sign language synthesis can be effectively improved. Meanwhile, semantic analysis is performed in combination with the context of ambiguous vocabulary to obtain the actual meaning of the vocabulary, and an objective normative-expression evaluation operation is performed, so that misjudgment is effectively avoided and the reference value of the normative-expression score is effectively improved; by filtering the frequent words appearing in each return call of the customer service personnel, the situation in which individual customer service personnel intentionally repeat certain normative words to raise their own normative-phrase score is effectively avoided, further improving the reference value of the normative-phrase score. In addition, the user voice text is split, the similarity of each split field is calculated, the importance score of each field is obtained based on TextRank, and the fields are spliced based on the importance scores to obtain the key text data that is most relevant to the user voice text and of the best quality, which effectively improves the quality of the user voice text and thus the efficiency and accuracy of the subsequent semantic analysis operation.
It is emphasized that, to further ensure the privacy and security of the user voice audio and the target sign language animation, the user voice audio and the target sign language animation may also be stored in a node of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks linked by cryptographic methods; each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, can include processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless explicitly stated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
Example two
With further reference to fig. 8, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a sign language synthesis apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 8, the sign language synthesis apparatus 200 of the present embodiment includes: an audio acquisition module 201, a speech recognition module 202, a semantic analysis module 203, a serialization module 204, a motion synthesis module 205, and an animation output module 206. Wherein:
an audio acquiring module 201, configured to acquire a user voice audio to be translated;
the voice recognition module 202 is configured to perform voice recognition operation on a user voice audio to obtain a user voice text;
the semantic analysis module 203 is used for performing semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information;
the serialization module 204 is used for performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information;
the motion synthesis module 205 is configured to perform sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation;
and the animation output module 206 is used for outputting the target sign language animation.
In the embodiment of the application, the user voice audio to be translated refers to the audio content spoken by the conversation partner when a hearing-impaired person communicates with the conversation partner; because of the hearing-impaired person's hearing difficulty, this audio content cannot be recognized normally and therefore needs to be translated.
In the embodiment of the present application, the user voice audio to be translated may be collected in real time through an audio collection terminal, or may be obtained from data carrying the user voice audio content to be translated that is sent by a user terminal.
In the embodiment of the present application, the voice recognition operation is mainly used to convert the collected user voice audio to be translated into text data. Specifically, the voice recognition operation may be implemented by a pattern matching method: in the training stage, a user speaks each word in the vocabulary in turn and the feature vector of each word is stored as a template in a template library; in the recognition stage, the feature vector of the input speech is compared with each template in the template library in turn, and the word with the highest similarity is output as the recognition result.
In the embodiment of the present application, the semantic analysis model may be a model that performs semantic understanding analysis on the input Chinese text by using NLP (natural language processing) and Seq2Seq (Sequence to Sequence) models, so as to extract key semantic Chinese words; the semantic analysis model may also be a pre-trained deep recognition network model that obtains the real meaning of a target vocabulary by analyzing the associated text content. It should be understood that these examples of the semantic analysis model are given only for ease of understanding and are not intended to limit the application.
In the embodiment of the present application, a Generative Adversarial Network (GAN) is a deep learning model that produces good output through the adversarial (game-like) learning of at least two modules in its framework: a generative model and a discriminative model. The original GAN theory does not require that G and D both be neural networks; it only requires functions that can respectively perform generation and discrimination.
In the embodiment of the application, the output result of the semantic understanding analysis module is serialized into 3D sign language action words. The core techniques of this step include:
1) converging Chinese synonyms and near-synonyms onto sign-language words;
2) converting Chinese word order and grammar into sign-language word order and grammar.
In the embodiment of the application, the preset three-dimensional authoring engine is a real-time 3D interactive content authoring and operating platform serving creators in game development, art, architecture, automobile design, film and television, and other fields; specifically, the preset three-dimensional authoring engine turns creative ideas into reality by means of the Unity3D technology. The Unity platform provides a complete set of software solutions for authoring, operating and rendering any real-time interactive 2D and 3D content, and the supported platforms include mobile phones, tablets, PCs, game consoles, and augmented reality and virtual reality devices.
In the embodiment of the application, independent Chinese sign language motion 3D animation files are prefabricated, and each is mapped to a corresponding sign language motion ID. The sign language motion ID sequence is then converted into a continuous sign language 3D animation by real-time 3D motion synthesis in the Unity3D engine.
In an embodiment of the present application, a sign language synthesis apparatus 200 is provided, comprising: an audio acquisition module 201 for acquiring a user voice audio to be translated; a voice recognition module 202 for performing a voice recognition operation on the user voice audio to obtain a user voice text; a semantic analysis module 203 for performing a semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information; a serialization module 204 for performing a sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information; a motion synthesis module 205 for performing sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation; and an animation output module 206 for outputting the target sign language animation. When a hearing-impaired person communicates with a conversation partner and the sound made by the conversation partner (i.e., the user voice audio) is received, a voice recognition operation converting the audio into text is performed to obtain the spoken content of the conversation partner; a semantic analysis operation is performed on the spoken content according to the semantic analysis model to obtain the real semantics the conversation partner intends to express; sign language serialization is performed according to the real semantics; and finally sign language motion synthesis is performed by the preset three-dimensional authoring engine and the target sign language animation is output, so that the hearing-impaired person can understand the semantic information conveyed by the conversation partner. Because sign language synthesis is performed according to the real semantics of the conversation partner, the accuracy of sign language synthesis can be effectively improved.
Continuing to refer to fig. 9, a schematic structural diagram of a specific implementation of obtaining a semantic analysis model provided in embodiment two of the present application is shown, and for convenience of description, only the relevant portions of the present application are shown.
In some optional implementations of the present embodiment, the sign language synthesis apparatus 200 further includes: a sample obtaining module 207, a word vector obtaining module 208, a first feature determining module 209, a second feature determining module 210, a classification result determining module 211, and a parameter adjusting module 212, wherein:
a sample obtaining module 207, configured to obtain a sample text from a local database, and determine each participle included in the sample text;
a word vector obtaining module 208, configured to determine, based on the semantic analysis model to be trained, a word vector corresponding to each participle;
the first feature determining module 209 is configured to read the local database, obtain each semantic attribute in the local database, and determine a first feature representation vector of the sample text related to the semantic attribute according to an attention matrix corresponding to the semantic attribute and a word vector corresponding to each participle included in the semantic analysis model to be trained;
a second feature determination module 210, configured to determine a second feature representation vector of the sample text related to each semantic attribute according to a self-attention matrix included in the semantic analysis model to be trained and used for representing correlation between different semantic attributes, and the first feature representation vector of the sample text related to each semantic attribute;
the classification result determining module 211 is configured to determine the classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector of each semantic attribute related to the sample text, where the classification result includes the semantic attribute to which the sample text belongs and the emotion polarity corresponding to that semantic attribute;
and the parameter adjusting module 212 is configured to adjust model parameters in the semantic analysis model according to the classification result and a preset label for the sample text, so as to complete training of the semantic analysis model.
In this embodiment of the present application, a plurality of texts may be obtained from the local database, and a training set formed by the obtained plurality of texts is determined, so that each text in the training set may be used as a sample text.
In this embodiment of the present application, when determining the participles included in the sample text, the sample text may first be subjected to word segmentation processing to obtain each participle included in the sample text. Any word segmentation method may be adopted for this processing, and of course, each character in the sample text may also be treated as a participle; it should be understood that these examples of word segmentation processing are given only for convenience of understanding and do not limit the present application.
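As one possible realization of the word segmentation step, the commonly used jieba tokenizer may be employed, or character-level segmentation may be used as noted above; the sample sentence below is invented purely for illustration.

import jieba

sample_text = "今天天气很好"            # invented sample text, for illustration only
participles = jieba.lcut(sample_text)   # e.g. something like ['今天', '天气', '很', '好'], depending on the dictionary

# Character-level fallback: treat every character as a participle.
char_participles = list(sample_text)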
In the embodiment of the present application, the semantic analysis model may include at least four layers, namely: a semantic representation layer, an attribute representation layer, an attribute relevance representation layer and a classification layer.
In the embodiment of the present application, the semantic representation layer at least includes a sub-model for outputting a bidirectional semantic representation vector, such as a BERT (Bidirectional Encoder Representations from Transformers) model. Each participle can be input into the semantic representation layer in the semantic analysis model, and the bidirectional semantic representation vector corresponding to each participle output by the semantic representation layer is obtained and serves as the word vector corresponding to that participle. It should be understood that models for outputting bidirectional semantic representation vectors are not limited to the BERT model described above; the example is given only for convenience of understanding and does not limit the present application.
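For illustration only, the following sketch shows how bidirectional semantic representation vectors could be obtained with the Hugging Face transformers implementation of BERT; the checkpoint name bert-base-chinese is an assumption, and any other sub-model that outputs bidirectional semantic representation vectors could be substituted.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

tokens = tokenizer("今天天气很好", return_tensors="pt")
with torch.no_grad():
    outputs = bert(**tokens)

# One bidirectional semantic representation vector per token, used as the word vectors.
word_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)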
In this embodiment of the present application, the word vector corresponding to each participle may be input to the attribute representation layer in the semantic analysis model; attention weighting is performed on the word vector corresponding to each participle through the attention matrix corresponding to the semantic attribute contained in the attribute representation layer, and the first feature representation vector of the sample text related to the semantic attribute is determined according to the attention-weighted word vector corresponding to each participle.
In the embodiment of the present application, the first feature representation vector of the sample text related to each semantic attribute may be input to the attribute relevance representation layer in the semantic analysis model; self-attention weighting is performed on the first feature representation vector of the sample text related to each semantic attribute through the above-mentioned self-attention matrix contained in the attribute relevance representation layer, and the second feature representation vector of the sample text related to each semantic attribute is determined according to each self-attention-weighted first feature representation vector.
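The following sketch shows one assumed way in which the attribute representation layer and the attribute relevance representation layer could be parameterized, with one attention matrix per semantic attribute and a shared self-attention matrix; the tensor shapes and the use of softmax normalization are illustrative assumptions rather than details fixed by the present application.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeLayers(nn.Module):
    def __init__(self, num_attributes, hidden_size):
        super().__init__()
        # One attention matrix (here a vector per attribute) for the attribute representation layer.
        self.attr_attention = nn.Parameter(torch.randn(num_attributes, hidden_size))
        # Self-attention matrix modelling the correlation between different semantic attributes.
        self.self_attention = nn.Parameter(torch.randn(num_attributes, num_attributes))

    def forward(self, word_vectors):                        # word_vectors: (seq_len, hidden_size)
        # First feature vectors: attention-weighted sums of word vectors, one per semantic attribute.
        scores = word_vectors @ self.attr_attention.T       # (seq_len, num_attributes)
        weights = F.softmax(scores, dim=0)                  # attention over the participles
        first_features = weights.T @ word_vectors           # (num_attributes, hidden_size)

        # Second feature vectors: self-attention weighting across semantic attributes.
        attr_weights = F.softmax(self.self_attention, dim=-1)
        second_features = attr_weights @ first_features     # (num_attributes, hidden_size)
        return first_features, second_features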
In the embodiment of the application, the classification layer at least comprises a hidden layer, a full connection layer and a softmax layer.
In the embodiment of the application, the second feature representation vector of the sample text related to each semantic attribute can be sequentially input into the hidden layer, the full connection layer and the softmax layer of the classification layer, and the sample text is classified according to each second feature representation vector and the classification parameters corresponding to each semantic attribute contained in the hidden layer, the full connection layer and the softmax layer, so as to obtain the classification result output by the classification layer.
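A minimal sketch of the classification layer (hidden layer, full connection layer and softmax layer) applied to the second feature representation vectors is given below; producing, per semantic attribute, a probability distribution over three emotion polarity classes is an assumed output format.

import torch
import torch.nn as nn

class ClassificationLayer(nn.Module):
    def __init__(self, hidden_size, num_polarities=3):
        super().__init__()
        self.hidden = nn.Linear(hidden_size, hidden_size)   # hidden layer
        self.fc = nn.Linear(hidden_size, num_polarities)    # full connection layer
        self.softmax = nn.Softmax(dim=-1)                   # softmax layer

    def forward(self, second_features):                     # (num_attributes, hidden_size)
        h = torch.relu(self.hidden(second_features))
        logits = self.fc(h)
        # Per-attribute probabilities over emotion polarity classes (e.g. negative/neutral/positive).
        return self.softmax(logits)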
In the embodiment of the present application, the classification result at least includes the semantic attribute to which the sample text belongs and the emotion polarity corresponding to the semantic attribute to which the sample text belongs.
In the embodiment of the present application, the emotion polarity can be quantified by a numerical value: for example, the closer the value is to 1, the more positive the emotion polarity; the closer the value is to -1, the more negative the emotion polarity; and the closer the value is to 0, the more neutral the emotion polarity.
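As one assumed mapping from the per-attribute class probabilities to such a numerical value, the expected value over the class values -1 (negative), 0 (neutral) and +1 (positive) may be taken; this mapping is illustrative and is not prescribed by the present application.

def polarity_score(probs):
    """probs: probabilities for the (negative, neutral, positive) emotion classes."""
    p_neg, p_neu, p_pos = probs
    # Expected value over class values -1, 0 and +1: close to 1 = positive,
    # close to -1 = negative, close to 0 = neutral.
    return -1.0 * p_neg + 0.0 * p_neu + 1.0 * p_pos

print(polarity_score((0.1, 0.2, 0.7)))   # 0.6 -> leaning positive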
In the embodiment of the present application, the model parameters to be adjusted at least include the classification parameters described above, and may further include the attention matrix and the self-attention matrix described above. The model parameters in the semantic analysis model can be adjusted by using a traditional training method. That is, the loss (hereinafter referred to as a first loss) corresponding to the classification result is determined directly from the obtained classification result and the label preset for the sample text, and the model parameters in the semantic analysis model are adjusted with the first loss minimized as a training target to complete the training of the semantic analysis model.
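A minimal sketch of this conventional training procedure is given below, assuming that model combines the semantic representation, attribute and classification layers sketched above and that training_set yields sample texts with their preset per-attribute polarity labels; the optimizer, learning rate and negative log-likelihood loss are assumptions, and batching is omitted.

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)       # model: assumed composition of the layers above

for sample_text, labels in training_set:                         # labels: preset per-attribute polarity class indices
    probs = model(sample_text)                                   # classification result: per-attribute class probabilities
    first_loss = F.nll_loss(torch.log(probs + 1e-12), labels)    # first loss between classification result and labels
    optimizer.zero_grad()
    first_loss.backward()                                        # adjust model parameters (classification parameters,
    optimizer.step()                                             # attention matrices, self-attention matrix) to minimize it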
In the embodiment of the application, because the self-attention matrix for representing the correlation between different semantic attributes is added to the semantic analysis model, the semantic analysis model obtained by training by adopting the traditional training method can analyze the semantics of the text to be analyzed more accurately.
In some optional implementations of this embodiment, the word vector obtaining module 208 includes: a word vector acquisition submodule, wherein:
and the word vector acquisition submodule is used for inputting each participle into a semantic representation layer in the semantic analysis model to obtain a bidirectional semantic representation vector which corresponds to each participle output by the semantic representation layer and is used as a word vector corresponding to each participle.
In some optional implementations of this embodiment, the first characteristic determining module 209 includes: a word vector input sub-module, an attention weighting sub-module, and a first feature determination sub-module, wherein:
the word vector input submodule is used for inputting the word vector corresponding to each participle into an attribute representation layer in the semantic analysis model;
the attention weighting submodule is used for carrying out attention weighting on the word vector corresponding to each participle through an attention matrix corresponding to the semantic attribute contained in the attribute representation layer;
and the first feature determination submodule is used for determining a first feature representation vector of the sample text related to the semantic attribute according to the word vector corresponding to each participle after the attention weighting.
In some optional implementations of this embodiment, the second feature determining module 210 includes: a first feature representation vector input sub-module, a self-attention weighting sub-module, and a second feature determination sub-module, wherein:
a first feature representation vector input submodule for inputting the first feature representation vector of the sample text related to each semantic attribute into the attribute relevance representation layer in the semantic analysis model;
the self-attention weighting submodule is used for carrying out self-attention weighting on a first feature representation vector of the sample text related to each semantic attribute through a self-attention matrix which is contained in the attribute relevance representation layer and used for representing the relevance between different semantic attributes;
and the second feature determination submodule is used for determining a second feature representation vector of the sample text related to each semantic attribute according to the first feature representation vectors weighted by the self attention.
In some optional implementations of the present embodiment, the sign language synthesis apparatus 200 further includes: a preprocessing operation module, a similarity calculation module, an iteration operation module, and a splicing operation module, where the semantic analysis module 203 includes: a semantic analysis submodule, wherein:
the preprocessing operation module is used for preprocessing the user voice text to obtain a preprocessing field;
the similarity calculation module is used for carrying out similarity calculation operation on the preprocessed fields to obtain similarity scores;
the iteration operation module is used for inputting the similarity score to the TextRank for iteration operation to obtain a field score;
the splicing operation module is used for splicing the preprocessed fields according to the field scores to obtain key text data;
and the semantic analysis submodule is used for performing semantic analysis operation on the key text data according to the semantic analysis model to obtain user semantic information.
In some optional implementations of this embodiment, the specific iteration formula of the TextRank is:
W(i) = (1 - d) + d · Σ_j ( P_ij / Σ_k P_jk ) · W(j)

wherein W(i) represents the field score of field i; P_ij represents the similarity score between field i and field j; P_jk represents the weight of the edge between field j and field k; d is a damping factor; the sum over j runs over the fields adjacent to field i, and the sum over k runs over the fields adjacent to field j.
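A possible Python realization of this iteration is sketched below, treating the preprocessed fields as graph nodes whose edges are weighted by the similarity scores; the damping factor of 0.85, the iteration limit and the convergence threshold are conventional TextRank assumptions rather than values fixed by the present application.

def textrank(similarity, d=0.85, max_iter=100, tol=1e-6):
    """similarity[i][j]: similarity score P_ij between field i and field j."""
    n = len(similarity)
    scores = [1.0] * n                                    # initial field scores W(i)
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j == i or similarity[j][i] == 0:
                    continue
                out_weight = sum(similarity[j][k] for k in range(n) if k != j)   # sum of P_jk
                if out_weight > 0:
                    rank += similarity[j][i] / out_weight * scores[j]
            new_scores.append((1 - d) + d * rank)
        if max(abs(a - b) for a, b in zip(new_scores, scores)) < tol:
            break
        scores = new_scores
    return scores                                         # one field score per preprocessed field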
In summary, the present application provides a sign language synthesis apparatus 200, comprising: an audio acquiring module 201, configured to acquire a user voice audio to be translated; the voice recognition module 202, configured to perform voice recognition operation on the user voice audio to obtain a user voice text; the semantic analysis module 203, configured to perform semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information; the serialization module 204, configured to perform sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information; the motion synthesis module 205, configured to perform sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation; and the animation output module 206, configured to output the target sign language animation. When a hearing-impaired person communicates with a conversation object and receives the sound emitted by the conversation object (namely the user voice audio), the apparatus performs a voice recognition operation that converts the audio into text to obtain the speaking content of the conversation object, performs a semantic analysis operation on the speaking content according to the semantic analysis model to obtain the real semantics the conversation object intends to express, performs sign language serialization according to the real semantics, and finally performs sign language action synthesis through the preset three-dimensional authoring engine and outputs the target sign language animation, so that the hearing-impaired person can understand the semantic information conveyed by the conversation object; since sign language synthesis is carried out according to the real semantics of the conversation object, the accuracy of sign language synthesis can be effectively improved. Meanwhile, semantic analysis is carried out in combination with the context of ambiguous vocabulary to obtain the actual meaning of the vocabulary, and an objective normative expression evaluation operation is carried out, so that misjudgment is effectively avoided and the reference value of the normative expression score is effectively improved; by filtering the frequent words appearing in each return call of customer service personnel, the situation that individual customer service personnel intentionally repeat certain normative words to raise their own normative phrase scores is effectively avoided, which further improves the reference value of the normative phrase score; and the user voice text is split, the similarity of each split field is calculated, the importance score of each field is obtained based on the TextRank, and the fields are finally spliced based on the importance scores to obtain key text data with the greatest relevance to the user voice text and the best quality, so that the quality of the user voice text is effectively improved, and the efficiency and accuracy of the subsequent semantic analysis operation are improved.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 10, fig. 10 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 300 includes a memory 310, a processor 320, and a network interface 330 communicatively coupled to each other via a system bus. It is noted that only the computer device 300 having components 310-330 is shown, but it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 310 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 310 may be an internal storage unit of the computer device 300, such as a hard disk or an internal memory of the computer device 300. In other embodiments, the memory 310 may also be an external storage device of the computer device 300, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the computer device 300. Of course, the memory 310 may also include both internal and external storage devices of the computer device 300. In this embodiment, the memory 310 is generally used for storing an operating system and various application software installed on the computer device 300, such as computer readable instructions of a sign language synthesis method. In addition, the memory 310 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 320 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 320 is generally operative to control overall operation of the computer device 300. In this embodiment, the processor 320 is configured to execute computer readable instructions stored in the memory 310 or process data, such as computer readable instructions for executing the sign language synthesis method.
The network interface 330 may include a wireless network interface or a wired network interface, and the network interface 330 is generally used to establish a communication connection between the computer device 300 and other electronic devices.
When a hearing-impaired person communicates with a conversation object and receives the sound emitted by the conversation object (namely the user voice audio), the computer device performs a voice recognition operation that converts the audio into text to obtain the speaking content of the conversation object, performs a semantic analysis operation on the speaking content according to the semantic analysis model to obtain the real semantics the conversation object intends to express, performs sign language serialization according to the real semantics, and finally performs sign language action synthesis through the preset three-dimensional authoring engine and outputs the target sign language animation, so that the hearing-impaired person can understand the semantic information conveyed by the conversation object; since sign language synthesis is carried out according to the real semantics of the conversation object, the accuracy of sign language synthesis can be effectively improved.
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the sign language synthesis method as described above.
When a hearing-impaired person communicates with a conversation object and receives the sound emitted by the conversation object (namely the user voice audio), the computer-readable storage medium performs a voice recognition operation that converts the audio into text to obtain the speaking content of the conversation object, performs a semantic analysis operation on the speaking content according to the semantic analysis model to obtain the real semantics the conversation object intends to express, performs sign language serialization according to the real semantics, and finally performs sign language action synthesis through the preset three-dimensional authoring engine and outputs the target sign language animation, so that the hearing-impaired person can understand the semantic information conveyed by the conversation object; since sign language synthesis is carried out according to the real semantics of the conversation object, the accuracy of sign language synthesis can be effectively improved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, embodiments of the present application, and that the appended drawings illustrate preferred embodiments of the application without limiting its scope. This application may be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the technical solutions described in the foregoing embodiments may still be modified, or some of their features may be replaced by equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A sign language synthesis method is characterized by comprising the following steps:
acquiring a user voice audio to be translated;
performing voice recognition operation on the user voice audio to obtain a user voice text;
performing semantic analysis operation on the user voice text according to a semantic analysis model to obtain user semantic information;
performing sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information;
synthesizing the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation;
and outputting the target sign language animation.
2. The sign language synthesis method according to claim 1, wherein before the step of performing semantic analysis operation on the user speech text according to the semantic analysis model to obtain user semantic information, the method further comprises:
reading a local database, obtaining a sample text from the local database, and determining each participle contained in the sample text;
determining a word vector corresponding to each participle based on a semantic analysis model to be trained;
acquiring each semantic attribute from the local database, and determining a first feature expression vector of the sample text related to the semantic attribute according to an attention matrix corresponding to the semantic attribute and a word vector corresponding to each participle in the semantic analysis model to be trained;
determining a second feature representation vector of the sample text related to each semantic attribute according to a self-attention matrix which is contained in the semantic analysis model to be trained and used for representing correlation among different semantic attributes and a first feature representation vector of the sample text related to each semantic attribute;
determining a classification result output by the semantic analysis model to be trained according to the semantic analysis model to be trained and the second feature representation vector of the sample text related to each semantic attribute, wherein the classification result comprises the semantic attribute to which the sample text belongs and the emotion polarity corresponding to the semantic attribute to which the sample text belongs;
and adjusting model parameters in the semantic analysis model according to the classification result and labels preset for the sample text so as to finish training the semantic analysis model.
3. The sign language synthesis method according to claim 2, wherein the step of determining the word vector corresponding to each participle based on the semantic analysis model to be trained specifically comprises:
and inputting each participle into a semantic representation layer in the semantic analysis model to obtain a bidirectional semantic representation vector corresponding to each participle output by the semantic representation layer as a word vector corresponding to each participle.
4. The sign language synthesis method according to claim 2, wherein the step of obtaining each semantic attribute in the local database, and determining the first feature expression vector of the sample text related to the semantic attribute according to the attention matrix corresponding to the semantic attribute included in the semantic analysis model to be trained and the word vector corresponding to each participle specifically comprises:
inputting a word vector corresponding to each participle into an attribute representation layer in the semantic analysis model;
carrying out attention weighting on a word vector corresponding to each participle through an attention matrix corresponding to the semantic attribute contained in the attribute representation layer;
and determining a first feature expression vector of the sample text related to the semantic attribute according to the word vector corresponding to each participle after attention weighting.
5. The sign language synthesis method according to claim 2, wherein the step of determining the second feature representation vector of the sample text related to each semantic attribute according to the self-attention matrix included in the semantic analysis model to be trained for representing the correlation between different semantic attributes and the first feature representation vector of the sample text related to each semantic attribute comprises:
inputting a first feature representation vector of the sample text related to each semantic attribute into an attribute relevance representation layer in the semantic analysis model;
self-attention weighting a first feature representation vector of the sample text relating to each semantic attribute by a self-attention matrix included in the attribute relevance representation layer for representing relevance between different semantic attributes;
and determining a second feature representation vector of the sample text related to each semantic attribute according to the first feature representation vectors after self attention weighting.
6. A sign language synthesis method according to claim 1, characterized by further comprising, after the step of performing a speech recognition operation on the user speech audio to obtain a user speech text, the steps of:
preprocessing the user voice text to obtain a preprocessed field;
carrying out similarity calculation operation on the preprocessed fields to obtain similarity scores;
inputting the similarity score into the TextRank for iterative operation to obtain a field score;
splicing the preprocessed fields according to the field scores to obtain key text data;
the step of performing semantic analysis operation on the user voice text according to the semantic analysis model to obtain user semantic information specifically comprises the following steps:
and performing semantic analysis operation on the key text data according to the semantic analysis model to obtain the user semantic information.
7. The sign language synthesis method according to claim 6, wherein the iteration of the TextRank is represented as:
W(i) = (1 - d) + d · Σ_j ( P_ij / Σ_k P_jk ) · W(j)

wherein W(i) represents the field score of field i; P_ij represents the similarity score between field i and field j; P_jk represents the weight of the edge between field j and field k; d is a damping factor; the sum over j runs over the fields adjacent to field i, and the sum over k runs over the fields adjacent to field j.
8. A sign language synthesis apparatus, comprising:
the audio acquisition module is used for acquiring the voice audio of the user to be translated;
the voice recognition module is used for carrying out voice recognition operation on the user voice audio to obtain a user voice text;
the semantic analysis module is used for carrying out semantic analysis operation on the user voice text according to a semantic analysis model to obtain user semantic information;
the serialization module is used for carrying out sign language serialization operation on the user semantic information according to a generative adversarial network to obtain sign language sequence information;
the motion synthesis module is used for carrying out sign language motion synthesis on the sign language sequence information according to a preset three-dimensional authoring engine to obtain a target sign language animation;
and the animation output module is used for outputting the target sign language animation.
9. A computer device, comprising a memory and a processor, wherein the memory stores computer readable instructions, and the processor, when executing the computer readable instructions, implements the steps of the sign language synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the sign language synthesis method according to any one of claims 1 to 7.
CN202111432719.9A 2021-11-29 2021-11-29 Sign language synthesis method and device, computer equipment and storage medium Pending CN114048757A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111432719.9A CN114048757A (en) 2021-11-29 2021-11-29 Sign language synthesis method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111432719.9A CN114048757A (en) 2021-11-29 2021-11-29 Sign language synthesis method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114048757A true CN114048757A (en) 2022-02-15

Family

ID=80211621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111432719.9A Pending CN114048757A (en) 2021-11-29 2021-11-29 Sign language synthesis method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114048757A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination