CN113923471A - Interaction method, device, equipment and storage medium - Google Patents

Interaction method, device, equipment and storage medium Download PDF

Info

Publication number
CN113923471A
Authority
CN
China
Prior art keywords
stream data
information
target
original
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111507859.8A
Other languages
Chinese (zh)
Inventor
叶天兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Damo Institute Hangzhou Technology Co Ltd
Original Assignee
Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Damo Institute Hangzhou Technology Co Ltd filed Critical Alibaba Damo Institute Hangzhou Technology Co Ltd
Priority to CN202111507859.8A
Publication of CN113923471A
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H04N21/2335Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/2053D [Three Dimensional] animation driven by audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/203D [Three Dimensional] animation
    • G06T13/403D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234309Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440218Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00Television systems
    • H04N7/14Systems for two-way working
    • H04N7/141Systems for two-way working between two video terminals, e.g. videophone

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure relates to an interaction method, apparatus, device, and storage medium. The original video information may be video information of a real character captured by a terminal, and the original audio information may be audio information of the real character captured by the terminal. The terminal encodes the video information and the audio information of the real character and sends them to a cloud server. The cloud server can generate target video information including a first virtual character according to the audio information of the real character, and generate target audio information broadcast by the first virtual character or another virtual character according to the video information of the real character. When the cloud server encodes the target video information of the first virtual character and the target audio information broadcast by the first virtual character or another virtual character and sends them to the terminal, real-time video interaction between the video information of the real character and the target video information of the first virtual character is achieved. Therefore, users with hearing or language impairments can effectively communicate with normal people during video communication.

Description

Interaction method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of information technologies, and in particular, to an interaction method, apparatus, device, and storage medium.
Background
With the continuous development of science and technology, a terminal can capture video information of a real character and push it to the cloud, and the cloud can generate video information of a virtual character. However, the video information of the real character and the video information of the virtual character cannot be viewed in real time. In addition, video communication between real people is increasingly common and convenient, but when users with hearing impairment or language impairment carry out video communication with normal persons, effective communication cannot be achieved.
Disclosure of Invention
To solve, or at least partially solve, the above technical problem, the present disclosure provides an interaction method, apparatus, device, and storage medium that implement real-time video interaction between video information of a real character and target video information of a first virtual character.
In a first aspect, an embodiment of the present disclosure provides an interaction method, including:
acquiring original audio stream data and original video stream data;
decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises a real character;
performing voice recognition on the original audio information, and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character;
performing action recognition on the original video information, and generating target audio information according to an action recognition result;
coding the target video information to obtain target video stream data, and coding the target audio information to obtain target audio stream data;
and pushing the target video stream data and the target audio stream data.
In a second aspect, an embodiment of the present disclosure provides an interaction apparatus, including:
the acquisition module is used for acquiring original audio stream data and original video stream data;
the decoding module is used for decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises real characters;
the generating module is used for carrying out voice recognition on the original audio information and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character; performing action recognition on the original video information, and generating target audio information according to an action recognition result;
the coding module is used for coding the target video information to obtain target video stream data and coding the target audio information to obtain target audio stream data;
and the pushing module is used for pushing the target video stream data and the target audio stream data.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of the first aspect.
In a fourth aspect, the present disclosure provides a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method of the first aspect.
According to the interaction method, apparatus, device, and storage medium provided by the embodiments of the present disclosure, original audio stream data and original video stream data are acquired; the original audio stream data is decoded to obtain original audio information, and the original video stream data is decoded to obtain original video information, the original video information including a real character. Further, voice recognition is performed on the original audio information, and target video information is generated according to the voice recognition result, the target video information including a first virtual character. Motion recognition is performed on the original video information, and target audio information is generated according to the motion recognition result. The target video information is encoded to obtain target video stream data, the target audio information is encoded to obtain target audio stream data, and the target video stream data and the target audio stream data are pushed. The original video information may be video information of the real character captured by a terminal, and the original audio information may be audio information of the real character captured by the terminal. The terminal encodes the video information and the audio information of the real character and sends them to a cloud server. The cloud server can generate target video information including the first virtual character according to the audio information of the real character, and generate target audio information broadcast by the first virtual character or another virtual character according to the video information of the real character. When the cloud server encodes the target video information of the first virtual character and the target audio information broadcast by the first virtual character or another virtual character and sends them to the terminal, not only is real-time video interaction between the video information of the real character and the target video information of the first virtual character achieved, but real-time interaction between the audio information of the real character and the target audio information broadcast by the first virtual character or another virtual character is also achieved. Therefore, users with hearing impairment or language impairment can effectively communicate with normal people during video communication.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is apparent that those skilled in the art can derive other drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of an interaction method provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a user interface provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a user interface provided by another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a user interface provided by another embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a user interface provided by another embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a user interface provided by another embodiment of the present disclosure;
FIG. 9 is a flowchart of an interaction method provided by another embodiment of the present disclosure;
FIG. 10 is a flowchart of an interaction method provided by another embodiment of the present disclosure;
fig. 11 is a schematic structural diagram of an interaction device provided in an embodiment of the present disclosure;
fig. 12 is a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Generally, a terminal can capture video information of a real character and push it to the cloud, and the cloud can generate video information of a virtual character. However, the video information of the real character and the video information of the virtual character cannot be viewed in real time. In view of this problem, the embodiments of the present disclosure provide an interaction method that can be applied not only to sign language translation scenarios but also to metaverse applications, for example, a scenario of interaction between a real character and an avatar of a real character. The method is described below with reference to specific examples.
Fig. 1 is a flowchart of an interaction method provided in an embodiment of the present disclosure. The embodiment can be applied to the application scenario shown in fig. 2, which includes the terminal 21 and the cloud server 22. The terminal 21 specifically includes, but is not limited to, a smart phone, a palm computer, a tablet computer, a wearable device with a display screen, a desktop computer, a notebook computer, an all-in-one machine, a smart home device, and the like. The interaction method according to the embodiment of the present disclosure may be executed by the cloud server 22. In other embodiments, the method may also be performed by the terminal. As shown in fig. 1, the method comprises the following specific steps:
s101, acquiring original audio stream data and original video stream data.
As shown in fig. 2, the terminal 21 may capture audio information and video information. The audio information captured by an audio capture module of the terminal 21, such as a microphone, is denoted as original audio information, and the video information captured by a video capture module of the terminal 21, such as a camera or a video camera, is denoted as original video information, where the original video information includes a real character. Specifically, the original audio information may be an analog audio signal, and the terminal 21 may perform Pulse Code Modulation (PCM) on the analog audio signal to obtain PCM data, where PCM is an audio data format. Further, the terminal 21 may encode the PCM data to obtain the original audio stream data. In addition, the original video information may be in RGB format, which is a video pixel data format, where RGB represents the three color channels red (R), green (G), and blue (B). The terminal 21 may convert the original video information in RGB format into original video information in YUV format. The YUV format is also a video pixel data format, and YUV is a color encoding method: "Y" represents brightness (luminance or luma), i.e., the gray-scale value, while "U" and "V" represent chrominance (chroma), which describes the color and saturation of an image and specifies the color of a pixel. Further, the terminal 21 may encode the original video information in YUV format to obtain original video stream data. It will be appreciated that the processing from the original audio information to the original audio stream data and the processing from the original video information to the original video stream data may or may not be performed in parallel.
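For illustration only, the RGB-to-YUV conversion mentioned above can be sketched as follows. This is a minimal example assuming 8-bit, full-range BT.601 coefficients and the use of NumPy; the function name and library choice are assumptions made for the sketch and are not part of the disclosed embodiment.

```python
import numpy as np

def rgb_to_yuv(frame_rgb: np.ndarray) -> np.ndarray:
    """Convert an 8-bit RGB frame (H x W x 3) to YUV using full-range BT.601 coefficients."""
    rgb = frame_rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b            # luminance (Y), i.e., the gray-scale value
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128.0     # chrominance (U)
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128.0      # chrominance (V)
    return np.clip(np.stack([y, u, v], axis=-1), 0, 255).astype(np.uint8)
```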
The terminal 21 may combine the original audio stream data and the original video stream data into one piece of stream data, which may be denoted as audio/video stream data or multimedia stream data. The terminal 21 may push the audio/video stream data or multimedia stream data to the cloud server 22. When the cloud server 22 receives the audio/video stream data or multimedia stream data, it may obtain the original audio stream data and the original video stream data from the audio/video stream data or multimedia stream data.
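The synthesis of the two elementary streams into one piece of multimedia stream data can be pictured as interleaving timestamped packets before pushing. The sketch below is only illustrative: the Packet structure, the assumption that both packet lists are already sorted by timestamp, and the transport callback are hypothetical and do not describe the actual streaming protocol used by the terminal 21.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Packet:
    timestamp_ms: int                      # packets are ordered only by timestamp
    kind: str = field(compare=False)       # "audio" or "video"
    payload: bytes = field(compare=False)  # encoded elementary-stream data

def mux(audio_packets, video_packets):
    """Interleave encoded audio and video packets by timestamp into one multimedia stream.
    Both input lists are assumed to be sorted by timestamp already."""
    return list(heapq.merge(audio_packets, video_packets))

def push_to_server(packets, send):
    """Push the combined multimedia stream packet by packet; `send` is a transport callback."""
    for pkt in packets:
        send(pkt)
```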
S102, decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises real characters.
For example, the cloud server 22 may decode the original audio stream data to obtain PCM data, and further demodulate the PCM data to obtain the original audio information. In addition, the cloud server 22 may decode the original video stream data to obtain original video information, where the original video information includes real people.
Optionally, decoding the original video stream data to obtain original video information includes: decoding the original video stream data to obtain original video information in a first format; and converting the original video information in the first format into original video information in a second format.
For example, the YUV format may be denoted as a first format and the RGB format may be denoted as a second format. The cloud server 22 may decode the original video stream data to obtain original video information in YUV format, and further convert the original video information in YUV format into original video information in RGB format, where the original video information includes a real person. Similarly, the processing procedure of decoding the original audio stream data by the cloud server 22 to obtain the original audio information and the processing procedure of decoding the original video stream data by the cloud server 22 to obtain the original video information may be executed in parallel or in series.
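As an illustration of the first-format-to-second-format conversion on the cloud server side, a decoded YUV 4:2:0 frame can be converted to RGB roughly as below. The use of OpenCV and the I420 plane layout are assumptions made for the sketch.

```python
import cv2
import numpy as np

def yuv420_to_rgb(yuv_bytes: bytes, width: int, height: int) -> np.ndarray:
    """Convert one decoded I420 (YUV 4:2:0 planar) frame into an RGB image."""
    # An I420 frame stores a full-resolution Y plane followed by quarter-size U and V planes,
    # so the buffer holds height * 3 / 2 rows of `width` bytes.
    yuv = np.frombuffer(yuv_bytes, dtype=np.uint8).reshape(height * 3 // 2, width)
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)
```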
S103, carrying out voice recognition on the original audio information, and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character.
For example, the cloud server 22 may perform voice recognition on the original audio information to obtain a voice recognition result, and generate target video information according to the voice recognition result, where the target video information includes the first virtual character.
Optionally, performing voice recognition on the original audio information, and generating target video information according to a voice recognition result, including: carrying out voice recognition on the original audio information to obtain text information; generating action information corresponding to the first virtual character according to the text information; and generating target video information at least according to the action information.
For example, the cloud server 22 may perform speech recognition on the original audio information to obtain a speech recognition result, where the speech recognition result may specifically be text information corresponding to the original audio information. Further, the cloud server 22 may generate action information corresponding to the first virtual character according to the text information, for example, pose change information of a skeleton. Three-dimensional rendering is then performed on the action information corresponding to the first virtual character to obtain an animation video or action video of the first virtual character, and this animation video or action video is the target video information.
In some other embodiments, generating the target video information based at least on the motion information includes: performing three-dimensional rendering according to the action information to obtain an action video of a first virtual character; generating a subtitle according to the text information; and adding the subtitles to the action video of the first virtual character to obtain the target video information.
For example, the cloud server 22 performs three-dimensional rendering on the motion information corresponding to the first virtual character to obtain an animation video or motion video of the first virtual character. Meanwhile, the cloud server 22 may further generate a subtitle according to the text information corresponding to the original audio information and add the subtitle to the animation video or motion video of the first virtual character to obtain the target video information; that is, the target video information includes not only the first virtual character but also the subtitle, and the motion (e.g., gesture motion) of the first virtual character is matched with the subtitle. For example, the gesture motion of the first virtual character expressing "two yuan per kilogram of watermelon" and the subtitle "two yuan per kilogram of watermelon" can be played synchronously.
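The overall flow of S103 can be summarized by the following sketch. Every function it receives (speech recognition, text-to-action mapping, three-dimensional rendering, subtitle overlay) is a hypothetical placeholder standing in for the corresponding module of the cloud server 22, not an actual API.

```python
def generate_target_video(original_audio, asr, text_to_actions, render_3d, add_subtitle):
    """Sketch of S103: original audio -> text -> avatar actions -> rendered video with subtitles."""
    text = asr(original_audio)                  # speech recognition result (text information)
    actions = text_to_actions(text)             # pose change information of the first virtual character
    frames = render_3d(actions)                 # three-dimensional rendering -> action video frames
    return [add_subtitle(frame, text) for frame in frames]   # subtitles matched to the gesture motion
```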
And S104, performing motion recognition on the original video information, and generating target audio information according to a motion recognition result.
For example, the cloud server 22 may perform motion recognition on the original video information to obtain a motion recognition result, and generate the target audio information according to the motion recognition result. The target audio information may be broadcast by the first virtual character or another virtual character.
Optionally, performing motion recognition on the original video information, and generating target audio information according to a motion recognition result, includes: recognizing one or more continuous actions of the real character in the original video information in the second format as one word, a plurality of words forming a sentence; and converting each sentence into one piece of target audio information.
For example, the cloud server 22 may perform motion recognition on the raw video information in RGB format, specifically recognizing the motions of the real character in the original video information: one or more continuous motions of the real character are recognized as one word, a plurality of words form a sentence, and each sentence is converted into a TTS speech, where TTS stands for text-to-speech and the TTS speech may be denoted as the target audio information. TTS is a speech synthesis technology.
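Conversely, S104 can be pictured as the sketch below, in which the sign recognition model, the sentence composition step, and the TTS engine are hypothetical placeholders, and the end-of-sentence marker is an assumption used only to show how several words form one sentence.

```python
def generate_target_audio(segments, recognize_word, compose_sentence, tts):
    """Sketch of S104: segments of a signing person -> words -> sentences -> TTS target audio."""
    words, target_audio = [], []
    for segment in segments:                     # each segment: one or more continuous actions
        word = recognize_word(segment)           # a continuous action sequence maps to one word
        if word == "<eos>":                      # hypothetical end-of-sentence marker
            target_audio.append(tts(compose_sentence(words)))  # one piece of target audio per sentence
            words = []
        elif word is not None:
            words.append(word)
    return target_audio
```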
And S105, coding the target video information to obtain target video stream data, and coding the target audio information to obtain target audio stream data.
For example, the cloud server 22 may encode target video information to obtain target video stream data, and encode target audio information to obtain target audio stream data.
Optionally, generating the target video information at least according to the motion information includes: and performing three-dimensional rendering according to the action information to obtain target video information in a second format. Encoding the target video information to obtain target video stream data, comprising: converting the target video information in the second format into the target video information in the first format; and coding the target video information in the first format to obtain target video stream data.
For example, the cloud server 22 three-dimensionally renders the motion information corresponding to the first virtual character to obtain target video information in RGB format, and the cloud server 22 may convert the target video information in RGB format into target video information in YUV format, and then encode the target video information in YUV format to obtain target video stream data.
S106, pushing the target video stream data and the target audio stream data.
For example, the cloud server 22 may combine the target video stream data and the target audio stream data into one stream data and push the stream data to the terminal 21, so that the terminal 21 may obtain the target video stream data and the target audio stream data from the stream data. Further, the terminal 21 may decode the target video stream data to obtain target video information in YUV format, and convert the target video information in YUV format into target video information in RGB format. In addition, the terminal 21 may decode the target audio stream data to obtain the target audio information. The terminal 21 can play the target video information and the target audio information synchronously.
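The synchronous playback on the terminal side is typically driven by the timestamps carried in the two streams. The following sketch illustrates the idea with the audio clock as the master; the callback names and the 40 ms tolerance are assumptions, not the terminal's actual player implementation.

```python
import time

def play_synchronized(video_frames, audio_clock_ms, show_frame, tolerance_ms=40):
    """Render each decoded video frame once the audio clock reaches its timestamp (audio is master)."""
    for timestamp_ms, frame in video_frames:       # frames are (timestamp, decoded RGB image) pairs
        while audio_clock_ms() + tolerance_ms < timestamp_ms:
            time.sleep(0.005)                      # audio has not reached this frame yet -> wait
        show_frame(frame)                          # frame is due (or slightly late) -> display it
```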
In the embodiment of the present disclosure, original audio stream data and original video stream data are acquired; the original audio stream data is decoded to obtain original audio information, and the original video stream data is decoded to obtain original video information, the original video information including a real character. Further, voice recognition is performed on the original audio information, and target video information is generated according to the voice recognition result, the target video information including a first virtual character. Motion recognition is performed on the original video information, and target audio information is generated according to the motion recognition result. The target video information is encoded to obtain target video stream data, the target audio information is encoded to obtain target audio stream data, and the target video stream data and the target audio stream data are pushed. The original video information may be video information of the real character captured by the terminal, and the original audio information may be audio information of the real character captured by the terminal. The terminal encodes the video information and the audio information of the real character and sends them to the cloud server. The cloud server can generate target video information including the first virtual character according to the audio information of the real character, and generate target audio information broadcast by the first virtual character or another virtual character according to the video information of the real character. When the cloud server encodes the target video information of the first virtual character and the target audio information broadcast by the first virtual character or another virtual character and sends them to the terminal, not only is real-time video interaction between the video information of the real character and the target video information of the first virtual character achieved, but real-time interaction between the audio information of the real character and the target audio information broadcast by the first virtual character or another virtual character is also achieved. Therefore, users with hearing impairment or language impairment can effectively communicate with normal people during video communication.
In an application scenario, acquiring original audio stream data and original video stream data includes: and acquiring the original audio stream data and the original video stream data from the same multimedia stream data. As shown in fig. 2, the terminal 21 synthesizes the original audio stream data and the original video stream data into a piece of stream data, which may be recorded as audio/video stream data or multimedia stream data. The terminal 21 pushes the audio/video stream data or the multimedia stream data to the cloud server 22. The cloud server 22 obtains the original audio stream data and the original video stream data from the audio/video stream data or the multimedia stream data, that is, the cloud server 22 obtains the original audio stream data and the original video stream data from the same audio/video stream data or the multimedia stream data.
The terminal 21 shown in fig. 2 may be used by one user or by a plurality of users at the same time. For example, the user may be a normal person, and the terminal 21 may collect original audio information and original video information of the user, encode the original audio information to obtain original audio stream data, and encode the original video information to obtain original video stream data. Further, the terminal 21 synthesizes the original audio stream data and the original video stream data into one piece of stream data and sends it to the cloud server 22. The cloud server 22 obtains the original audio stream data and the original video stream data from the stream data and decodes them respectively to obtain the original audio information and the original video information. Further, the cloud server 22 may generate target audio information and target video information according to at least one of the original audio information and the original video information, where the target audio information may be a reply to the original audio information, and the target video information includes a virtual character; the virtual character may be driven by the voice or video of another real character and may be used to broadcast the target audio information. Further, the cloud server 22 may encode the target audio information to obtain target audio stream data, and encode the target video information to obtain target video stream data. The cloud server 22 may combine the target audio stream data and the target video stream data into one piece of stream data and send it to the terminal 21, so that the terminal 21 can play the target video information and the target audio information. For example, the original audio information may be the user asking "how much is watermelon per kilogram", and the target audio information may be the virtual character broadcasting "two yuan per kilogram of watermelon". The target video information may specifically be a video of the virtual character performing gesture actions, where the gesture actions of the virtual character express "two yuan per kilogram of watermelon". In this way, the user can carry out real-time video communication with the virtual character corresponding to the other user, that is, a real character can interact with a virtual character driven by the voice or video of another real character.
Taking the case where the terminal 21 is used by two users at the same time as an example, the two users are respectively denoted as a first user and a second user; the first user may be a normal person, and the second user may be a hearing-impaired person.
Specifically, the obtaining the original audio stream data and the original video stream data from the same multimedia stream data includes: and if the multimedia stream data come from a terminal shared by a first user and a second user, respectively acquiring the original audio stream data and the original video stream data from the multimedia stream data.
For example, when the normal person and the hearing-impaired person share the terminal 21, the terminal 21 may collect audio information of the normal person and video information of the hearing-impaired person, the audio information of the normal person serving as the original audio information and the video information of the hearing-impaired person serving as the original video information. The terminal 21 encodes the original audio information to obtain original audio stream data, encodes the original video information to obtain original video stream data, synthesizes the original audio stream data and the original video stream data into the same multimedia stream data, and sends the multimedia stream data to the cloud server 22, so that the cloud server 22 can obtain the original audio stream data and the original video stream data from the same multimedia stream data. Further, the cloud server 22 decodes the original audio stream data to obtain the original audio information, that is, the audio information of the normal person, and decodes the original video stream data to obtain the original video information, that is, the video information of the hearing-impaired person performing gesture actions. In this case, the cloud server 22 may translate the audio information of the normal person into video information of the first virtual character performing gesture actions, which serves as the target video information, and encode the target video information to obtain target video stream data. In addition, the cloud server 22 may translate the video information of the hearing-impaired person performing gesture actions into audio information broadcast by the first virtual character or another virtual character, which serves as the target audio information, and encode the target audio information to obtain target audio stream data.
In another application scenario, the first user and the second user may use respective terminals for remote interaction. As shown in fig. 3, the terminal 31 is a terminal of a first user, the terminal 32 is a terminal of a second user, the terminal 31 is referred to as a first terminal, and the terminal 32 is referred to as a second terminal. The terminal 31 and the terminal 32 communicate through the cloud server 33 so that the first user and the second user can remotely interact. The terminal 31 sends first multimedia stream data to the cloud server 33, where the first multimedia stream data includes original audio stream data of the first user and original video stream data of the first user. The terminal 32 sends second multimedia stream data to the cloud server 33, where the second multimedia stream data includes the original video stream data of the second user, and in addition, the second multimedia stream data may also include the original audio stream data of other sound sources in the surrounding environment of the second user.
In this case, acquiring the original audio stream data and the original video stream data includes: and acquiring the original audio stream data from the first multimedia stream data, and acquiring the original video stream data from the second multimedia stream data.
For example, when the cloud server 33 receives first multimedia stream data from the terminal 31, the original audio stream data of the first user is obtained from the first multimedia stream data. When the cloud server 33 receives the second multimedia stream data from the terminal 32, the original video stream data of the second user is obtained from the second multimedia stream data.
Specifically, acquiring the original audio stream data from the first multimedia stream data includes: if the first multimedia stream data comes from a first terminal of a first user, acquiring the original audio stream data from the first multimedia stream data; acquiring the original video stream data from second multimedia stream data, wherein the acquiring comprises the following steps: and if the second multimedia stream data comes from a second terminal of a second user, acquiring the original video stream data from the second multimedia stream data.
For example, the first multimedia stream data may further include identification information of the first user, and the second multimedia stream data includes identification information of the second user. When the cloud server 33 receives the first multimedia stream data, it is determined that the first multimedia stream data is from the terminal 31 of the first user according to the identification information of the first user included in the first multimedia stream data, and therefore, the original audio stream data of the first user is obtained from the first multimedia stream data. When the cloud server 33 receives the second multimedia stream data, it is determined that the second multimedia stream data is from the terminal 32 of the second user according to the identification information of the second user included in the second multimedia stream data, and therefore, the original video stream data of the second user is obtained from the second multimedia stream data.
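The routing decision described above, namely taking the original audio stream data from the first terminal and the original video stream data from the second terminal according to the identification information, can be sketched as follows; the dictionary-style stream structure and field names are illustrative assumptions, not the actual data layout of the multimedia stream data.

```python
def select_elementary_streams(multimedia_streams, first_user_id, second_user_id):
    """Pick original audio from the first terminal and original video from the second terminal."""
    original_audio_stream = original_video_stream = None
    for stream in multimedia_streams:                 # each stream carries the sender's identification
        if stream["user_id"] == first_user_id:
            original_audio_stream = stream["audio"]   # first user: the speaking normal person
        elif stream["user_id"] == second_user_id:
            original_video_stream = stream["video"]   # second user: the signing hearing-impaired person
    return original_audio_stream, original_video_stream
```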
Further, the cloud server 33 generates video information of the gesture action of the first virtual character, that is, target video information, according to the original audio stream data of the first user, and encodes the target video information to obtain target video stream data. In addition, the cloud server 33 may generate, according to the original video stream data of the second user, target audio information that is audio information broadcasted by the first virtual character or another virtual character, and encode the target audio information to obtain target audio stream data. For a specific generation process, reference may be made to the contents of the foregoing embodiments, which are not described herein again.
The cloud server may push the target video stream data and the target audio stream data to the terminal that provides the multimedia stream data to the cloud server, or to a terminal other than that terminal. In addition, the cloud server is not limited to pushing the target video stream data and the target audio stream data to the same terminal; it may also push them to different terminals respectively. That is, the terminals to which the cloud server pushes the target video stream data and the target audio stream data are not limited to the terminals that provide the multimedia stream data to the cloud server in the above application scenarios. These cases are described one by one below.
In one possible scenario, pushing the target video stream data and the target audio stream data comprises: and pushing the target video stream data and the target audio stream data to the same terminal.
For example, the cloud server 22 may combine the target video stream data and the target audio stream data into one piece of stream data and push it to the same terminal, which may be the terminal 21 shown in fig. 2 or another terminal other than the terminal 21. Taking the terminal 21 as an example, after the terminal 21 acquires the target video stream data and the target audio stream data from the stream data, the terminal 21 decodes the target video stream data and the target audio stream data respectively to obtain the video information of the first virtual character performing gesture actions and the audio information broadcast by the first virtual character or another virtual character; the terminal 21 can then play the video information of the first virtual character performing gesture actions and the audio information broadcast by the first virtual character or another virtual character. The terminal 21 may be a terminal shared by a normal person and a hearing-impaired person, in which case the normal person and the hearing-impaired person can communicate and interact normally through the terminal 21, thereby implementing sign language translation. Alternatively, the terminal 21 may be a terminal used by a single user, and the user may be a normal person, so that the user can carry out real-time video communication with the virtual character corresponding to another user, that is, the real character can interact with a virtual character driven by the voice or video of another real character.
When the same terminal is a terminal other than the terminal 21, the cloud server 22 may receive the original video stream data and the original audio stream data from the terminal 21 and push the target video stream data and the target audio stream data to that other terminal.
In another possible scenario, pushing the target video stream data and the target audio stream data comprises: and pushing the target video streaming data and the target audio streaming data to a first terminal, or pushing the target video streaming data and the target audio streaming data to a second terminal.
For example, as shown in fig. 2, the terminal 21 may be referred to as a first terminal, and the other terminals except the terminal 21 may be referred to as second terminals. The cloud server 22 may push the target video stream data and the target audio stream data to the first terminal. Or the cloud server 22 may push the target video stream data and the target audio stream data to the second terminal. Still alternatively, the cloud server 22 may push the target video stream data and the target audio stream data to both the first terminal and the second terminal.
As another example, as shown in fig. 3, the cloud server 33 may push the target video stream data and the target audio stream data to the terminal 31. Therefore, the user interface of the terminal 31 may display the video information of the first virtual character performing gesture actions, that is, the target video information, and the terminal 31 may also play the audio information broadcast by the first virtual character or another virtual character, that is, the target audio information. For example, the cloud server 33 translates the original video information of the hearing-impaired person's gesture actions into the target audio information "how much is watermelon per kilogram"; after the terminal 31 acquires the target audio information, the first virtual character or another virtual character is displayed in the user interface, and that virtual character broadcasts "how much is watermelon per kilogram". The original audio information replied by the first user of the terminal 31 is "two yuan per kilogram of watermelon", and the cloud server 33 translates this reply into the target video information of the first virtual character performing gesture actions, so that the user interface of the terminal 31 can play the target video information, in which the gesture actions of the first virtual character represent "two yuan per kilogram of watermelon". For example, as shown in fig. 4, the user interface of the terminal 31 includes a first area 311 and a second area 312; the terminal 31 may display the other virtual character in the first area 311, which broadcasts "how much is watermelon per kilogram", and display the first virtual character in the second area 312, where the gesture actions of the first virtual character represent "two yuan per kilogram of watermelon". Alternatively, the cloud server 33 may push the target video stream data and the target audio stream data to the terminal 32, so that the user interface of the terminal 32 is consistent with the user interface of the terminal 31 shown in fig. 4, which is not described herein again. Still alternatively, the cloud server 33 may push the target video stream data and the target audio stream data to both the terminal 31 and the terminal 32, so that the terminal 31 and the terminal 32 play the same target video information and target audio information.
Specifically, the pushing the target video stream data and the target audio stream data to the first terminal includes: and pushing the target video stream data, the target audio stream data and the original video stream data of a second user to a first terminal.
For example, in some embodiments, the cloud server 33 may push not only the target video stream data and the target audio stream data to the terminal 31, but also the original video stream data of the hearing-impaired person performing gesture actions. The terminal 31 decodes the target video stream data, the target audio stream data, and the original video stream data of the hearing-impaired person respectively to obtain the target video information, the target audio information, and the original video information of the hearing-impaired person performing gesture actions. As shown in fig. 5, the terminal 31 may play the original video information of the hearing-impaired person performing gesture actions in the first area 311, where the gesture actions of the hearing-impaired person indicate "how much is watermelon per kilogram"; the terminal 31 may play the target audio information "how much is watermelon per kilogram" and display the subtitle "how much is watermelon per kilogram" in the first area 311. The original audio information replied by the first user of the terminal 31 is "two yuan per kilogram of watermelon"; the cloud server 33 translates this reply into the target video information of the first virtual character performing gesture actions, and the terminal 31 plays the target video information and the subtitle "two yuan per kilogram of watermelon" in the second area 312. In addition, the present embodiment does not limit the positional relationship between the first area and the second area; the first area and the second area shown in fig. 4 or fig. 5 are only illustrative and not limiting.
In addition, the user interface of the terminal 32 may also play the original video information of the gesture action performed by the hearing-impaired person, and at this time, the user interface of the terminal 32 is consistent with the user interface of the terminal 31 shown in fig. 5, which is not described herein again.
In yet another possible scenario, pushing the target video stream data and the target audio stream data comprises: and pushing the target audio stream data to a first terminal, and pushing the target video stream data to a second terminal.
For example, the cloud server 33 may push the target audio stream data to the terminal 31 and push the target video stream data to the terminal 32, so that the terminal 31 can play the target audio information, such as "how much is watermelon per kilogram", and the terminal 32 can play the target video information of the first virtual character performing gesture actions, where the gesture actions of the first virtual character represent "two yuan per kilogram of watermelon".
Specifically, the pushing the target audio stream data to the first terminal includes: and pushing the target audio stream data and preset video stream data to the first terminal, wherein the preset video stream data comprises video stream data of a second virtual character and/or original video stream data of a second user. The mouth shape of the second avatar matches the target audio information.
For example, the cloud server 33 may push the target audio stream data to the terminal 31 and may also push the preset video stream data to the terminal 31. The terminal 31 decodes the target audio stream data and the preset video stream data respectively to obtain the target audio information and the preset video information. As shown in fig. 6, the terminal 31 plays the video information of the second virtual character in the upper-right corner area 61 of the user interface, and the second virtual character is used to broadcast the target audio information, for example, "how much is watermelon per kilogram"; specifically, the mouth shape of the second virtual character matches "how much is watermelon per kilogram". In addition, the terminal 31 may display the original video information of the first user in the region other than the upper-right corner area 61.
In some embodiments, the preset video information may be the original video information of the second user. For example, as shown in fig. 7, the terminal 31 plays the original video information of the hearing-impaired person performing gesture actions in the upper-right corner area 61, where the gesture actions of the hearing-impaired person indicate "how much is watermelon per kilogram"; the terminal 31 may play the target audio information "how much is watermelon per kilogram" and display the subtitle "how much is watermelon per kilogram" in the area 61. In addition, the terminal 31 may display the original video information of the first user in the region other than the upper-right corner area 61.
In some other embodiments, the preset video information may include both the video information of the second virtual character and the original video information of the second user. For example, as shown in fig. 8, the terminal 31 may play the original video information of the hearing-impaired person performing gesture actions in the area 61 and play the video information of the second virtual character in the area 62, where the second virtual character is used to broadcast the target audio information, such as "how much is watermelon per kilogram"; specifically, the mouth shape of the second virtual character matches "how much is watermelon per kilogram".
In the embodiment of the present disclosure, the cloud server translates the audio information of the normal person into video information of the first virtual character performing gesture actions, and translates the video information of the hearing-impaired person performing gesture actions into audio information broadcast by the first virtual character or another virtual character. The cloud server can thus realize sign language translation, that is, provide both sign-language-to-speech and speech-to-sign-language functions, so that the normal person and the hearing-impaired person can communicate normally through speech and sign language. Moreover, the speech of the normal person and the video of the hearing-impaired person can be transmitted to the cloud server in real time, so that the cloud server can recognize the speech or video in real time, ensuring low-latency communication and interaction between the normal person and the hearing-impaired person.
Fig. 9 is a flowchart of an interaction method according to another embodiment of the disclosure. The method comprises the following specific steps:
and S901, acquiring original audio stream data and original video stream data.
As shown in fig. 10, the terminal 31 pushes the audio/video stream to the audio/video server, and the audio/video server extracts the original video stream data and the original audio stream data from the audio/video stream.
S902, decoding the original audio stream data to obtain original audio information.
As shown in fig. 10, the audio/video server may decode the original audio stream data to obtain original audio information, where the original audio information may be PCM data or data obtained by further processing the PCM data.
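As an illustration of S901-S902, the following sketch demultiplexes an incoming audio/video stream and decodes the audio packets into PCM samples. The use of the PyAV library and the RTMP address are assumptions made for this example and are not part of the disclosure.

```python
# Minimal sketch of S901-S902, assuming PyAV and an illustrative RTMP push address.
# Demultiplex the audio/video stream, then decode the audio packets into PCM samples.
import av
import numpy as np

def extract_and_decode(url: str = "rtmp://example/live/stream1"):
    container = av.open(url)
    video_packets = []   # original video stream data, decoded separately in S908
    pcm_chunks = []      # original audio information as PCM sample arrays

    for packet in container.demux():
        if packet.stream.type == "video":
            video_packets.append(packet)
        elif packet.stream.type == "audio":
            for frame in packet.decode():              # decompress (e.g. AAC) ...
                pcm_chunks.append(frame.to_ndarray())  # ... into raw PCM samples
    pcm = np.concatenate(pcm_chunks, axis=-1) if pcm_chunks else None
    return video_packets, pcm
```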
S903, carrying out voice recognition on the original audio information to obtain text information.
As shown in fig. 10, the audio/video server may perform speech recognition on the original audio information to obtain text information.
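The disclosure does not tie the speech recognition step to a particular engine. As one hedged illustration, an off-the-shelf open-source recognizer such as Whisper could turn the decoded audio into text information; the model name and the file path below are placeholders.

```python
# Illustrative only: the disclosure does not name a speech recognizer.
# openai-whisper is assumed to be installed; any ASR engine could be substituted.
import whisper

model = whisper.load_model("base")                 # small general-purpose model
result = model.transcribe("original_audio.wav")    # placeholder path to the decoded audio
text_information = result["text"]                  # text information used in S904
print(text_information)
```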
S904, generating action information corresponding to the first virtual character according to the text information.
For example, the audio/video server may process the text information by Natural Language Processing (NLP) technology to obtain motion information corresponding to the first avatar, such as a three-dimensional motion shown in fig. 10.
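How the text information is mapped to motion information is left open here. The sketch below assumes a hypothetical gesture library keyed by sign-language glosses; in a real system the NLP component would perform proper segmentation and gloss selection rather than this crude word lookup.

```python
# Hypothetical mapping from recognized text to motion information for the first virtual character.
# GESTURE_LIBRARY and the naive segmentation are illustrative assumptions, not the actual NLP step.
GESTURE_LIBRARY = {
    "watermelon": "gesture_watermelon_v1",
    "how_much":   "gesture_how_much_v1",
}

def text_to_action_info(text: str) -> list:
    tokens = text.lower().replace("?", "").split()
    # collapse the price question into a single sign-language gloss
    glosses = ["how_much" if t == "much" else t for t in tokens]
    return [GESTURE_LIBRARY[g] for g in glosses if g in GESTURE_LIBRARY]

action_info = text_to_action_info("How much is the watermelon?")
# -> ["gesture_how_much_v1", "gesture_watermelon_v1"]
```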
S905, performing three-dimensional rendering according to the action information to obtain target video information in a second format.
For example, the audio/video server may perform three-dimensional rendering on the three-dimensional motion to obtain target video information of the first virtual character performing the gesture motion, where the target video information may be in an RGB format.
S906, converting the target video information in the second format into the target video information in the first format.
For example, the audio-video server may convert the target video information in RGB format into the target video information in YUV format.
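For reference, the RGB-to-YUV conversion of S906 can be expressed with the full-range BT.601 coefficients. A production pipeline would typically delegate this to an optimized routine (e.g. libswscale or OpenCV), so the NumPy sketch below is only illustrative.

```python
# Full-range BT.601 RGB -> YUV conversion (S906), illustrative NumPy sketch.
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """rgb: H x W x 3 uint8 frame from S905; returns H x W x 3 YUV (4:4:4) uint8."""
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    yuv = np.stack([y, cb, cr], axis=-1)
    return np.clip(yuv, 0, 255).astype(np.uint8)

# Chroma subsampling (e.g. to the YUV420P layout expected by the encoder) would follow this step.
```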
S907, encoding the target video information in the first format to obtain target video stream data.
For example, the audio/video server encodes the target video information in the YUV format to obtain target video stream data.
S908, decoding the original video stream data to obtain original video information in a first format.
For example, the audio/video server may decode the original video stream data to obtain the original video information in YUV format, where the YUV format may be specifically YUV 420P.
S909, converting the original video information in the first format into original video information in a second format, wherein the original video information comprises real characters.
For example, the audio/video server may convert the raw video information in the YUV format into the raw video information in the RGB format, where the raw video information includes the real person acquired by the terminal 31.
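A compact way to perform the YUV420P-to-RGB conversion of S908-S909 is to hand the planar buffer to OpenCV, as in the sketch below; the use of OpenCV and the frame dimensions are assumptions for illustration.

```python
# Sketch of S908-S909: reinterpret a planar YUV420P buffer and convert it to RGB with OpenCV.
import cv2
import numpy as np

def yuv420p_to_rgb(yuv_bytes: bytes, width: int, height: int) -> np.ndarray:
    # a YUV420P frame occupies width * height * 3 / 2 bytes (Y plane, then U, then V)
    yuv = np.frombuffer(yuv_bytes, dtype=np.uint8).reshape(height * 3 // 2, width)
    return cv2.cvtColor(yuv, cv2.COLOR_YUV2RGB_I420)   # H x W x 3 RGB frame
```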
S910, one or more continuous actions of the real character in the original video information in the second format are recognized as a word, and a plurality of words form a sentence; each sentence is converted into one piece of target audio information.
For example, the audio/video server performs motion recognition on the original video information in RGB format, recognizing one or more continuous motions of the real person as a word. A plurality of words can form a sentence through NLP; each sentence is converted into TTS speech, and the TTS speech is placed in a TTS speech queue. The TTS speech serves as the target audio information.
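The sentence-assembly and queuing logic of S910 might look like the following sketch, in which the word recognizer and the TTS synthesizer are stand-in placeholders, since the disclosure does not prescribe specific components.

```python
# Sketch of S910: buffer recognized sign-language words into sentences and queue TTS audio.
# synthesize_tts() is a placeholder for a real speech-synthesis engine.
from queue import Queue

SENTENCE_END_MARKERS = {"?", ".", "<end>"}   # assumed sentence-boundary markers
tts_queue: Queue = Queue()                   # TTS speech queue (target audio information)

def synthesize_tts(sentence: str) -> bytes:
    # placeholder: a real system would return synthesized speech for the sentence
    return sentence.encode("utf-8")

def on_word_recognized(word: str, word_buffer: list) -> None:
    """Called each time one or more continuous motions are recognized as a word."""
    word_buffer.append(word)
    if word in SENTENCE_END_MARKERS or len(word_buffer) >= 16:
        sentence = " ".join(w for w in word_buffer if w not in SENTENCE_END_MARKERS)
        word_buffer.clear()
        tts_queue.put(synthesize_tts(sentence))   # one piece of target audio per sentence
```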
S911, coding the target audio information to obtain target audio stream data.
For example, the audio/video server may encode each TTS speech in the TTS speech queue to obtain target audio stream data.
S912, pushing the target video stream data and the target audio stream data.
The process from S902 to S907 and the process from S908 to S911 may be performed in parallel.
For example, as shown in fig. 10, the audio/video server may synthesize the target video stream data and the target audio stream data into one piece of stream data and push this stream data to the terminal 31, so that the terminal 31 can pull the stream data, that is, pull the audio/video stream.
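A possible way to synthesize the two streams into one piece of stream data and push it is sketched below, again assuming PyAV and an illustrative RTMP address; codec choices and frame sizes are likewise assumptions.

```python
# Sketch of S912: synthesize target video stream data and target audio stream data into one
# stream and push it. PyAV and the RTMP address are assumptions for this illustration.
import av

def push_translated_stream(video_frames, audio_frames,
                           url: str = "rtmp://example/live/translated"):
    out = av.open(url, mode="w", format="flv")
    v_out = out.add_stream("h264", rate=25)
    v_out.width, v_out.height, v_out.pix_fmt = 1280, 720, "yuv420p"
    a_out = out.add_stream("aac", rate=44100)

    for vf in video_frames:                 # av.VideoFrame objects built from the target video
        for packet in v_out.encode(vf):
            out.mux(packet)
    for af in audio_frames:                 # av.AudioFrame objects built from the TTS speech
        for packet in a_out.encode(af):
            out.mux(packet)

    for packet in v_out.encode():           # flush the video encoder
        out.mux(packet)
    for packet in a_out.encode():           # flush the audio encoder
        out.mux(packet)
    out.close()
```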
It will be appreciated that the processing in the dashed box 101 shown in fig. 10 may be performed by the audio/video server; that is, the cloud server described above may be the audio/video server.
In some embodiments, the processing in the dashed box 101 shown in fig. 10 may be performed by another server; that is, the cloud server may be a server other than the audio/video server described above. In this case, the audio/video server forwards the audio/video stream it receives from the terminal 31 to the other server, and the other server performs the corresponding processing. After the other server generates the target video stream data and the target audio stream data, it may synthesize them into one piece of stream data and push the stream data to the audio/video server, which forwards the stream data to the terminal 31.
In addition, in some other embodiments, the processing in the dashed box 101 may be partially executed on the audio/video server and partially on other servers; for example, the NLP may be executed on another server, while the remaining processing other than the NLP is executed on the audio/video server.
In addition, not only may the terminal 31 provide an audio/video stream (e.g., the first multimedia stream data) to the audio/video server, but the terminal 32 may also provide an audio/video stream (e.g., the second multimedia stream data) to the audio/video server. The audio/video server may obtain the original audio stream data of the first user from the first multimedia stream data, and obtain the original video stream data of the second user from the second multimedia stream data.
In addition, the audio/video server can not only push the stream data synthesized by the target video stream data and the target audio stream data to the terminal 31, but also push the stream data to the terminal 32, so that the terminal 31 and the terminal 32 can respectively pull the audio/video stream from the audio/video server. Alternatively, the audio/video server may push the target video stream data to the terminal 32 and push the target audio stream data to the terminal 31. The specific implementation process is as described in the above embodiments, and is not described herein again.
In the embodiment of the disclosure, the cloud server translates the audio information of the normal person into video information of the first virtual character performing gesture actions, and translates the video information of the hearing-impaired person performing gesture actions into audio information broadcast by the first virtual character or another virtual character. The cloud server can therefore perform sign language translation, that is, provide both sign-language-to-speech and speech-to-sign-language functions, so that the normal person and the hearing-impaired person can communicate normally through speech and sign language. Moreover, the speech of the normal person and the video of the hearing-impaired person can be transmitted to the cloud server in real time, so that the cloud server can recognize the speech or the video in real time, ensuring low-latency communication between the normal person and the hearing-impaired person.
Fig. 11 is a schematic structural diagram of an interaction device according to an embodiment of the present disclosure. The interaction device may be a cloud server as described above, or the interaction device may be a component or assembly in a cloud server. The interaction apparatus provided in the embodiment of the present disclosure may execute the processing flow provided in the embodiment of the interaction method, as shown in fig. 11, the interaction apparatus 110 includes:
the acquiring module 111 is configured to acquire original audio stream data and original video stream data;
a decoding module 112, configured to decode the original audio stream data to obtain original audio information, and decode the original video stream data to obtain original video information, where the original video information includes a real person;
a generating module 113, configured to perform voice recognition on the original audio information, and generate target video information according to a voice recognition result, where the target video information includes a first virtual character; performing action recognition on the original video information, and generating target audio information according to an action recognition result;
the encoding module 114 is configured to encode the target video information to obtain target video stream data, and encode the target audio information to obtain target audio stream data;
a pushing module 115, configured to push the target video stream data and the target audio stream data.
Optionally, the acquiring module 111 is specifically configured to:
acquiring the original audio stream data and the original video stream data from the same multimedia stream data; or
And acquiring the original audio stream data from the first multimedia stream data, and acquiring the original video stream data from the second multimedia stream data.
Optionally, when the acquiring module 111 acquires the original audio stream data and the original video stream data from the same multimedia stream data, the acquiring module 111 is specifically configured to: if the multimedia stream data comes from a terminal shared by a first user and a second user, respectively acquiring the original audio stream data and the original video stream data from the multimedia stream data. For example, in the scenario shown in fig. 2, the first user and the second user share the terminal 21, and the terminal 21 sends multimedia stream data to the cloud server 22, so that the acquiring module 111 can acquire the original audio stream data (e.g., the original audio stream data of the first user) and the original video stream data (e.g., the original video stream data of the second user) from the same multimedia stream data. Alternatively, the terminal 21 may be used by a single user (e.g., the first user), and the multimedia stream data sent by the terminal 21 to the cloud server 22 is the multimedia stream data of the first user, so that the acquiring module 111 can acquire the original audio stream data (e.g., the original audio stream data of the first user) and the original video stream data (e.g., the original video stream data of the first user) from the same multimedia stream data.
Optionally, when the acquiring module 111 acquires the original audio stream data from the first multimedia stream data, the acquiring module 111 is specifically configured to: if the first multimedia stream data comes from a first terminal of a first user, acquiring the original audio stream data from the first multimedia stream data; when the acquiring module 111 acquires the original video stream data from the second multimedia stream data, the acquiring module 111 is specifically configured to: if the second multimedia stream data comes from a second terminal of a second user, acquiring the original video stream data from the second multimedia stream data. For example, in the scenario shown in fig. 3, the terminal 31 sends the first multimedia stream data to the cloud server 33, and the terminal 32 sends the second multimedia stream data to the cloud server 33. The cloud server 33 acquires the original audio stream data (e.g., the original audio stream data of the first user) from the first multimedia stream data, and acquires the original video stream data (e.g., the original video stream data of the second user) from the second multimedia stream data.
Optionally, when the pushing module 115 pushes the target video stream data and the target audio stream data, it is specifically used in at least one of the following situations:
one situation is: and pushing the target video stream data and the target audio stream data to the same terminal. For example, the pushing module 115 may push the target video stream data and the target audio stream data to a terminal (i.e., the same terminal) shared by the first user and the second user, so that the first user and the second user can effectively communicate and exchange through the terminal. Or, when the terminal is used by a user, the pushing module 115 may also push the target video stream data and the target audio stream data to the terminal, so that the user may perform an interaction, a video conversation, with a virtual character corresponding to the user.
Another situation is: pushing the target video stream data and the target audio stream data to a first terminal, or pushing the target video stream data and the target audio stream data to a second terminal. For example, as shown in fig. 2, the terminal 21 may be referred to as the first terminal, and the terminals other than the terminal 21 may be referred to as second terminals. The cloud server 22 may push the target video stream data and the target audio stream data to the first terminal, or to the second terminal, or to both the first terminal and the second terminal. Similarly, as shown in fig. 3, the cloud server 33 may push the target video stream data and the target audio stream data to the terminal 31, or to the terminal 32, or to both the terminal 31 and the terminal 32.
Yet another situation is: pushing the target audio stream data to a first terminal, and pushing the target video stream data to a second terminal. For example, as shown in fig. 3, the cloud server 33 may push the target audio stream data to the terminal 31 and push the target video stream data to the terminal 32, so that a normal person and a hearing-impaired person can interact remotely.
Optionally, when the pushing module 115 pushes the target video stream data and the target audio stream data to the first terminal, the pushing module is specifically configured to: and pushing the target video stream data, the target audio stream data and the original video stream data of a second user to a first terminal.
Optionally, when the pushing module 115 pushes the target audio stream data to the first terminal, the pushing module is specifically configured to: and pushing the target audio stream data and preset video stream data to the first terminal, wherein the preset video stream data comprises video stream data of a second virtual character and/or original video stream data of a second user.
Optionally, the mouth shape of the second virtual character matches the target audio information.
Optionally, when the decoding module 112 decodes the original video stream data to obtain the original video information, the decoding module is specifically configured to: decoding the original video stream data to obtain original video information in a first format; converting the original video information in the first format into original video information in a second format; the generating module 113 is specifically configured to, when performing motion recognition on the original video information and generating target audio information according to a motion recognition result: recognizing one or more continuous actions of the real character in the original video information in the second format as a word, wherein a plurality of words form a sentence; each sentence is converted into one piece of target audio information.
Optionally, when the generating module 113 performs speech recognition on the original audio information and generates target video information according to a speech recognition result, the generating module is specifically configured to: carrying out voice recognition on the original audio information to obtain text information; generating action information corresponding to the first virtual character according to the text information; and generating target video information at least according to the action information. The target video information includes at least the motion video of the first avatar, and in other embodiments, other information, such as subtitles, may be included in the target video information.
Optionally, when the generating module 113 generates the target video information at least according to the action information, the generating module is specifically configured to: performing three-dimensional rendering according to the action information to obtain an action video of a first virtual character; generating a subtitle according to the text information; and adding the subtitle to the action video of the first virtual character to obtain the target video information. That is, in this manner, the target video information includes not only the motion video of the first virtual character but also subtitles, and the specific format of the target video information generated by the generation module 113 is not limited.
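As an illustration of adding the subtitle to the motion video, the sketch below draws the caption text onto each rendered frame with OpenCV; the font, position, and the OpenCV dependency are assumptions (Hershey fonts cover only ASCII, so CJK captions would need a different text renderer such as PIL).

```python
# Illustrative subtitle overlay onto a rendered frame of the first virtual character's motion video.
import cv2
import numpy as np

def add_subtitle(frame_rgb: np.ndarray, subtitle: str) -> np.ndarray:
    framed = frame_rgb.copy()
    h = framed.shape[0]
    cv2.putText(framed, subtitle, (20, h - 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2, cv2.LINE_AA)
    return framed
# Note: Hershey fonts render ASCII only; CJK captions would need e.g. PIL's ImageDraw.
```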
However, in some other embodiments, the target video information generated by the generation module 113 may be in RGB format, and the RGB format may need to be converted into YUV format before the target video information is encoded. Optionally, when the generating module 113 generates the target video information at least according to the action information, the generating module is specifically configured to: performing three-dimensional rendering according to the action information to obtain target video information in a second format (for example, an RGB format); when the encoding module 114 encodes the target video information to obtain target video stream data, the encoding module is specifically configured to: converting the target video information in the second format into target video information in the first format (e.g., YUV format); and coding the target video information in the first format to obtain target video stream data.
The interaction apparatus in the embodiment shown in fig. 11 may be used to implement the technical solution of the above method embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
The internal functions and structure of the interaction apparatus, which can be implemented as an electronic device, are described above. Fig. 12 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure. As shown in fig. 12, the electronic device includes a memory 121 and a processor 122.
The memory 121 stores programs. In addition to the above-described programs, the memory 121 may also be configured to store other various data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phonebook data, messages, pictures, videos, and so forth.
The memory 121 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A processor 122, coupled to the memory 121, for executing the program stored in the memory 121 to:
acquiring original audio stream data and original video stream data;
decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises real figures;
performing voice recognition on the original audio information, and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character;
performing action recognition on the original video information, and generating target audio information according to an action recognition result;
coding the target video information to obtain target video stream data, and coding the target audio information to obtain target audio stream data;
and pushing the target video stream data and the target audio stream data.
Further, as shown in fig. 12, the electronic device may further include: communication components 123, power components 124, audio components 125, display 126, and other components. Only some of the components are schematically shown in fig. 12, and the electronic device is not meant to include only the components shown in fig. 12.
When the electronic device is a cloud server as described in the above method embodiments, the memory 121 stores a program, and the processor 122 can execute the program and implement the steps in the above method embodiments. The communication component 123 may be used to communicate with the terminal, for example, to receive multimedia stream data sent by the terminal, or to push target audio stream data and/or target video stream data to the terminal.
When the electronic device is a terminal as described in the above method embodiments, the electronic device may further include a video capture module configured to capture original video information, and the audio component 125 is configured to capture original audio information. The processor 122 may perform operations such as format conversion and encoding on the original video information, and operations such as PCM processing and encoding on the original audio information. The communication component 123 may send the original video stream data and the original audio stream data to the cloud server, so that the cloud server can perform the steps in the method embodiments described above.
The communication component 123 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 123 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 123 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
A power supply component 124 that provides power to the various components of the electronic device. The power components 124 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for an electronic device.
Audio component 125 is configured to output and/or input audio signals. For example, the audio component 125 includes a Microphone (MIC) configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 121 or transmitted via the communication component 123. In some embodiments, audio component 125 also includes a speaker for outputting audio signals.
The display 126 includes a screen, which may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
In addition, the embodiment of the present disclosure also provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the interaction method described in the above embodiment.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (15)

1. An interaction method, wherein the method comprises:
acquiring original audio stream data and original video stream data;
decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises real figures;
performing voice recognition on the original audio information, and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character;
performing action recognition on the original video information, and generating target audio information according to an action recognition result;
coding the target video information to obtain target video stream data, and coding the target audio information to obtain target audio stream data;
and pushing the target video stream data and the target audio stream data.
2. The method of claim 1, wherein obtaining raw audio stream data and raw video stream data comprises:
acquiring the original audio stream data and the original video stream data from the same multimedia stream data; or
And acquiring the original audio stream data from the first multimedia stream data, and acquiring the original video stream data from the second multimedia stream data.
3. The method according to claim 2, wherein obtaining the original audio stream data and the original video stream data from the same multimedia stream data comprises:
and if the multimedia stream data come from a terminal shared by a first user and a second user, respectively acquiring the original audio stream data and the original video stream data from the multimedia stream data.
4. The method of claim 2, wherein obtaining the original audio stream data from first multimedia stream data comprises:
if the first multimedia stream data comes from a first terminal of a first user, acquiring the original audio stream data from the first multimedia stream data;
acquiring the original video stream data from second multimedia stream data, wherein the acquiring comprises the following steps:
and if the second multimedia stream data comes from a second terminal of a second user, acquiring the original video stream data from the second multimedia stream data.
5. The method of claim 1, wherein pushing the target video stream data and the target audio stream data comprises:
pushing the target video stream data and the target audio stream data to the same terminal; or
Pushing the target audio stream data to a first terminal, and pushing the target video stream data to a second terminal; or
And pushing the target video streaming data and the target audio streaming data to a first terminal, or pushing the target video streaming data and the target audio streaming data to a second terminal.
6. The method of claim 5, wherein pushing the target video stream data and the target audio stream data to the first terminal comprises:
and pushing the target video stream data, the target audio stream data and the original video stream data of a second user to a first terminal.
7. The method of claim 5, wherein pushing the target audio stream data to the first terminal comprises:
pushing the target audio stream data and preset video stream data to the first terminal, wherein the preset video stream data comprises video stream data of a second virtual character and/or original video stream data of a second user;
wherein the mouth shape of the second avatar matches the target audio information.
8. The method of claim 1, wherein decoding the original video stream data to obtain original video information comprises:
decoding the original video stream data to obtain original video information in a first format;
converting the original video information in the first format into original video information in a second format;
correspondingly, the action recognition is carried out on the original video information, and the target audio information is generated according to the action recognition result, and the method comprises the following steps:
recognizing one or more continuous actions of the real character in the original video information in the second format as a word, wherein the words form a sentence;
each sentence is converted into one target audio information.
9. The method of claim 1, wherein performing speech recognition on the original audio information and generating target video information according to a speech recognition result comprises:
carrying out voice recognition on the original audio information to obtain text information;
generating action information corresponding to the first virtual character according to the text information;
generating target video information at least according to the action information;
wherein generating the target video information at least according to the action information comprises:
performing three-dimensional rendering according to the action information to obtain target video information in a second format;
encoding the target video information to obtain target video stream data, comprising:
converting the target video information in the second format into the target video information in the first format;
and coding the target video information in the first format to obtain target video stream data.
10. An interactive apparatus, wherein the apparatus comprises:
the acquisition module is used for acquiring original audio stream data and original video stream data;
the decoding module is used for decoding the original audio stream data to obtain original audio information, and decoding the original video stream data to obtain original video information, wherein the original video information comprises real characters;
the generating module is used for carrying out voice recognition on the original audio information and generating target video information according to a voice recognition result, wherein the target video information comprises a first virtual character; performing action recognition on the original video information, and generating target audio information according to an action recognition result;
the coding module is used for coding the target video information to obtain target video stream data and coding the target audio information to obtain target audio stream data;
and the pushing module is used for pushing the target video stream data and the target audio stream data.
11. The apparatus of claim 10, wherein the acquisition module is further configured to:
acquiring the original audio stream data and the original video stream data from the same multimedia stream data; or
And acquiring the original audio stream data from the first multimedia stream data, and acquiring the original video stream data from the second multimedia stream data.
12. The apparatus according to claim 10, wherein the decoding module, when decoding the original video stream data to obtain the original video information, is specifically configured to:
decoding the original video stream data to obtain original video information in a first format;
converting the original video information in the first format into original video information in a second format;
the generating module is specifically configured to, when performing motion recognition on the original video information and generating target audio information according to a motion recognition result:
recognizing one or more continuous actions of the real character in the original video information in the second format as a word, wherein the words form a sentence;
each sentence is converted into one target audio information.
13. The apparatus according to claim 10, wherein the generating module is configured to perform speech recognition on the original audio information and generate the target video information according to a speech recognition result, and is configured to:
carrying out voice recognition on the original audio information to obtain text information;
generating action information corresponding to the first virtual character according to the text information;
generating target video information at least according to the action information;
when the generating module generates the target video information at least according to the action information, the generating module is specifically configured to: performing three-dimensional rendering according to the action information to obtain target video information in a second format;
when the encoding module encodes the target video information to obtain target video stream data, the encoding module is specifically configured to:
converting the target video information in the second format into the target video information in the first format;
and coding the target video information in the first format to obtain target video stream data.
14. An electronic device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-9.
15. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-9.
CN202111507859.8A 2021-12-10 2021-12-10 Interaction method, device, equipment and storage medium Pending CN113923471A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111507859.8A CN113923471A (en) 2021-12-10 2021-12-10 Interaction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113923471A true CN113923471A (en) 2022-01-11

Family

ID=79248509

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111507859.8A Pending CN113923471A (en) 2021-12-10 2021-12-10 Interaction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113923471A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038617A (en) * 2013-03-04 2014-09-10 联想移动通信科技有限公司 Calling method and intelligent mobile terminal
CN107707726A (en) * 2016-08-09 2018-02-16 深圳市鹏华联宇科技通讯有限公司 A kind of terminal and call method communicated for normal person with deaf-mute
CN106254960A (en) * 2016-08-30 2016-12-21 福州瑞芯微电子股份有限公司 A kind of video call method for communication disorders and system
US20190244623A1 (en) * 2018-02-02 2019-08-08 Max T. Hall Method of translating and synthesizing a foreign language
CN110533750A (en) * 2019-07-10 2019-12-03 浙江工业大学 A method of it converts the audio into as the sign language animation with customized 3D role
CN113379879A (en) * 2021-06-24 2021-09-10 北京百度网讯科技有限公司 Interaction method, device, equipment, storage medium and computer program product

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114554267A (en) * 2022-02-22 2022-05-27 上海艾融软件股份有限公司 Audio and video synchronization method and device based on digital twin technology
CN114554267B (en) * 2022-02-22 2024-04-02 上海艾融软件股份有限公司 Audio and video synchronization method and device based on digital twin technology


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220111