CN118113249A - Audio data processing method, related device, equipment and storage medium - Google Patents

Audio data processing method, related device, equipment and storage medium

Info

Publication number
CN118113249A
Authority
CN
China
Prior art keywords
data
audio
voice
audio data
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211530099.7A
Other languages
Chinese (zh)
Inventor
高鹏
曾维亿
左小祥
程君
尚海豹
贾永库
蔡书翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211530099.7A priority Critical patent/CN118113249A/en
Publication of CN118113249A publication Critical patent/CN118113249A/en
Pending legal-status Critical Current


Landscapes

  • Stereophonic System (AREA)

Abstract

The application discloses an audio data processing method, a related device, equipment and a storage medium, whose application scenes cover at least various terminals, such as a mobile phone, a computer and a vehicle-mounted terminal. The method comprises the following steps: acquiring first voice data from a first object; responding to the first voice data, acquiring scene association information of a first virtual object for a virtual scene; acquiring first adjustment data for the first voice data according to the scene association information, wherein the first adjustment data comprises at least one of audio attribute parameters and accompanying audio data; processing the first voice data by adopting the first adjustment data to obtain first audio data; and playing the first audio data. By packaging the voice component as a plug-in and introducing it into the audio engine local to the terminal, the application provides developers with a tool for designing immersive voice and realizes an immersive voice effect that follows scene changes.

Description

Audio data processing method, related device, equipment and storage medium
Technical Field
The present application relates to the field of multimedia technologies and the field of cloud technologies, and in particular, to an audio data processing method, a related device, equipment, and a storage medium.
Background
Game socialization has been an important trend in the game industry in recent years, and real-time interaction and social platform capabilities are its two major keys. As in-game voice becomes more popular, voice communication in games has become a necessity for players. Voice interaction plays a key role both in making friends and in communicating tactical content within the game.
In order to add voice capability to a game, in the related technology commonly used in the industry, a developer separately accesses a software development kit (SDK) provided by a voice platform-as-a-service (PaaS), and implements the various in-game voice service scenarios through the basic application programming interfaces (APIs) provided by the SDK.
The inventors have found that current solutions have at least the following problem: because the voice SDK access scheme is designed independently of the game sound effects, the voice logic and processing are detached from the game scene, so the player is provided with a uniform voice experience, which limits the voice experience. No effective solution to this problem has yet been proposed.
Disclosure of Invention
The embodiment of the application provides an audio data processing method, a related device, equipment and a storage medium. By packaging the voice component as a plug-in and introducing it into the audio engine local to the terminal, a tool for designing immersive voice is provided to developers, thereby realizing an immersive voice effect that follows scene changes.
In view of the foregoing, an aspect of the present application provides an audio data processing method, applied to a first terminal, including:
Acquiring first voice data from a first object;
Responding to the first voice data, acquiring scene association information of a first virtual object aiming at a virtual scene, wherein the first virtual object is a virtual object controlled by the first object, and the scene association information comprises at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene;
acquiring first adjustment data for the first voice data according to the scene association information, wherein the first adjustment data comprises at least one of audio attribute parameters and accompanying audio data, the audio attribute parameters are used for adjusting the playing effect of the voice data, and the accompanying audio data are used for providing at least one of background music and background sound effects;
processing the first voice data by adopting the first adjustment data to obtain first audio data;
and playing the first audio data.
Another aspect of the present application provides an audio data processing apparatus applied to a first terminal, the audio data processing apparatus comprising:
an acquisition module for acquiring first voice data derived from a first object;
The acquisition module is further used for responding to the first voice data and acquiring scene association information of the first virtual object for the virtual scene, wherein the first virtual object is a virtual object controlled by the first object, and the scene association information comprises at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene;
The acquisition module is further used for acquiring first adjustment data for the first voice data according to the scene association information, wherein the first adjustment data comprises at least one of audio attribute parameters and accompanying audio data, the audio attribute parameters are used for adjusting the playing effect of the voice data, and the accompanying audio data are used for providing at least one of background music and background sound effects;
the processing module is used for processing the first voice data by adopting the first adjustment data to obtain first audio data;
and the playing module is used for playing the first audio data.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The acquisition module is specifically used for acquiring original voice data from the first object through the audio input equipment;
Performing voice preprocessing on the original voice data to obtain first processed voice data, wherein the voice preprocessing comprises at least one of noise reduction processing, echo cancellation processing, volume balancing processing and howling suppression processing;
and performing audio frame conversion processing on the first processed voice data to obtain the first voice data, wherein the audio frame conversion processing comprises at least one of sampling rate conversion processing, sound channel conversion processing, sampling depth conversion processing and audio frame length conversion processing.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The acquisition module is specifically configured to perform audio frame conversion processing on the first processed voice data to obtain a first audio frame sequence, where the first audio frame sequence includes at least one first audio frame;
Sampling the first audio frame sequence to obtain a first sampling point data set of each first audio frame, wherein the first sampling point data set comprises M sampling point data, and M is an integer greater than 1;
Writing a first set of sample point data for each first audio frame into a ring buffer;
Reading at least one second sampling point data set from the ring buffer, wherein each second sampling point data set comprises N sampling point data, and N is an integer greater than 1;
First speech data is generated from at least one second set of sample point data, wherein the first speech data comprises at least one second audio frame, each second audio frame corresponding to one of the second set of sample point data.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the scene-related information includes location information, and the first adjustment data includes audio attribute parameters;
The acquisition module is specifically used for determining the space type of the first virtual object in the virtual scene according to the position information;
if the space type belongs to a first space type, determining that the audio attribute parameters for the first voice data comprise reverberation adjustment parameters, wherein the first space type represents a virtual space which is smaller than or equal to a space size threshold in the virtual scene;
If the spatial type belongs to a second spatial type, determining that the audio attribute parameters for the first voice data comprise echo adjustment parameters, wherein the second spatial type represents a virtual space in the virtual scene that is greater than a spatial size threshold.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the scene-related information includes environmental information, and the first adjustment data includes accompanying audio data;
the acquisition module is specifically configured to determine that the accompanying audio data for the first voice data includes audio data recorded based on the indoor environment if the environment information indicates that the first virtual object is in the indoor environment;
if the environment information indicates that the first virtual object is in the outdoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the outdoor environment;
if the environment information indicates that the first virtual object is in a weather environment, determining that the accompanying audio data for the first voice data includes audio data recorded based on the weather environment.
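By way of a non-limiting illustration only, the mapping from environment information to accompanying audio described above might be sketched in C++ as follows; the enumeration values and asset names are hypothetical assumptions and not part of the disclosure.

#include <string>

// Hypothetical environment categories derived from the scene association information.
enum class Environment { Indoor, Outdoor, Weather };

// Selects the accompanying audio recorded for the given environment.
// The asset names are placeholders; a real project would reference audio-engine events instead.
std::string SelectAccompanyingAudio(Environment env) {
    switch (env) {
        case Environment::Indoor:  return "ambience_indoor_room";
        case Environment::Outdoor: return "ambience_outdoor_field";
        case Environment::Weather: return "ambience_rain_and_wind";
    }
    return {};
}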
In one possible design, in another implementation of another aspect of the embodiments of the present application, the scene-related information includes mood information, and the first adjustment data includes audio attribute parameters;
The acquisition module is specifically used for determining, if the emotion information indicates that the first virtual object is in a first emotion state, that the audio attribute parameters for the first voice data comprise at least one of a time domain adjustment parameter, a first frequency domain adjustment parameter and a pitch-raising adjustment parameter, wherein the time domain adjustment parameter is used for emphasizing the voice onset (sound head), and the first frequency domain adjustment parameter is used for enhancing the high-frequency components of the voice;
if the emotion information indicates that the first virtual object is in a second emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a speech rate adjustment parameter and a period adjustment parameter, wherein the speech rate adjustment parameter is used for slowing down the speech rate, and the period adjustment parameter is used for periodically changing the intonation;
if the emotion information indicates that the first virtual object is in a third emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a second frequency domain adjustment parameter and a pitch-lowering adjustment parameter, wherein the second frequency domain adjustment parameter is used for enhancing the low-frequency components of the voice data.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the scene-related information includes character information, and the first adjustment data includes audio attribute parameters;
the obtaining module is specifically configured to determine that the audio attribute parameters for the first voice data include a pitch adjustment parameter, a timbre adjustment parameter, and a formant adjustment parameter if the character information indicates that the first virtual object belongs to the target character type.
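A minimal sketch of how the emotion and role mappings above might be expressed is given below; the field names, emotion-state labels and numeric values are illustrative assumptions rather than the claimed parameters.

// Hypothetical parameter set; fields mirror the adjustment parameters described above.
struct AudioAttributeParams {
    bool  boostOnset          = false;  // time domain: emphasize the voice onset (sound head)
    float highFreqGainDb      = 0.f;    // first frequency domain adjustment
    float lowFreqGainDb       = 0.f;    // second frequency domain adjustment
    float pitchShiftSemitones = 0.f;    // positive raises pitch, negative lowers pitch
    float speechRateScale     = 1.f;    // values below 1 slow the speech rate
    bool  periodicIntonation  = false;  // periodically change the intonation
    bool  formantShift        = false;  // pitch/timbre/formant adjustment for the target role type
};

enum class EmotionState { First, Second, Third };

AudioAttributeParams ParamsFor(EmotionState emotion, bool isTargetRoleType) {
    AudioAttributeParams p;
    switch (emotion) {
        case EmotionState::First:  p.boostOnset = true;  p.highFreqGainDb = 3.f; p.pitchShiftSemitones = 2.f;  break;
        case EmotionState::Second: p.speechRateScale = 0.9f; p.periodicIntonation = true;                      break;
        case EmotionState::Third:  p.lowFreqGainDb = 3.f; p.pitchShiftSemitones = -2.f;                        break;
    }
    if (isTargetRoleType) p.formantShift = true;
    return p;
}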
In one possible design, in another implementation of another aspect of the embodiments of the present application, the first adjustment data includes audio attribute parameters and accompanying audio data;
the processing module is specifically used for adjusting the first voice data by adopting the audio attribute parameters to obtain target voice data;
Performing voice post-processing on the target voice data to obtain second processed voice data, wherein the voice post-processing comprises at least one of voice enhancement processing, frequency band gain processing and tuning processing;
And superposing the accompanying audio data and the second processed voice data to obtain first audio data.
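For illustration only, the adjust / post-process / superpose sequence described above might be sketched as follows; the helper functions are placeholders for the audio-engine effect chain and are not part of the disclosure.

#include <algorithm>
#include <vector>

using AudioBlock = std::vector<float>;  // one block of 32-bit float samples

// Placeholder stages standing in for the audio-engine effect chain.
AudioBlock ApplyAttributeParams(const AudioBlock& voice) { return voice; }  // audio attribute parameters
AudioBlock PostProcess(const AudioBlock& voice)          { return voice; }  // enhancement / band gain / tuning

// Superposes the accompanying audio on the post-processed voice (simple additive mix with clipping).
AudioBlock Superpose(const AudioBlock& voice, const AudioBlock& accompaniment) {
    AudioBlock out(voice.size());
    for (size_t i = 0; i < out.size(); ++i) {
        float acc = i < accompaniment.size() ? accompaniment[i] : 0.f;
        out[i] = std::clamp(voice[i] + acc, -1.f, 1.f);
    }
    return out;
}

AudioBlock ProcessFirstVoiceData(const AudioBlock& firstVoiceData, const AudioBlock& accompanyingAudio) {
    AudioBlock target = ApplyAttributeParams(firstVoiceData);  // target voice data
    AudioBlock post   = PostProcess(target);                   // second processed voice data
    return Superpose(post, accompanyingAudio);                 // first audio data
}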
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The playing module is specifically configured to respond to an audio playing operation, and play the first audio data through the audio output device.
Or alternatively
The playing module is specifically configured to send the first audio data to the second terminal, so that the second terminal plays the first audio data through the audio output device.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The playing module is specifically used for acquiring the voice scene type;
If the voice scene type belongs to the first scene type, sending first audio data to a second terminal, wherein the first audio data are monophonic audio data, and the first audio data adopt a first sampling rate;
And if the voice scene type belongs to the second scene type, sending first audio data to the second terminal, wherein the first audio data are of stereo channels, and the first audio data adopt a second sampling rate which is higher than the first sampling rate.
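A hedged sketch of how the transmission profile could depend on the voice scene type is shown below; the concrete sample rates (16 kHz and 48 kHz) are assumptions chosen only so that the second sampling rate is higher than the first.

// Hypothetical transmission profile.
struct TransmitProfile {
    int sampleRateHz;
    int channels;  // 1 = mono, 2 = stereo
};

enum class VoiceSceneType { FirstSceneType, SecondSceneType };

TransmitProfile ProfileFor(VoiceSceneType type) {
    // First scene type: mono audio data at the (lower) first sampling rate.
    if (type == VoiceSceneType::FirstSceneType) return {16000, 1};
    // Second scene type: stereo audio data at the (higher) second sampling rate.
    return {48000, 2};
}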
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The playing module is specifically configured to perform frame-splitting processing on the first audio data to obtain a second audio frame sequence, where the second audio frame sequence includes at least one third audio frame;
sampling the second audio frame sequence to obtain a third sampling point data set of each third audio frame, wherein the third sampling point data set comprises P sampling point data, and P is an integer greater than 1;
Writing a third sample point data set of each third audio frame into the ring buffer;
Reading at least one fourth sampling point data set from the ring buffer, wherein each fourth sampling point data set comprises Q sampling point data, and Q is an integer greater than 1;
and performing audio frame conversion processing on at least one fourth sampling point data set, and sending the processed first audio data to the second terminal, wherein the first audio data comprises at least one fourth audio frame, and each fourth audio frame corresponds to one fourth sampling point data set.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the audio data processing apparatus further includes a receiving module;
The receiving module is used for receiving second audio data sent by the second terminal, wherein the second audio data is obtained after the second terminal processes second voice data by adopting second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object aiming at a virtual scene, and the second virtual object is a virtual object controlled by the second object;
And the playing module is also used for playing the second audio data.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The playing module is specifically configured to perform frame-splitting processing on the second audio data to obtain a third audio frame sequence, where the third audio frame sequence includes at least one fifth audio frame;
sampling the third audio frame sequence to obtain a fifth sampling point data set of each fifth audio frame, wherein the fifth sampling point data set comprises X sampling point data, and X is an integer greater than 1;
writing a fifth set of sample point data for each fifth audio frame into the ring buffer;
reading at least one sixth sampling point data set from the ring buffer, wherein each sixth sampling point data set comprises Y sampling point data, and Y is an integer greater than 1;
And performing audio frame conversion processing on at least one sixth sampling point data set, and playing the processed second audio data, wherein the second audio data comprises at least one sixth audio frame, and each sixth audio frame corresponds to one sixth sampling point data set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The receiving module is further configured to receive second audio data sent by the second terminal, where the second audio data is obtained after the second terminal processes second voice data with second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object for a virtual scene, and the second virtual object is a virtual object controlled by the second object;
The acquisition module is also used for responding to the second audio data and acquiring object association information according to the position information of the first virtual object and the second virtual object in the virtual scene;
The acquisition module is also used for acquiring third adjustment data aiming at the second audio data according to the object association information;
The processing module is also used for processing the second audio data by adopting third adjustment data to obtain third audio data;
and the playing module is also used for playing the third audio data.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The playing module is specifically configured to perform frame-splitting processing on the third audio data to obtain a fourth audio frame sequence, where the fourth audio frame sequence includes at least one seventh audio frame;
Sampling the fourth audio frame sequence to obtain a seventh sampling point data set of each seventh audio frame, wherein the seventh sampling point data set comprises S sampling point data, and S is an integer greater than 1;
writing a seventh set of sample point data for each seventh audio frame into the ring buffer;
reading at least one eighth sampling point data set from the ring buffer, wherein each eighth sampling point data set comprises R sampling point data, and R is an integer greater than 1;
And performing audio frame conversion processing on at least one eighth sampling point data set, and playing the processed third audio data, wherein the third audio data comprises at least one eighth audio frame, and each eighth audio frame corresponds to one eighth sampling point data set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
The receiving module is specifically configured to receive the second audio data sent by the second terminal based on a data transmission policy, wherein the data transmission policy includes at least one of a forward error correction policy, a packet loss compensation policy, an automatic retransmission policy and a buffer anti-jitter policy; the forward error correction policy is used for correcting bit errors occurring in the data transmission process, the packet loss compensation policy is used for compensating missing audio frames, the automatic retransmission policy is used for requesting retransmission of audio frames, and the buffer anti-jitter policy is used for adopting a dynamic buffer in the data transmission process.
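Purely as an illustration, such a data transmission policy could be described by a configuration structure like the following; the field names and default values are assumptions and not part of the disclosure.

// Sketch of a data transmission policy configuration.
struct DataTransmissionPolicy {
    bool forwardErrorCorrection  = true;  // correct bit errors occurring during transmission
    bool packetLossCompensation  = true;  // conceal audio frames missing at the receiver
    bool automaticRetransmission = true;  // request retransmission of lost audio frames
    bool bufferAntiJitter        = true;  // use a dynamic buffer during data transmission
    int  maxJitterBufferMs       = 200;   // illustrative upper bound for the dynamic buffer
};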
Another aspect of the application provides a computer device comprising a memory storing a computer program and a processor implementing the methods of the above aspects when the processor executes the computer program.
Another aspect of the application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the method of the above aspects.
In another aspect of the application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the methods of the above aspects.
From the above technical solutions, the embodiment of the present application has the following advantages:
In an embodiment of the present application, an audio data processing method is provided: in response to acquired first voice data, scene association information of a first virtual object for a virtual scene is acquired, where the first virtual object is a virtual object controlled by the first object, and the scene association information includes at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene. Then, first adjustment data for the first voice data is acquired according to the scene association information, and the first voice data is processed using the first adjustment data to obtain first audio data, so that playback of the first audio data can be achieved. In this way, the voice component is packaged as a plug-in and introduced into the audio engine local to the terminal, and voice data can be sent directly into the audio engine for processing. That is, according to the scene association information of the virtual object in the virtual scene, adjustment data conforming to the current scene effect is generated, and the real-time voice data is processed with the adjustment data, achieving rich sound processing effects and realizing an immersive voice effect that follows scene changes.
Drawings
FIG. 1 is a schematic diagram of an implementation environment of an audio data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of another embodiment of an audio data processing method according to an embodiment of the present application;
FIG. 3 is a flow chart of an audio data processing method according to an embodiment of the application;
FIG. 4 is a schematic diagram of an architecture for interaction of speech data with an audio engine according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another architecture for interaction of speech data with an audio engine according to an embodiment of the present application;
FIG. 6 is a schematic workflow diagram of an acquisition plug-in accordance with an embodiment of the present application;
FIG. 7 is a schematic workflow diagram of a send plug-in accordance with an embodiment of the present application;
FIG. 8 is a schematic workflow diagram of a receiving plug-in accordance with an embodiment of the present application;
FIG. 9 is a schematic workflow diagram of a play plug-in accordance with an embodiment of the present application;
FIG. 10 is a schematic diagram of an embodiment of the application for implementing end-to-end speech processing based on an audio engine;
FIG. 11 is a schematic diagram of an audio data processing device according to an embodiment of the present application;
Fig. 12 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides an audio data processing method, a related device, equipment and a storage medium. By packaging the voice component as a plug-in and introducing it into the audio engine local to the terminal, a tool for designing immersive voice is provided to developers, thereby realizing an immersive voice effect that follows scene changes.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
In order to add voice capability to an application (APP), it is currently common practice in the industry for a developer to access an SDK provided by a voice PaaS server, and to implement various voice service scenarios, such as channel voice for game teams, range voice between different teams, voice blacklists and whitelists, and the like, through the basic APIs provided by the SDK. In addition to these basic voice function APIs, some advanced APIs are provided for other services, such as voice chat in three-dimensional space and voice-changing capabilities. It can be seen that the voice experience offered in the APP depends on the SDK's capabilities; however, the traditional voice SDK access procedure is independent of the APP's sound-effect design, so the professional designs and controls used in sound-effect design cannot be applied to voice processing.
Based on this, the application provides an audio data processing method in which the voice data stream is sent directly into an audio engine for processing, so that it is fused into the audio design of the APP as a special sound effect. Developers, audio engineers and the like of the APP can thereby combine rich sound-processing effects and sound controls with real-time voice to realize an immersive voice effect that changes with the virtual scene. Generally, a developer already needs an audio engine to design the sound effects in an APP and is familiar with its operation, so with a plug-in scheme based on the audio engine, the voice design flow can be similar to the sound-effect design flow of other APPs. Compared with the traditional scheme of accessing a voice SDK, the audio engine plug-in scheme provided by the application does not require developers to invest extra time in learning the voice SDK, which saves the workload of accessing voice. At the same time, the audio engine plug-in scheme provided by the application supports the immersive design of speech, i.e., the rich capabilities of the audio engine become available for processing speech. The audio engine is also a sound tool commonly used by game developers, thus giving developers more room for voice design.
The audio data processing method provided by the application comprises at least one of the following scenes when applied.
1. Recording a scene;
In some interactive games, a user may converse with a non-player character (NPC) and record the game process for subsequent viewing, sharing, and the like. Illustratively, the user records voice data through the terminal, namely the data of the user's dialogue with the NPC in a virtual scene (e.g., a game scene) on behalf of the virtual object. Based on this, the terminal calls an audio engine, acquires corresponding adjustment data based on the scene association information of the virtual scene, and processes the recorded voice data with the adjustment data to obtain audio data.
2. Dubbing scenes;
Because sound signals can effectively supplement and enhance what people perceive visually, the audio data processing method provided by the application can be applied to dubbing animated films, producing computer game sound effects, producing virtual reality sound effects, and the like. Illustratively, the user records voice data through the terminal, namely the data dubbed by the user for a virtual object (e.g., an animated character) in a virtual scene (e.g., an animation scene). Based on this, the terminal calls an audio engine, acquires corresponding adjustment data based on the scene association information of the virtual scene, and processes the recorded voice data with the adjustment data to obtain audio data.
3. A game voice scene;
In connection with the design of an audio engineer, in a game application that integrates the audio engine plug-in scheme, the mutual voices between users can be processed as the virtual scene of the game changes. Beyond three-dimensional spatial processing, when the virtual objects controlled by users converse in different game scenes, different reverberation effects can be generated, for example, in an open valley or a narrow room, so that the voice effect perceived by the user matches the scene. For another example, if two virtual characters controlled by conversing users are located in two different rooms in the virtual scene, the users can also perceive the change of sound when a door in the virtual scene is opened or closed. It can be seen that the voice effect is adapted to the virtual scene in the game, achieving a metaverse-like effect.
It should be noted that the above application scenario is only an example, and the audio data processing method provided in this embodiment may also be applied to other scenarios, which is not limited herein.
It will be appreciated that the present application may relate to cloud gaming, which may also be referred to as gaming on demand, an online gaming technology based on cloud computing. Cloud gaming technology enables lightweight devices (thin clients) with relatively limited graphics processing and data computing capabilities to run high-quality games. In a cloud game scene, the game does not run on the player's game terminal but on a cloud server; the cloud server renders the game scene into video and audio streams and transmits them to the player's game terminal through the network. The player's game terminal does not need strong graphics and data processing capabilities; it only needs basic streaming media playback capability and the ability to acquire the player's input instructions and send them to the cloud server.
The method provided by the application can be applied to the implementation environment shown in fig. 1 or fig. 2. The implementation environment shown in fig. 1 comprises a first terminal 110, in which an audio engine is deployed. The implementation environment shown in fig. 2 includes a first terminal 110, a server 120, a communication network 130, and a second terminal 140, wherein the first terminal 110 and the server 120 are connected through the communication network 130, and the second terminal 140 and the server 120 are connected through the communication network 130. The communication network 130 uses standard communication techniques and/or protocols and is typically the Internet, but may be any network, including but not limited to Bluetooth, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile network, a private network, or any combination of virtual private networks. In some embodiments, custom or dedicated data communication techniques may be used in place of, or in addition to, the data communication techniques described above.
Terminals to which the present application relates (i.e., the first terminal 110 and the second terminal 140) include, but are not limited to, mobile phones, computers, game consoles, intelligent voice interaction devices, intelligent home appliances, vehicle terminals, aircraft, and the like. A client is deployed on the terminal; the client may run on the terminal in a browser mode, as an independent APP, or the like. The server 120 according to the present application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) services, and basic cloud computing services such as big data and artificial intelligence platforms.
In connection with the implementation environment shown in fig. 1, user a records first voice data through first terminal 110. The first terminal 110 determines first adjustment data according to scene association information of the first virtual object for the virtual scene. Wherein the first virtual object is a virtual object controlled by user a. Based on this, the first terminal 110 processes the first voice data with the first adjustment data through the audio engine, to obtain first audio data. The first audio data is played by the first terminal 110.
In connection with the implementation shown in fig. 2, user a records first voice data through first terminal 110. The first terminal 110 determines first adjustment data according to scene association information of the first virtual object for the virtual scene. Wherein the first virtual object is a virtual object controlled by user a. Based on this, the first terminal 110 processes the first voice data with the first adjustment data through the audio engine, to obtain first audio data. The first terminal 110 transmits the first audio data to the server 120 through the communication network 130, and the server 120 transmits the first audio data to the second terminal 140 through the communication network 130. Then, the user B can hear the first audio data played by the second terminal 140.
Similarly, the user B records second voice data through the second terminal 140. The second terminal 140 determines second adjustment data according to scene association information of the second virtual object for the virtual scene. Wherein the second virtual object is a virtual object controlled by user B. Based on this, the second terminal 140 processes the second voice data with the second adjustment data through the audio engine to obtain second audio data. The second terminal 140 transmits the second audio data to the server 120 through the communication network 130, and the server 120 transmits the second audio data to the first terminal 110 through the communication network 130. Thus, the user a can hear the second audio data played by the first terminal 110.
It can be appreciated that the first terminal 110 processes the first voice data using the audio engine based on a relative condition (e.g., a relative position, a direction, whether there is an obstacle blocking, etc.) between the virtual object controlled by the user a and the virtual object controlled by the user B. Thus, the user B can hear the corresponding first audio data. Similarly, the second terminal 140 processes the second voice data using an audio engine. Thus, the user a can hear the corresponding second audio data.
In view of the fact that the present application relates to a number of terms related to the technical field, the following explanation will be made for ease of understanding.
(1) Game multimedia engine (GME): a voice communication PaaS service. The product exists in the form of an SDK; a developer needs to integrate the GME into the APP code so that the APP can realize voice-related capabilities.
(2) Audio engine: software used by audio effect engineers for designing and creating audio. Using an audio engine greatly accelerates the design of sound in an APP and improves the professionalism and creativity of audio creation.
(3) Real-time communication (RTC): in the present application, this refers to real-time voice communication technology.
(4) Platform as a service (PaaS): in the present application, this relates to the real-time voice service built on the voice server platform.
(5) In response to: representing conditions or states upon which an operation is performed, one or more operations may be performed when certain conditions or states are met. These operations may be real-time or with some delay.
(6) Adaptive echo cancellation (AEC): an adaptive filter is used to identify the parameters of the unknown echo channel, and a far-end signal model is established to simulate the echo path according to the correlation between the loudspeaker signal and the multipath echoes it generates. An adaptive algorithm adjusts the filter so that its impulse response approximates the real echo path; the estimated echo is then subtracted from the signal received by the microphone to realize echo cancellation.
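As a rough, non-authoritative illustration of the adaptive-filter idea (not the specific AEC algorithm used here), a single-channel normalized-LMS echo canceller might be sketched as follows; filter length and step size are arbitrary assumptions.

#include <vector>

// Minimal NLMS echo canceller: estimates the echo of the far-end (loudspeaker) signal
// and subtracts it from the microphone signal.
class NlmsEchoCanceller {
public:
    explicit NlmsEchoCanceller(size_t taps = 256, float mu = 0.5f)
        : weights_(taps, 0.f), history_(taps, 0.f), mu_(mu) {}

    // farEnd: sample sent to the loudspeaker; mic: sample captured by the microphone.
    // Returns the echo-cancelled sample.
    float Process(float farEnd, float mic) {
        // Shift the far-end history and insert the newest sample.
        for (size_t i = history_.size() - 1; i > 0; --i) history_[i] = history_[i - 1];
        history_[0] = farEnd;

        // Estimated echo = dot product of filter weights and far-end history.
        float echo = 0.f, power = 1e-6f;
        for (size_t i = 0; i < weights_.size(); ++i) {
            echo  += weights_[i] * history_[i];
            power += history_[i] * history_[i];
        }
        float error = mic - echo;  // residual after cancellation

        // NLMS update: adjust the weights toward the true echo path.
        float step = mu_ * error / power;
        for (size_t i = 0; i < weights_.size(); ++i) weights_[i] += step * history_[i];
        return error;
    }

private:
    std::vector<float> weights_;
    std::vector<float> history_;
    float mu_;
};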
(7) Noise suppression (NS): noise reduction achieved through filtering operations.
(8) Automatic gain control (AGC): the output signal is conditioned by an effective combination of linear amplification and compression amplification. When a weak signal is input, the linear amplification circuit works to ensure the strength of the output signal; when the input signal reaches a certain strength, the compression amplification circuit is enabled to reduce the output amplitude. That is, the AGC function automatically controls the gain by changing the input-output compression ratio.
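A toy, sample-by-sample sketch of the linear-plus-compression behaviour described above is shown below; the gain, threshold and ratio values are arbitrary assumptions for illustration only.

#include <cmath>

// Linear gain for weak input, soft compression above a threshold.
float AgcSample(float in, float linearGain = 4.f, float threshold = 0.5f, float ratio = 4.f) {
    float amplified = in * linearGain;
    float mag = std::fabs(amplified);
    if (mag <= threshold) return amplified;                     // weak signal: plain linear amplification
    float compressed = threshold + (mag - threshold) / ratio;   // strong signal: reduce output growth
    return amplified < 0.f ? -compressed : compressed;
}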
(9) Voice activity detection (VAD): the purpose is to identify and eliminate long silence periods from the voice signal stream, so as to save channel resources without reducing service quality and to reduce the end-to-end delay perceived by users.
(10) Discontinuous transmission (DTX): when silence is detected during transmission, the code rate is automatically reduced to save bandwidth.
With reference to fig. 3, the audio data processing method in the embodiment of the present application may be independently completed by a terminal, or may be completed by cooperation of the terminal and a server, and the method provided by the present application includes:
210. acquiring first voice data from a first object;
In one or more embodiments, a first terminal collects speech originating from a first object through an audio input device, thereby obtaining first speech data. The voice of the first object may refer to voice of a user call or recorded voice.
It will be appreciated that the audio input device may be internal or external to the first terminal. Audio input devices include, but are not limited to, a microphone, a musical instrument digital interface (MIDI) keyboard, or another digital musical instrument, etc. The first object may be a user of the APP, a game player, or the like.
220. Responding to the first voice data, acquiring scene association information of a first virtual object aiming at a virtual scene, wherein the first virtual object is a virtual object controlled by the first object, and the scene association information comprises at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene;
In one or more embodiments, a first terminal responds to acquired first voice data and acquires scene association information of a first virtual object controlled by a first object in a virtual scene. The scene association information is used for describing the state of the first virtual object in the virtual scene and comprises at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene. In addition, the scene association information may also include a relative relationship between the first virtual object and other virtual objects (e.g., the second virtual object), such as a direction, a distance, and whether a barrier is present.
230. Acquiring first adjustment data for the first voice data according to the scene association information, wherein the first adjustment data comprises at least one of audio attribute parameters and accompanying audio data, the audio attribute parameters are used for adjusting the playing effect of the voice data, and the accompanying audio data are used for providing at least one of background music and background sound effects;
In one or more embodiments, first adjustment data is generated in connection with scene-related information, wherein the first adjustment data includes at least one of audio attribute parameters and companion audio data. The audio attribute parameter is used to adjust the first voice data. The companion audio data is used to provide background music and/or background sound effects.
240. Processing the first voice data by adopting the first adjustment data to obtain first audio data;
In one or more embodiments, the first terminal invokes a local audio engine, and processes the first voice data using the first adjustment data to obtain first audio data. Taking the game application as an example, the GME is integrated in a plug-in form in an audio engine, which is integrated in the game application.
250. And playing the first audio data is realized.
In one or more embodiments, in a first case, the first terminal may directly play the first audio data after generating it. In a second case, after the first terminal generates the first audio data, it may send the first audio data to the server, and the server sends it to at least one second terminal; these second terminals may play the first audio data directly, or play it after further processing.
Specifically, taking the first case as an example, please refer to fig. 4, which is a schematic diagram of an architecture for interaction between voice data and the audio engine in an embodiment of the present application. As shown in the figure, the voice acquisition module in the first terminal transmits the acquired voice to the audio engine, and after processing by the audio engine, the voice is transmitted to the voice playing module in the first terminal. Taking the second case as an example, please refer to fig. 5, which is another schematic diagram of interaction between voice data and the audio engine. As shown in the figure, the voice acquisition module in the first terminal transmits the acquired voice to the audio engine; after processing by the audio engine, the voice is transmitted to the voice sending module in the first terminal. The voice sending module sends the processed audio data to the server, the server sends the audio data to the voice receiving module of the second terminal, and on this basis, the voice playing module of the second terminal plays the audio data.
The embodiment of the application provides an audio data processing method. In this way, the voice component is packaged as a plug-in and introduced into the audio engine local to the terminal, and voice data can be sent directly into the audio engine for processing. That is, according to the scene association information of the virtual object in the virtual scene, adjustment data conforming to the current scene effect is generated. The real-time voice data is processed with the adjustment data to achieve rich sound processing effects and realize an immersive voice effect that follows scene changes.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, acquiring first voice data from a first object may specifically include:
collecting original voice data from a first object through an audio input device;
Performing voice preprocessing on the original voice data to obtain first processed voice data, wherein the voice preprocessing comprises at least one of noise reduction processing, echo cancellation processing, volume balancing processing and howling suppression processing;
and performing audio frame conversion processing on the first processed voice data to obtain the first voice data, wherein the audio frame conversion processing comprises at least one of sampling rate conversion processing, sound channel conversion processing, sampling depth conversion processing and audio frame length conversion processing.
In one or more embodiments, a manner of processing speech data is presented. As can be seen from the foregoing embodiments, the first terminal obtains, through an application programming interface (API), the local voice of the first object collected by the audio input device, that is, obtains the original voice data. The first terminal then performs voice preprocessing and audio frame conversion processing on the original voice data, thereby obtaining the first voice data.
Specifically, the noise reduction process in the pre-speech process is mainly used to remove noise in the original speech data.
The echo cancellation process in the pre-speech process is mainly used to remove echoes from the original speech data. The echo cancellation processing algorithm needs to take the sound played by the current loudspeaker as a reference signal, so the pre-processing module needs to exchange data with the playing plug-in. Because the playback plug-in acts at the end of the audio bus, the audio signal information to be played back is available, thereby providing a reference playback signal for the echo cancellation processing algorithm.
The volume equalization process in the pre-speech process is used to make an adjustment of the desired volume loudness to the original speech data.
The howling suppression processing in the pre-speech processing is mainly used to suppress howling in the original speech data. The howling suppression processing method mainly comprises a frequency shift phase shift method, a notch suppression method and an adaptive feedback cancellation method.
Based on this, the first processed voice data can be obtained after voice preprocessing is performed on the original voice data. It is then necessary to further perform audio frame conversion processing on the first processed voice data to obtain the first voice data. Illustratively, the voice parameters of the first processed voice data may include: a sampling rate of 16 kilohertz (kHz), mono, a sampling depth of 16-bit fixed point per sample (bit-int), and a frame length of 20 milliseconds (ms). Illustratively, the voice parameters of the first voice data may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32-bit floating point per sample (bit-float), and a frame length of 21.3 ms.
It should be noted that, the sampling rate of the first voice data may be 44.1kHz, or 96kHz may be used to achieve the high fidelity effect, and the above example is described by taking the sample rate of 48kHz as an example, which should not be construed as limiting the present application.
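To make the conversion concrete, the following sketch converts one 16 kHz / mono / 16 bit-int frame into a 48 kHz / stereo / 32 bit-float frame using naive linear interpolation; a real audio engine would use a proper polyphase resampler, so this is illustrative only.

#include <cstdint>
#include <vector>

// Converts a 16 kHz mono 16-bit integer frame into a 48 kHz stereo 32-bit float frame.
std::vector<float> ConvertFrame(const std::vector<int16_t>& in16kMono) {
    const size_t outFrames = in16kMono.size() * 3;           // 16 kHz -> 48 kHz (factor of 3)
    std::vector<float> out(outFrames * 2);                   // interleaved stereo
    for (size_t o = 0; o < outFrames; ++o) {
        float srcPos = static_cast<float>(o) / 3.f;
        size_t i0 = static_cast<size_t>(srcPos);
        size_t i1 = (i0 + 1 < in16kMono.size()) ? i0 + 1 : i0;
        float frac = srcPos - static_cast<float>(i0);
        float s = (in16kMono[i0] * (1.f - frac) + in16kMono[i1] * frac) / 32768.f;  // 16-bit int -> float
        out[2 * o]     = s;  // left channel
        out[2 * o + 1] = s;  // right channel (duplicate of mono source)
    }
    return out;
}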
In a second embodiment of the present application, a method for processing voice data is provided. By the method, the collected voice data of the user is subjected to voice preprocessing, so that the processed voice data can represent the essential characteristics of voice, and a better audio processing effect is achieved.
Optionally, based on the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided by the present application performs audio frame conversion processing on the first processed voice data to obtain first voice data, which may specifically include:
Performing audio frame conversion processing on the first processed voice data to obtain a first audio frame sequence, wherein the first audio frame sequence comprises at least one first audio frame;
Sampling the first audio frame sequence to obtain a first sampling point data set of each first audio frame, wherein the first sampling point data set comprises M sampling point data, and M is an integer greater than 1;
Writing a first set of sample point data for each first audio frame into a ring buffer;
Reading at least one second sampling point data set from the ring buffer, wherein each second sampling point data set comprises N sampling point data, and N is an integer greater than 1;
First speech data is generated from at least one second set of sample point data, wherein the first speech data comprises at least one second audio frame, each second audio frame corresponding to one of the second set of sample point data.
In one or more embodiments, a method of obtaining the first voice data is provided. From the foregoing embodiments, it can be seen that the audio engine includes a capture plug-in, in which the first processed speech data output by the APP speech system (e.g., GME) can be subjected to audio frame conversion processing to obtain a first audio frame sequence including at least one first audio frame. Because the frame length used by the APP speech system differs from that of the audio engine, a first audio frame cannot be processed directly as one audio frame in the audio engine. Based on this, a ring buffer may be introduced for processing.
Specifically, for ease of understanding, referring to fig. 6, fig. 6 is a schematic workflow diagram of the acquisition plug-in in the embodiment of the present application. As shown in the figure, audio frame conversion processing is first performed on the first processed voice data (i.e., the audio data frames of the voice system) to obtain the first audio frame sequence, where the voice parameters of the first processed voice data may include: a sampling rate of 16 kHz, mono, a sampling depth of 16 bit-int, and a frame length of 20 ms. The speech parameters of the first audio frame sequence may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32 bit-float, and a first-audio-frame length of 20 ms. It will be appreciated that the APP speech system (e.g., GME) samples the first audio frame sequence to obtain a first sampling point data set for each first audio frame, and the first sampling point data set includes M sampling point data.
The present application is described by taking M as 960 and N as 1024 as examples, but this should not be construed as limiting the present application.
Based on this, the APP speech system (e.g., GME) plays the role of producer, writing the first sampling point data set of each first audio frame into the ring buffer; for example, the APP voice system writes 960 sampling point data at a time. The audio engine plays the role of consumer, reading at least one second sampling point data set from the ring buffer; for example, the audio engine reads 1024 sampling point data at a time. Thus, the sampling point data read each time form one second sampling point data set, and one second sampling point data set is regarded as one second audio frame, thereby obtaining the first voice data (i.e., the audio data frames of the audio engine) composed of the second audio frames. The voice parameters of the first voice data may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32 bit-float, and a second-audio-frame length of 21.3 ms.
It will be appreciated that the above-described speech parameters are illustrative and should not be construed as limiting the application.
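The producer/consumer exchange through the ring buffer can be sketched as follows; the capacity, the single-threaded design and the frame sizes (960 samples written, 1024 samples read) are simplifications for illustration and not the disclosed implementation.

#include <vector>

// Minimal ring buffer used to re-frame audio between the voice system and the audio engine.
class RingBuffer {
public:
    explicit RingBuffer(size_t capacity) : data_(capacity), head_(0), size_(0) {}

    // Producer: the voice system writes one frame of sampling point data.
    void Write(const std::vector<float>& frame) {
        for (float s : frame) {
            data_[(head_ + size_) % data_.size()] = s;
            if (size_ < data_.size()) ++size_; else head_ = (head_ + 1) % data_.size();  // overwrite oldest when full
        }
    }

    // Consumer: the audio engine reads one frame only when enough samples are buffered.
    bool Read(std::vector<float>& frame, size_t frameLen) {
        if (size_ < frameLen) return false;  // not enough data yet; try again on the next callback
        frame.resize(frameLen);
        for (size_t i = 0; i < frameLen; ++i) frame[i] = data_[(head_ + i) % data_.size()];
        head_ = (head_ + frameLen) % data_.size();
        size_ -= frameLen;
        return true;
    }

private:
    std::vector<float> data_;
    size_t head_, size_;
};

// Usage: the producer writes 960-sample frames, the consumer reads 1024-sample frames.
// RingBuffer rb(8192);
// rb.Write(frame960);                 // called each time the voice system produces a frame
// std::vector<float> frame1024;
// while (rb.Read(frame1024, 1024)) {  /* hand frame1024 to the audio engine */ }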
In another embodiment of the present application, a method for obtaining first voice data is provided. In this way, it is considered that the APP speech system and the audio engine are both used to process digital audio signals, and that the audio signal processing parameters processed by the two systems may be different. Therefore, in order to enable the two-part system to interact with audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of audio processing.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, the scene association information includes location information, and the first adjustment data includes an audio attribute parameter;
according to the scene association information, acquiring first adjustment data for the first voice data may specifically include:
Determining the space type of the first virtual object in the virtual scene according to the position information;
if the space type belongs to a first space type, determining that the audio attribute parameters for the first voice data comprise reverberation adjustment parameters, wherein the first space type represents a virtual space which is smaller than or equal to a space size threshold in the virtual scene;
If the spatial type belongs to a second spatial type, determining that the audio attribute parameters for the first voice data comprise echo adjustment parameters, wherein the second spatial type represents a virtual space in the virtual scene that is greater than a spatial size threshold.
In one or more embodiments, a manner of determining audio attribute parameters based on location information is presented. As can be seen from the foregoing embodiments, the scene association information includes position information of the first virtual character in the virtual scene, and corresponding audio attribute parameters can be generated based on the position information.
Specifically, the space type of the first virtual object in the virtual scene is determined according to the position information. Illustratively, if the first virtual object is in a "room" in the virtual scene, it is determined that the virtual space in which the first virtual object is located belongs to the first space type; based on this, the audio attribute parameters include reverberation adjustment parameters, and the reverberation effect is provided by the reverberation adjustment parameters. Illustratively, if the first virtual object is in a "valley" in the virtual scene, it is determined that the virtual space in which the first virtual object is located belongs to the second space type; based on this, the audio attribute parameters include echo adjustment parameters, and the echo effect is provided by the echo adjustment parameters.
The first space type is a virtual space less than or equal to the space size threshold, and the second space type is a virtual space greater than the space size threshold. Virtual spaces belonging to the first space type include, but are not limited to, cabins, rooms, classrooms, etc., and virtual spaces belonging to the second space type include, but are not limited to, halls, valleys, grasslands, etc.
In a virtual scene, in order to give the user a more immersive voice experience, such reverberation effects need to be simulated by means of digital signal processing (DSP) algorithms. In general, according to the physical principles of reverberation generation, the original voice signal can be processed in two ways, which are described separately below.
(1) One way is to model the space by means of an algorithm. For example, parameters such as delay, attenuation, and scattering are adjusted according to properties such as the space size and materials, so that the original signal is filtered to achieve the reverberation effects produced by spaces of different sizes and different textures. The essence of the method is to duplicate the original signal into multiple paths, filter each path, apply a delay and an attenuation to each path, and then superimpose the paths for playback. For a virtual space of the first space type, fewer reflection paths and shorter delays are needed, so filtering is achieved with the reverberation adjustment parameters, producing a warm and soft feel. For a virtual space of the second space type, more reflection paths and longer delays are needed, so filtering is achieved with the echo adjustment parameters, and the high-frequency components are appropriately boosted by the filtering, thereby producing the spacious, large-space echo effect.
(2) The other way is to use a filter measured from a real space. That is, a physical space similar to the virtual scene is constructed, and the impulse response (IR) of that space is measured, which is equivalent to calculating the transfer function of sound in the physical space from an original signal and a recorded signal. In the virtual scene, convolving the original voice with this transfer function reproduces the listening effect of the corresponding environment, thereby achieving an immersive communication experience. For a virtual space of the first space type, the parameters in the transfer function of the corresponding real space are taken as the reverberation adjustment parameters. For a virtual space of the second space type, the parameters in the transfer function of the corresponding real space are taken as the echo adjustment parameters.
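As a rough illustration of approach (1), the sketch below superimposes a few delayed and attenuated copies of the original signal; the tap delays and gains are arbitrary assumptions standing in for the reverberation and echo adjustment parameters, and a production implementation would instead use properly designed comb/all-pass filter networks or the measured impulse response of approach (2).

```python
def simple_reverb(signal, sample_rate, taps):
    """Superimpose delayed, attenuated copies of `signal`.

    `taps` is a list of (delay_seconds, gain) pairs standing in for the
    reverberation/echo adjustment parameters described above.
    """
    max_delay = max(d for d, _ in taps)
    out = signal + [0.0] * int(max_delay * sample_rate)
    for delay, gain in taps:
        offset = int(delay * sample_rate)
        for i, s in enumerate(signal):
            out[i + offset] += gain * s
    return out

if __name__ == "__main__":
    sr = 16000
    dry = [1.0] + [0.0] * (sr // 2)           # a single impulse as test input
    # Small space (first space type): short, quickly decaying reflections.
    room = simple_reverb(dry, sr, [(0.01, 0.6), (0.02, 0.4), (0.03, 0.25)])
    # Large space (second space type): longer delays and slower decay (echo-like).
    valley = simple_reverb(dry, sr, [(0.15, 0.7), (0.30, 0.5), (0.45, 0.35)])
    print(len(room), len(valley))
```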
In this embodiment of the present application, a method for determining audio attribute parameters based on position information is provided. By combining the position information of the virtual object in the virtual scene, the voice effect perceived by a user in a real environment can be simulated. Therefore, the method comes closer to how sound propagates in a real environment and provides the user with an immersive voice effect.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, the scene association information includes environmental information, and the first adjustment data includes accompanying audio data;
according to the scene association information, acquiring first adjustment data for the first voice data may specifically include:
If the environment information indicates that the first virtual object is in the indoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the indoor environment;
if the environment information indicates that the first virtual object is in the outdoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the outdoor environment;
if the environment information indicates that the first virtual object is in a weather environment, determining that the accompanying audio data for the first voice data includes audio data recorded based on the weather environment.
In one or more embodiments, a manner of determining accompanying audio data based on environmental information is presented. As can be seen from the foregoing embodiments, the scene association information includes environmental information of the first virtual object in the virtual scene, and corresponding accompanying audio data can be acquired based on the environmental information. Because the APP voice system (e.g., GME) interacts with the audio engine, the accompanying audio data can also increase the immersive feeling of voice chat. The sound that reaches a user's ear in the real world depends not only on the reverberation of the physical environment in which the user is located, but also on the surrounding sound environment.
The following will be described in connection with the environment types, respectively.
1. An indoor environment;
Specifically, if the first virtual object is in an indoor environment, accompanying audio data corresponding to the indoor environment needs to be superimposed onto the first voice data. The accompanying audio data includes audio data recorded based on an indoor environment. The indoor environment may be a subway station, a coffee shop, a bar, or the like.
2. An outdoor environment;
Specifically, if the first virtual object is in an outdoor environment, accompanying audio data corresponding to the outdoor environment needs to be superimposed onto the first voice data. The accompanying audio data includes audio data recorded based on an outdoor environment. The outdoor environment may be a station platform, a road, a park, or the like.
3. A weather environment;
Specifically, if the first virtual object is in a weather environment, accompanying audio data corresponding to the weather environment needs to be superimposed onto the first voice data. The accompanying audio data includes audio data recorded based on the weather environment. The weather environment may be rain, wind, lightning, or the like.
It can be appreciated that, in order to enable the user to experience an immersive effect in voice chat or voice recording in the virtual scene, pre-recorded accompanying audio data is superimposed to convey the indoor, outdoor, weather, and other factors of the environment where the user is located, so that the user, as a listener, can experience an immersive voice effect in the game.
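A minimal sketch of how environmental information might be mapped to pre-recorded accompanying audio is given below; the environment keys and asset file names are purely hypothetical placeholders and do not correspond to any real assets or API.

```python
# Hypothetical mapping from environment information to pre-recorded
# accompanying audio assets; the file names are placeholders only.
AMBIENCE_BY_ENVIRONMENT = {
    "indoor":  {"subway_station": "amb_subway.wav", "coffee_shop": "amb_cafe.wav"},
    "outdoor": {"platform": "amb_platform.wav", "park": "amb_park.wav"},
    "weather": {"rain": "amb_rain.wav", "wind": "amb_wind.wav"},
}

def select_accompanying_audio(environment_type, environment_detail):
    """Return the accompanying audio asset for the first virtual object's environment."""
    assets = AMBIENCE_BY_ENVIRONMENT.get(environment_type, {})
    return assets.get(environment_detail)

if __name__ == "__main__":
    print(select_accompanying_audio("weather", "rain"))   # -> amb_rain.wav
```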
In this embodiment of the present application, a method for determining accompanying audio data based on environmental information is provided. In this way, by combining the environmental information of the virtual object in the virtual scene, the voice effect perceived by a user in a real environment can be simulated. Therefore, the method comes closer to how sound propagates in a real environment and provides the user with an immersive voice effect.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, the scene association information includes emotion information, and the first adjustment data includes audio attribute parameters;
according to the scene association information, acquiring first adjustment data for the first voice data may specifically include:
If the emotion information indicates that the first virtual object is in a first emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a time domain adjustment parameter, a first frequency domain adjustment parameter and a pitch-up adjustment parameter, wherein the time domain adjustment parameter is used for boosting the onset of the voice, and the first frequency domain adjustment parameter is used for enhancing high frequency components of the voice;
if the emotion information indicates that the first virtual object is in a second emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a speech rate adjustment parameter and a period adjustment parameter, wherein the speech rate adjustment parameter is used for slowing down the speech rate, and the period adjustment parameter is used for periodically changing the intonation;
If the emotion information indicates that the first virtual object is in a third emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a second frequency domain adjustment parameter and a pitch-down adjustment parameter, wherein the second frequency domain adjustment parameter is used for enhancing low frequency components of the voice data.
In one or more embodiments, a manner of determining audio attribute parameters based on mood information is presented. As can be seen from the foregoing embodiments, the scene-related information includes emotion information of the first virtual character in the virtual scene, and the corresponding audio attribute parameter can be acquired based on the emotion information. When people speak in the real world, the same sentence can show different effects under different emotions. For the avatar, in order to enhance the voice immersion, these effects also need to be simulated by means of DSP.
The following will be presented in connection with emotion types, respectively.
1. A first emotional state;
In particular, the first emotional state is used to represent that the first virtual object is in an excited or elated emotional state. Illustratively, if the first virtual object picks up a precious item in the virtual scene, the first voice data is adjusted using at least one of the time domain adjustment parameter, the first frequency domain adjustment parameter, and the pitch-up adjustment parameter.
The time domain adjustment parameter is used to appropriately boost the onset of each word in the voice data. The first frequency domain adjustment parameter is used to perform high-pass filtering on the voice data, that is, to attenuate the low-frequency portion of the voice data and retain the high-frequency portion, so that the high-frequency components are emphasized in the frequency domain. The pitch-up adjustment parameter is used to raise the pitch of the voice data, that is, to achieve a higher tone.
2. A second emotional state;
In particular, the second emotional state is used to represent that the first virtual object is in an emotional state of fear or being hurt. Illustratively, if the first virtual object's health (blood volume) drops after being attacked in the virtual scene, the first voice data is adjusted using at least one of the speech rate adjustment parameter and the period adjustment parameter.
The speech rate adjustment parameter is used to appropriately slow down the voice data. The period adjustment parameter is used to produce a tremolo effect through a preset periodic change.
3. A third emotional state;
In particular, the third emotional state is used to represent that the first virtual object is in an angry or furious emotional state. Illustratively, if the first virtual object encounters a conflict (e.g., a collision) in the virtual scene, the first voice data is adjusted using at least one of the second frequency domain adjustment parameter and the pitch-down adjustment parameter.
The second frequency domain adjustment parameter is used to enhance the low-frequency part of the voice data to express a deep and powerful voice effect. The pitch-down adjustment parameter is used to lower the pitch of the voice data.
It should be noted that, considering the entertainment nature of the virtual scene, the sound designer may apply heavier processing to the voice data to express the user's state of mind, according to the playability requirements of the voice.
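The following sketch illustrates, in very simplified form, the kinds of emotion-dependent adjustments described above: high-frequency emphasis for the first emotional state, a tremolo-like periodic modulation for the second, and low-frequency emphasis for the third. The filter coefficients and modulation settings are assumptions, and pitch shifting and onset boosting are omitted for brevity.

```python
import math

def tremolo(signal, sample_rate, rate_hz=6.0, depth=0.5):
    """Periodic amplitude modulation, a simple stand-in for the period
    adjustment parameter used for the second emotional state."""
    out = []
    for n, s in enumerate(signal):
        mod = 1.0 - depth * 0.5 * (1.0 + math.sin(2 * math.pi * rate_hz * n / sample_rate))
        out.append(s * mod)
    return out

def one_pole_lowpass(signal, alpha=0.05):
    """Crude low-frequency emphasis (third emotional state); alpha is assumed."""
    out, y = [], 0.0
    for s in signal:
        y += alpha * (s - y)
        out.append(y)
    return out

def highpass_from_lowpass(signal, alpha=0.05):
    """Crude high-frequency emphasis (first emotional state): original minus low-pass."""
    low = one_pole_lowpass(signal, alpha)
    return [s - l for s, l in zip(signal, low)]

if __name__ == "__main__":
    sr = 16000
    tone = [math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
    excited = highpass_from_lowpass(tone)   # brighter, high-frequency emphasis
    afraid = tremolo(tone, sr)              # trembling voice
    angry = one_pole_lowpass(tone)          # darker, low-frequency emphasis
    print(len(excited), len(afraid), len(angry))
```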
In this embodiment of the present application, a method for determining audio attribute parameters based on emotion information is provided. By combining the emotion information of the virtual object in the virtual scene, the emotion of a user in a real environment can be simulated, reflecting the character and state of the virtual object. Therefore, the method comes closer to how sound is conveyed in a real environment and provides the user with an immersive voice effect.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, the scene association information includes role information, and the first adjustment data includes audio attribute parameters;
according to the scene association information, acquiring first adjustment data for the first voice data may specifically include:
If the character information indicates that the first virtual object belongs to the target character type, determining that the audio attribute parameters for the first voice data comprise a pitch adjustment parameter, a tone adjustment parameter and a formant adjustment parameter.
In one or more embodiments, a manner of determining audio attribute parameters based on role information is presented. As can be seen from the foregoing embodiments, the scene association information includes role information of the first virtual object in the virtual scene, and the corresponding audio attribute parameters can be obtained based on the role information. In some cases, the user may also dub the virtual object; a specific voice-changing effect can then be implemented based on voice-change parameters or an AI model, in combination with the target character type indicated by the role information (e.g., a girl or an elderly person).
Illustratively, the pitch adjustment parameter is used to raise or lower the pitch of the voice data, making the sound higher or lower. The lower the pitch adjustment parameter, the deeper the output sound, for example, the sound of a male bass. The higher the pitch adjustment parameter, the sharper the output sound, for example, the sound of an infant, a girl, a boy, or a woman.
Illustratively, the tone adjustment parameter is used to adjust the timbre of the voice data. The lower the tone adjustment parameter, the deeper and more muffled the output sound, for example, the sound of an elderly man. The higher the tone adjustment parameter, the brighter and crisper the output sound, for example, the sound of a young child.
Illustratively, the formant adjustment parameter is used to make the voice-changed output sound more natural; it adjusts the timbre of the voice data without affecting its other qualities. The lower the formant adjustment parameter, the lower the formants are shifted and the deeper the changed voice. The higher the formant adjustment parameter, the higher the formants are shifted and the sharper the changed voice.
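As a simplified illustration of pitch-based voice changing, the sketch below shifts pitch by naive resampling; it does not preserve duration or move formants independently, which a real voice changer (e.g., one based on a phase vocoder) would do, so it should be read only as a conceptual example with assumed parameter values.

```python
import math

def naive_pitch_shift(signal, factor):
    """Resample by `factor` (>1 raises pitch, <1 lowers it).

    This crude method also changes duration; real voice changers use
    time-scale modification or phase-vocoder techniques so that pitch and
    formants can be moved independently, as described above.
    """
    out = []
    pos = 0.0
    while pos < len(signal) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])  # linear interpolation
        pos += factor
    return out

if __name__ == "__main__":
    sr = 16000
    voice = [math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 4)]
    girl_like = naive_pitch_shift(voice, 1.5)   # higher pitch (target character: girl)
    elder_like = naive_pitch_shift(voice, 0.8)  # lower pitch (target character: elderly)
    print(len(voice), len(girl_like), len(elder_like))
```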
In the embodiment of the application, a manner of determining the audio attribute parameters based on the role information is provided. By combining the role information of the virtual object in the virtual scene, a specific voice-changing effect that matches the role information can be produced. This adds interest and suits particular use scenarios (e.g., optimizing a particular voice).
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the embodiment of the present application, the first adjustment data includes an audio attribute parameter and accompanying audio data;
processing the first voice data by adopting the first adjustment data to obtain first audio data, which specifically may include:
Adjusting the first voice data by adopting audio attribute parameters to obtain target voice data;
Performing voice post-processing on the target voice data to obtain second processed voice data, wherein the voice post-processing comprises at least one of voice enhancement processing, frequency band gain processing and tuning processing;
And superposing the accompanying audio data and the second processed voice data to obtain first audio data.
In one or more embodiments, a method of processing the first voice data to obtain the first audio data is presented. As can be seen from the foregoing embodiments, the first adjustment data includes the audio attribute parameters and the accompanying audio data.
Specifically, the first voice data is first adjusted by using the audio attribute parameters to obtain the target voice data, that is, the voice data after voice changing. The target voice data can then be subjected to voice post-processing to obtain the second processed voice data. Voice post-processing refers to signal processing applied to the voice data before rendering, so that the voice data sounds clearer when played by the terminal. The voice post-processing includes at least one of a voice enhancement process, a frequency band gain process, and a tuning process, so as to satisfy the user's personalized settings. Examples include an equalizer that adjusts the gain of each frequency band, or tuning for the speaker characteristics of the terminal, for example, using frequency response curve compensation techniques. Finally, the accompanying audio data and the second processed voice data are superimposed, thereby obtaining the first audio data.
It should be noted that when a user speaks, the focus of the mix should fall on the voice data, not on the accompanying audio data. Based on this, as in radio broadcasting, the volume of the accompanying audio data needs to be lowered slightly while the voice data is playing, and restored after the voice data has been played.
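A minimal sketch of this ducking behavior is shown below; the gain values and activity threshold are assumptions, and a real mixer would smooth the gain changes over time rather than switching them per sample.

```python
def mix_with_ducking(voice, ambience, voice_gain=1.0, duck_gain=0.3,
                     idle_gain=0.8, threshold=0.02):
    """Superimpose accompanying audio onto voice, lowering (ducking) the
    ambience while the voice is active; all gain values are assumptions."""
    length = max(len(voice), len(ambience))
    voice = voice + [0.0] * (length - len(voice))
    ambience = ambience + [0.0] * (length - len(ambience))
    out = []
    for v, a in zip(voice, ambience):
        amb_gain = duck_gain if abs(v) > threshold else idle_gain
        out.append(voice_gain * v + amb_gain * a)
    return out

if __name__ == "__main__":
    voice = [0.0] * 100 + [0.5] * 200 + [0.0] * 100      # speech in the middle
    ambience = [0.1] * 400                                # constant background
    mixed = mix_with_ducking(voice, ambience)
    print(round(mixed[50], 2), round(mixed[200], 2))      # 0.08 vs 0.53
```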
In this embodiment of the present application, a method for processing the first voice data to obtain the first audio data is provided. In this way, voice design is integrated into the APP's sound effect design, broadening the ways in which a sound designer can shape voice. This also helps build an ecosystem for end-to-end voice processing, into which other professional processing algorithms and middleware can be added to jointly improve the user's voice experience.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, playing the first audio data may specifically include:
The first audio data is played through the audio output device in response to the audio play operation.
Or alternatively
The playing of the first audio data may specifically include:
and sending the first audio data to the second terminal so that the second terminal plays the first audio data through the audio output device.
In one or more embodiments, a manner of playing first audio data is presented. As can be seen from the foregoing embodiments, after the first audio data is obtained, the first audio data may be played by an audio output device (e.g., a speaker) of the first terminal or by an audio output device (e.g., a speaker) of the second terminal. The following description will be made in connection with specific scenarios.
1. Locally playing the first audio data;
Specifically, taking a dubbing scene as an example, a user records first voice data, and obtains the first audio data after processing. Wherein the first audio data represents audio content after dubbing. If the user needs to listen to the dubbing effect, an audio playing operation can be triggered, and therefore the first terminal plays the first audio data through the audio output device.
2. Remotely playing the first audio data;
Specifically, taking an in-game communication scene as an example, the user records first voice data, and the first audio data is obtained after processing. The first audio data represents audio content within the game match. Then, the first terminal transmits the first audio data to the server, and the server transmits the first audio data to the second terminal. Based on this, the second terminal can directly play the first audio data, or invoke an audio engine in the second terminal to process the first audio data and then play it.
In this embodiment of the present application, a method for playing the first audio data is provided. In this way, both playing the first audio data on the local terminal and playing it on the remote terminal are supported, which increases the flexibility and diversity of voice data playback.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, sending the first audio data to the second terminal may specifically include:
Acquiring a voice scene type;
If the voice scene type belongs to the first scene type, sending first audio data to a second terminal, wherein the first audio data are monophonic audio data, and the first audio data adopt a first sampling rate;
And if the voice scene type belongs to the second scene type, sending first audio data to the second terminal, wherein the first audio data are of stereo channels, and the first audio data adopt a second sampling rate which is higher than the first sampling rate.
In one or more embodiments, a method of performing source and channel coding on the first audio data is presented. As can be seen from the foregoing embodiments, the core of voice transmission over the internet protocol (voice over internet protocol, VoIP) is that digitally encoded signals are transmitted over a network, and several processing algorithms are typically required between recording the voice data at the first terminal and playing the audio data at the second terminal. In addition, packet loss, jitter, and the like may occur in the network, which introduces a certain delay, and the delay is closely related to the transmission bandwidth of the signal. That is, to reduce the delay, the bandwidth occupied by the signal needs to be reduced, whereas transmitting a high-bandwidth signal introduces a relatively large delay. Based on this, a corresponding network policy may be selected according to the voice scene type. The following will be presented in connection with examples.
1. A first scene type;
Specifically, the first scene type is a scene with strong real-time requirements, such as a game scene or a live broadcast scene. Such scenes require low-latency, fluent audio data. Accordingly, mono first audio data at the first sampling rate may be transmitted, and lossy speech coding (e.g., Opus) may also be applied to the first audio data. The first sampling rate is a low sampling rate, e.g., 16 kHz. Thus, the first audio data has a smaller data size and occupies less bandwidth, so the delay is small.
2. A second scene type;
Specifically, the second scene type is a scene with weaker real-time requirements, such as a personal voice radio scene or an online concert scene. In such scenes, the host mainly talks and does not communicate frequently with the audience, so the latency requirement is relaxed while audio quality matters more. Thus, stereo first audio data at the second sampling rate may be transmitted. The second sampling rate is a high sampling rate, e.g., 48 kHz.
It should be noted that the voice room types may be further defined; illustratively, a voice room of the first scene type belongs to a "smooth room type", and a voice room of the second scene type belongs to a "high definition room type". Different voice scene types can thus suit different application scenarios.
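The following sketch illustrates selecting transmission parameters by voice scene type; the room-type names and codec labels are hypothetical and do not correspond to actual GME configuration values.

```python
from dataclasses import dataclass

@dataclass
class VoiceEncodingConfig:
    sample_rate_hz: int
    channels: int
    codec: str

# Hypothetical mapping from voice scene type to transmission parameters,
# following the first/second scene types described above.
SCENE_CONFIGS = {
    "smooth":          VoiceEncodingConfig(16000, 1, "lossy-speech-codec"),   # games, live battles
    "high_definition": VoiceEncodingConfig(48000, 2, "high-quality-codec"),   # radio, online concerts
}

def config_for_scene(scene_type):
    """Return the encoding configuration for the given voice scene type."""
    return SCENE_CONFIGS[scene_type]

if __name__ == "__main__":
    print(config_for_scene("smooth"))
    print(config_for_scene("high_definition"))
```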
In another embodiment of the present application, a method for performing source and channel coding on first audio data is provided. By the method, the audio data can be subjected to information source and channel coding based on the voice scene type, so that the network transmission bandwidth is saved, and the weak network robustness is improved.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, sending the first audio data to the second terminal may specifically include:
carrying out framing treatment on the first audio data to obtain a second audio frame sequence, wherein the second audio frame sequence comprises at least one third audio frame;
sampling the second audio frame sequence to obtain a third sampling point data set of each third audio frame, wherein the third sampling point data set comprises P sampling point data, and P is an integer greater than 1;
Writing a third sample point data set of each third audio frame into the ring buffer;
Reading at least one fourth sampling point data set from the ring buffer, wherein each fourth sampling point data set comprises Q sampling point data, and Q is an integer greater than 1;
and performing audio frame conversion processing on at least one fourth sampling point data set, and sending the processed first audio data to the second terminal, wherein the first audio data comprises at least one fourth audio frame, and each fourth audio frame corresponds to one fourth sampling point data set.
In one or more embodiments, a method of obtaining first audio data is presented. As can be seen from the foregoing embodiments, the audio engine includes a transmitting plug-in, in which the first audio data output by the audio engine may be subjected to a framing process to obtain a second audio frame sequence including at least one third audio frame. Because the frame lengths of the audio engine and the APP speech system are different, the third audio frame cannot be directly treated as one audio frame in the APP speech system. Based on this, a ring buffer may be introduced for processing.
Specifically, for ease of understanding, referring to fig. 7, fig. 7 is a schematic workflow diagram of the transmitting plug-in in the embodiment of the present application. As shown in the figure, the first audio data (i.e., audio data frames of the audio engine) is first subjected to framing processing to obtain the second audio frame sequence. The voice parameters of the second audio frame sequence may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32-bit float, and a third-audio-frame length of 21.3 ms. It will be appreciated that the audio engine samples the second audio frame sequence to obtain the third sample point data set of each third audio frame. The third sample point data set includes P sample point data.
The present application is described by taking P as 1024 and Q as 960 as examples, but this should not be construed as limiting the application.
Based on this, the audio engine plays the role of a producer and writes the third sample point data set of each third audio frame into the ring buffer. For example, the audio engine writes 1024 sample point data at a time. The APP voice system (e.g., GME) acts as a consumer and reads at least one fourth sample point data set from the ring buffer. For example, the APP voice system (e.g., GME) reads 960 sample point data at a time. The sample point data read in each read operation forms one fourth sample point data set, and audio frame conversion processing is performed based on each fourth sample point data set to obtain the processed first audio data (i.e., audio data frames of the voice system) composed of the respective fourth audio frames. The voice parameters of the processed first audio data may include: a sampling rate of 16 kHz, mono, a sampling depth of 16-bit int, and a fourth-audio-frame length of 20 ms.
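A simplified sketch of the format conversion performed in the sending direction is given below, assuming a 48 kHz stereo float engine frame is down-mixed, decimated to 16 kHz, and quantized to 16-bit integers; re-framing through the ring buffer would follow as in the earlier sketch, and a real implementation would apply an anti-aliasing filter before decimation.

```python
def downmix_to_mono(stereo_frame):
    """Average left/right samples; `stereo_frame` is a list of (L, R) float pairs."""
    return [(l + r) / 2.0 for l, r in stereo_frame]

def resample_48k_to_16k(mono_48k):
    """Crude 3:1 decimation from 48 kHz to 16 kHz (a real implementation
    would low-pass filter first to avoid aliasing)."""
    return mono_48k[::3]

def float_to_int16(samples):
    """Convert 32-bit float samples in [-1, 1] to 16-bit integers."""
    return [max(-32768, min(32767, int(s * 32767))) for s in samples]

if __name__ == "__main__":
    # One hypothetical audio-engine frame: 1024 stereo float samples at 48 kHz.
    engine_frame = [(0.25, -0.25)] * 1024
    mono = downmix_to_mono(engine_frame)
    voice_rate = resample_48k_to_16k(mono)
    pcm16 = float_to_int16(voice_rate)
    print(len(engine_frame), "->", len(pcm16), "samples for the voice system")
```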
It will be appreciated that the above-described speech parameters are illustrative and should not be construed as limiting the application.
In this embodiment of the present application, a method for obtaining the first audio data is provided. In this way, although the APP voice system and the audio engine are both used to process digital audio signals, the audio signal parameters handled by the two systems may differ. Therefore, in order to enable the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of audio processing.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided by the embodiment of the present application may further include:
Receiving second audio data sent by a second terminal, wherein the second audio data is obtained after the second terminal processes second voice data by adopting second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object for a virtual scene, and the second virtual object is a virtual object controlled by the second object;
and playing the second audio data.
In one or more embodiments, a manner of playing far-end audio data is presented. As can be seen from the foregoing embodiments, the audio engine includes a receiving plug-in, in which the audio data input by the APP voice system (e.g., GME) may be subjected to audio frame conversion processing to obtain converted audio frames. Because the frame lengths of the audio engine and the APP voice system are different, a converted audio frame cannot be directly treated as one audio frame in the audio engine. Based on this, a ring buffer may be introduced for processing.
Specifically, for ease of understanding, referring to fig. 8, fig. 8 is a schematic diagram illustrating a workflow of a receiving plug-in an embodiment of the present application, as shown in the figure, first, audio data (i.e., audio data frames of a speech system) input by an APP speech system (e.g., GME) is subjected to audio frame conversion processing. Thus, an audio frame sequence with a sampling rate of 16kHz, mono, a sampling depth of 16bit-int, and a frame length of 20ms is obtained.
Based on this, the APP speech system (e.g., GME) acts as a producer, writing the sequence of audio frames into the ring buffer. For example, an APP voice system (e.g., GME) writes 960 sample point data at a time. The audio engine plays the role of a consumer, reading 1024 sample point data at a time from the ring buffer, thereby obtaining second audio data.
It will be appreciated that the second terminal collects second voice data derived from a second object, wherein the second object controls a second virtual object. Then, second adjustment data is determined according to scene association information of the second virtual object for the virtual scene. And processing the second voice data by adopting the second adjustment data to obtain second audio data. The specific process flow is similar to the flow of obtaining the first audio data, and will not be described here.
In this embodiment of the present application, a method for playing far-end audio data is provided. In this way, the audio data transmitted from the far end can be played directly. Because the audio data also changes as the virtual scene changes, the voice heard by the user carries a stronger sense of atmosphere, thereby improving the user experience.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, playing the second audio data may specifically include:
Framing the second audio data to obtain a third audio frame sequence, wherein the third audio frame sequence comprises at least one fifth audio frame;
sampling the third audio frame sequence to obtain a fifth sampling point data set of each fifth audio frame, wherein the fifth sampling point data set comprises X sampling point data, and X is an integer greater than 1;
writing a fifth set of sample point data for each fifth audio frame into the ring buffer;
reading at least one sixth sampling point data set from the ring buffer, wherein each sixth sampling point data set comprises Y sampling point data, and Y is an integer greater than 1;
And performing audio frame conversion processing on at least one sixth sampling point data set, and playing the processed second audio data, wherein the second audio data comprises at least one sixth audio frame, and each sixth audio frame corresponds to one sixth sampling point data set.
In one or more embodiments, a manner of playing second audio data is presented. As can be seen from the foregoing embodiments, the audio engine includes a play plug-in, in which the second audio data output by the audio engine may be subjected to a framing process to obtain a third audio frame sequence including at least one fifth audio frame. Because the frame lengths of the audio engine and the APP speech system are different, the fifth audio frame cannot be directly treated as one audio frame in the APP speech system. Based on this, a ring buffer may be introduced for processing.
Specifically, for ease of understanding, referring to fig. 9, fig. 9 is a schematic workflow diagram of the play plug-in in the embodiment of the present application. As shown in the figure, the second audio data (i.e., audio data frames of the audio engine) is first subjected to framing processing to obtain the third audio frame sequence. The voice parameters of the third audio frame sequence may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32-bit float, and a fifth-audio-frame length of 21.3 ms. It will be appreciated that the audio engine samples the third audio frame sequence to obtain the fifth sample point data set of each fifth audio frame. The fifth sample point data set includes X sample point data.
The present application is described by taking X as 1024 and Y as 960 as examples, but this should not be construed as limiting the application.
Based on this, the audio engine plays the role of a producer and writes the fifth sample point data set of each fifth audio frame into the ring buffer. For example, the audio engine writes 1024 sample point data at a time. The APP voice system (e.g., GME) acts as a consumer and reads at least one sixth sample point data set from the ring buffer. For example, the APP voice system (e.g., GME) reads 960 sample point data at a time. The sample point data read in each read operation forms one sixth sample point data set, and audio frame conversion processing is performed based on each sixth sample point data set to obtain the processed second audio data (i.e., audio data frames of the voice system) composed of the respective sixth audio frames. The voice parameters of the processed second audio data may include: a sampling rate of 16 kHz, mono, a sampling depth of 16-bit int, and a sixth-audio-frame length of 20 ms.
It will be appreciated that the above-described speech parameters are illustrative and should not be construed as limiting the application.
It should be noted that, since the play plug-in sits at the last link of the audio engine's processing chain, the data to be processed includes not only the voice data but also the other audio data in the APP. An audio designer may design various audio presentation formats, such as multi-channel audio and object audio, based on the audio material. These non-default-format signals first need to be converted into binaural audio using a down-mix algorithm (downmix) or an object audio rendering algorithm (OAR).
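As an illustration of such a down-mix, the sketch below folds a 5.1-channel frame into stereo using conventional-style coefficients; the channel layout and gains are assumptions, and an object-audio renderer would be used instead for object-based material.

```python
import math

# Conventional-style 5.1 -> stereo downmix coefficients (assumed here;
# actual engines may use different gains or an object-audio renderer instead).
CENTER_GAIN = 1.0 / math.sqrt(2.0)
SURROUND_GAIN = 1.0 / math.sqrt(2.0)

def downmix_5_1_to_stereo(frame_5_1):
    """`frame_5_1` is a list of (L, R, C, LFE, Ls, Rs) tuples; LFE is dropped."""
    stereo = []
    for l, r, c, _lfe, ls, rs in frame_5_1:
        left = l + CENTER_GAIN * c + SURROUND_GAIN * ls
        right = r + CENTER_GAIN * c + SURROUND_GAIN * rs
        stereo.append((left, right))
    return stereo

if __name__ == "__main__":
    frame = [(0.1, 0.1, 0.2, 0.05, 0.05, 0.05)] * 4
    print(downmix_5_1_to_stereo(frame)[0])
```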
In this embodiment of the present application, a method for playing the second audio data is provided. In this way, although the APP voice system and the audio engine are both used to process digital audio signals, the audio signal parameters handled by the two systems may differ. Therefore, in order to enable the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of audio processing.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, another optional embodiment provided by the embodiment of the present application may further include:
Receiving second audio data sent by a second terminal, wherein the second audio data is obtained after the second terminal processes second voice data by adopting second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object for a virtual scene, and the second virtual object is a virtual object controlled by the second object;
responding to the second audio data, and acquiring object association information according to the position information of the first virtual object and the second virtual object in the virtual scene;
Acquiring third adjustment data for the second audio data according to the object association information;
processing the second audio data by adopting third adjustment data to obtain third audio data;
And playing the third audio data.
In one or more embodiments, another way of playing far-end audio data is presented. As can be seen from the foregoing embodiments, the second terminal collects second voice data derived from a second object, wherein the second object controls a second virtual object. Then, second adjustment data is determined according to scene association information of the second virtual object for the virtual scene. And processing the second voice data by adopting the second adjustment data to obtain second audio data. The specific process flow is similar to the flow of obtaining the first audio data, and will not be described here.
Specifically, the virtual scene includes the first virtual object and the second virtual object, so the object association information can be obtained according to the position information of the first virtual object and the second virtual object in the virtual scene. The object association information includes distance information, direction information, and obstacle information between the virtual objects. Based on this, after the first terminal receives the second audio data transmitted by the second terminal, the third adjustment data for the second audio data may be acquired according to the object association information. The second audio data is processed by using the third adjustment data to obtain the third audio data, and finally the first terminal plays the third audio data.
Illustratively, the third adjustment data is generated based on the distance information included in the object association information, and the third adjustment data includes a distance attenuation parameter. The distance attenuation parameter adjusts the loudness of the audio data.
Illustratively, the third adjustment data is generated based on the direction information included in the object association information, and the third adjustment data includes 3D parameters. Through adjustment with the 3D parameters, the sound heard by the first object reflects the direction of the second virtual object relative to the first virtual object.
Illustratively, the third adjustment data is generated based on the obstacle information included in the object association information; if the obstacle information indicates that there is an obstacle between the first virtual object and the second virtual object, the third adjustment data includes sound reflection parameters. The sound reflection parameters reflect the reflection and diffraction effects of sound in different spaces.
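The sketch below combines the three kinds of adjustment in a toy form: distance attenuation, a crude direction-dependent pan standing in for the 3D parameters, and a fixed occlusion gain standing in for the sound reflection parameters; all constants are assumptions.

```python
import math

def spatialize(sample, listener_pos, source_pos, occluded=False):
    """Apply distance attenuation, a crude left/right pan, and an occlusion
    gain to one mono sample; the constants are illustrative assumptions."""
    dx = source_pos[0] - listener_pos[0]
    dy = source_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    distance_gain = 1.0 / (1.0 + distance)          # distance attenuation
    occlusion_gain = 0.4 if occluded else 1.0       # obstacle between the objects
    azimuth = math.atan2(dx, dy)                    # direction of the source
    pan = 0.5 * (1.0 + math.sin(azimuth))           # 0 = full left, 1 = full right
    s = sample * distance_gain * occlusion_gain
    return (s * (1.0 - pan), s * pan)               # (left, right)

if __name__ == "__main__":
    listener = (0.0, 0.0)
    talker = (3.0, 4.0)                              # 5 units away, to the front-right
    left, right = spatialize(1.0, listener, talker, occluded=True)
    print(round(left, 3), round(right, 3))
```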
It should be noted that, the processing flow of the second audio data may refer to the embodiment described in fig. 8, and will not be described herein.
In this embodiment of the present application, another way of playing far-end audio data is provided. In this way, the audio data transmitted from the far end can be further optimized. The audio data not only changes as the virtual scene changes but also takes the relative positions of the virtual objects into account, so the voice heard by the user has a stronger sense of realism, which improves the user experience.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, playing third audio data may specifically include:
Framing the third audio data to obtain a fourth audio frame sequence, wherein the fourth audio frame sequence comprises at least one seventh audio frame;
Sampling the fourth audio frame sequence to obtain a seventh sampling point data set of each seventh audio frame, wherein the seventh sampling point data set comprises S sampling point data, and S is an integer greater than 1;
writing a seventh set of sample point data for each seventh audio frame into the ring buffer;
reading at least one eighth sampling point data set from the ring buffer, wherein each eighth sampling point data set comprises R sampling point data, and R is an integer greater than 1;
And performing audio frame conversion processing on at least one eighth sampling point data set, and playing the processed third audio data, wherein the third audio data comprises at least one eighth audio frame, and each eighth audio frame corresponds to one eighth sampling point data set.
In one or more embodiments, a manner of playing third audio data is presented. As can be seen from the foregoing embodiments, the audio engine includes a play plug-in, in which the third audio data output by the audio engine may be subjected to framing processing to obtain a fourth audio frame sequence including at least one seventh audio frame. Because the frame lengths of the audio engine and the APP speech system are different, the seventh audio frame cannot be directly treated as one audio frame in the APP speech system. Based on this, a ring buffer may be introduced for processing.
Specifically, for ease of understanding, please refer to fig. 9 again. As shown in the figure, the third audio data (i.e., audio data frames of the audio engine) is first subjected to framing processing to obtain the fourth audio frame sequence. The voice parameters of the fourth audio frame sequence may include: a sampling rate of 48 kHz, two channels, a sampling depth of 32-bit float, and a seventh-audio-frame length of 21.3 ms. It will be appreciated that the audio engine samples the fourth audio frame sequence to obtain the seventh sample point data set of each seventh audio frame. The seventh sample point data set includes S sample point data.
The present application is described by taking S as 1024 and R as 960 as examples, but this should not be construed as limiting the application.
Based on this, the audio engine plays the role of a producer and writes the seventh sample point data set of each seventh audio frame into the ring buffer. For example, the audio engine writes 1024 sample point data at a time. The APP voice system (e.g., GME) acts as a consumer and reads at least one eighth sample point data set from the ring buffer. For example, the APP voice system (e.g., GME) reads 960 sample point data at a time. The sample point data read in each read operation forms one eighth sample point data set, and audio frame conversion processing is performed based on each eighth sample point data set to obtain the processed third audio data (i.e., audio data frames of the voice system) composed of the respective eighth audio frames. The voice parameters of the processed third audio data may include: a sampling rate of 16 kHz, mono, a sampling depth of 16-bit int, and an eighth-audio-frame length of 20 ms.
It will be appreciated that the above-described speech parameters are illustrative and should not be construed as limiting the application.
It should be noted that, since the play plug-in sits at the last link of the audio engine's processing chain, the data to be processed includes not only the voice data but also the other audio data in the APP. An audio designer may design various audio presentation formats, such as multi-channel audio and object audio, based on the audio material. These non-default-format signals first need to be converted into binaural audio using a down-mix algorithm (downmix) or an object audio rendering algorithm (OAR).
In this embodiment of the present application, a method for playing the third audio data is provided. In this way, although the APP voice system and the audio engine are both used to process digital audio signals, the audio signal parameters handled by the two systems may differ. Therefore, in order to enable the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of audio processing.
Optionally, on the basis of the foregoing respective embodiments corresponding to fig. 3, in another optional embodiment provided by the present application, receiving second audio data sent by a second terminal may specifically include:
and receiving second audio data sent by the second terminal based on a data transmission strategy, wherein the data transmission strategy comprises at least one of a forward error correction strategy, a packet loss compensation strategy, an automatic retransmission strategy and a buffer anti-jitter strategy, the forward error correction strategy is used for correcting error codes occurring in the data transmission process, the packet loss compensation strategy is used for compensating missing audio frames, the automatic retransmission strategy is used for requesting retransmission of the audio frames, and the buffer anti-jitter strategy is used for adopting dynamic buffer in the data transmission process.
In one or more embodiments, a manner of receiving the second audio data based on a data transmission policy is presented. As can be seen from the foregoing embodiments, audio data transmitted over a network must cope with various network problems, the common ones being packet loss and jitter. Based on this, the APP voice system (e.g., GME) needs to employ data transmission policies to increase its resistance to weak networks. The manner in which the first terminal receives the second audio data based on the data transmission policy is described below.
1. Forward error correction (forward error correction, FEC) strategies;
Specifically, the FEC strategy is an error-control strategy. Before the signal is sent into the transmission channel, it is encoded in advance according to a certain algorithm, and redundant codes carrying the characteristics of the signal are added; the receiving end then decodes the received signal according to the corresponding algorithm, thereby finding and correcting error codes generated during transmission.
2. Packet loss compensation (packet loss concealment, PLC) strategy;
Specifically, the PLC strategy uses all available information to estimate and compensate for a lost frame appropriately, so that the loss is concealed and not easily perceived, thereby improving the audio quality.
3. An automatic retransmission (automatic repeat request, ARQ) policy;
Specifically, the receiving end sends an acknowledgement (acknowledge character, ACK) signal, indicating that a received packet contains no errors, or a negative acknowledgement (negative acknowledge character, NACK) signal, indicating that a received packet contains errors, to the transmitting end, and retransmission is then automatically requested.
4. A cache anti-Jitter (Jitter) policy;
Specifically, the jitter buffer is used to handle packet loss, out-of-order arrival, delayed arrival, and the like, smoothly outputting audio frames to the decoding module, resisting the influence of various weak-network conditions on playback or rendering, and reducing stuttering.
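A minimal jitter-buffer sketch is shown below; it only reorders packets by sequence number and substitutes silence for missing frames, whereas a real implementation would use adaptive buffering and proper packet loss concealment.

```python
class JitterBuffer:
    """Minimal jitter buffer: reorders packets by sequence number and fills a
    missing frame with silence (a trivial stand-in for packet loss concealment)."""

    def __init__(self, frame_size):
        self.frame_size = frame_size
        self.pending = {}          # sequence number -> frame samples
        self.next_seq = 0

    def push(self, seq, frame):
        if seq >= self.next_seq:   # drop packets that arrive too late
            self.pending[seq] = frame

    def pop(self):
        frame = self.pending.pop(self.next_seq, None)
        if frame is None:
            frame = [0] * self.frame_size   # concealment for a lost/late frame
        self.next_seq += 1
        return frame

if __name__ == "__main__":
    jb = JitterBuffer(frame_size=4)
    jb.push(0, [1] * 4)
    jb.push(2, [3] * 4)            # packet 1 is lost, packet 2 arrives out of order
    print(jb.pop(), jb.pop(), jb.pop())
```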
In an embodiment of the present application, a method for receiving second audio data based on a data transmission policy is provided. By the method, when the weak network condition occurs, good communication quality can be maintained at the receiving end.
In connection with the above embodiments, the end-to-end processing flow will be described with reference to the drawings. Referring to fig. 10, fig. 10 is a schematic diagram of implementing end-to-end speech processing based on an audio engine in an embodiment of the present application. Taking GME as the APP voice system as an example, the plug-in solution based on the audio engine mainly involves integration on the client side, while the capability of the server side is represented by the GME voice server cluster. According to their specific functions, the audio-engine-standard plug-ins are divided into four types, namely a collection plug-in, a sending plug-in, a receiving plug-in and a playing plug-in, which serve as bridges for exchanging data with the audio engine. The collection plug-in and the receiving plug-in are used for pushing voice signals into the audio engine for subsequent processing, while the sending plug-in and the playing plug-in are used for outputting voice signals from the audio engine to the SDK. In addition, the audio engine has built-in processing and audio control capabilities that the sound designer uses to achieve various sound effects within the APP. Since the voice data is also fed into the audio engine, these audio processing capabilities are equally applicable to voice processing.
Based on this plug-in-based audio engine processing scheme, the processing flow of voice data from reception to transmission includes the following:
(1) Pre-processing of voice: the GME acquires, through the system API, the local user's voice captured by the audio input device; pre-processing removes echo and noise from the recorded data and performs processing such as volume equalization and howling suppression.
(2) Local user speech: after passing through the collection plug-in, the pre-processed voice signal can be sent to the audio engine as a sound source signal. The processing and control capabilities built into the audio engine may be used to process the incoming voice data stream and implement the immersive voice sound effects associated with the virtual scene. The voice data processed by the built-in capability of the audio engine is output from the audio engine to the GME voice SDK through the transmitting plug-in unit to continue the next processing.
(3) Accompanying audio data: the audio and/or music local to the APP can also be used as accompanying audio data and output to the GME voice SDK through the sending plug-in. After further processing, it is shared to the voice chat room; based on this capability, the audio designer can design interesting and varied voice playing methods.
(4) Source and channel coding: and according to the voice scene type, performing information source and channel coding on voice streams or other audio streams output by the sending plug-in to the GME voice SDK so as to save network transmission bandwidth and increase weak network robustness, and sending the coded data to the GME voice server cluster.
(5) GME voice server cluster: all the voices of the client are required to be sent to the GME voice server, and the GME voice server forwards the target voice stream to the corresponding terminal according to the member list of the voice chat room. The GME voice server employs an authentication mechanism to ensure that only legitimate clients can connect to the GME voice server. The background quality of service (quality of service, qoS) policy ensures that clients can connect to the nearest GME voice server and provide reliable quality voice services.
(6) Network policy and decoding: voice transmission over a network must cope with various network problems; based on this, FEC policies, PLC policies, ARQ policies, and cache anti-jitter policies may be employed to increase resistance to weak networks. The encoded voice stream is then decoded, and the decoded pulse code modulation (pulse code modulation, PCM) voice stream is passed to the next link for processing.
(7) Voice mixing: the GME client receives voice streams sent by all the users and forwarded by the GME voice server, and the voice streams are mixed at the GME client, namely, multiple voice streams are mixed into one voice stream.
(8) Each player's voice stream: to achieve the immersive voice sound effect associated with the virtual scene, each received voice stream needs to be processed separately, for example, with 3D processing, distance attenuation, or processing that simulates sound reflection. Each decoded voice stream is sent to the audio engine as a sound source signal through the receiving plug-in; the processing and control capabilities built into the audio engine can process the incoming voice streams to realize the immersive voice sound effect associated with the virtual scene, and the processed voice data is output to the audio engine bus to be mixed with the other sound effects of the audio engine.
(9) Voice stream after mixing: if the individual voice streams do not need separate processing, the voice data that has been mixed in step (7) can be sent to the audio engine processing pipeline through the receiving plug-in as a sound source signal, and the processed voice data is output to the audio engine bus to be mixed with the other sound effects of the audio engine.
(10) All local sound effects or music: audio designers use audio engines to design and process conventional sounds, including other sounds in the game that are not speech.
(11) Audio stream bus: the sound processed by the audio engine is sent to the audio output device for playing after passing through the audio stream bus. The audio engine will perform corresponding mixing and sample rate conversion based on the number of channels of the audio output device, the sampling frequency and format (i.e., channel-based audio or object-based audio).
(12) Playing plug-in: the sound data processed by the audio engine is sent to the playing plug-in. Besides performing some configuration work related to authentication, the playing plug-in routes all sound data output by the audio engine to the echo cancellation module of the GME as a reference signal. The echo cancellation module cancels, based on this reference signal, the echo in the signal recorded by the audio input device (e.g., microphone); the echo refers to any local audio played from the audio output device (e.g., speaker) that is picked up by the audio input device, so that the signal to be sent to the voice room contains only the voice of the local user.
(13) Post-processing of voice: some signal processing is performed before the speech rendering, thereby making the speech more clear when played by a particular device.
Referring to fig. 11, fig. 11 is a schematic diagram showing an embodiment of an audio data processing device according to an embodiment of the present application, and an audio data processing device 30 includes:
an acquisition module 310 for acquiring first voice data derived from a first object;
The obtaining module 310 is further configured to obtain, in response to the first voice data, scene association information of a first virtual object for a virtual scene, where the first virtual object is a virtual object controlled by the first object, and the scene association information includes at least one of location information, environment information, emotion information, and role information of the first virtual object in the virtual scene;
the obtaining module 310 is further configured to obtain first adjustment data for the first voice data according to the scene association information, where the first adjustment data includes at least one of an audio attribute parameter and accompanying audio data, the audio attribute parameter is used for adjusting the playing effect of the voice data, and the accompanying audio data is used for providing at least one of background music and background sound effects;
The processing module 320 is configured to process the first voice data by using the first adjustment data to obtain first audio data;
and a playing module 330, configured to play the first audio data.
The embodiment of the application provides an audio data processing device. By adopting the device, the voice component, implemented as plug-ins, is introduced into the audio engine local to the terminal, and voice data can be sent directly into the audio engine for processing. That is, according to the scene association information of the virtual object in the virtual scene, adjustment data conforming to the current scene effect is generated, and the real-time voice data is processed by using the adjustment data, so as to achieve rich sound processing effects and an immersive voice effect that follows scene changes.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
An acquisition module 310, specifically configured to acquire, through an audio input device, original speech data derived from a first object;
Performing voice preprocessing on the original voice data to obtain first processed voice data, wherein the voice preprocessing comprises at least one of noise reduction processing, echo cancellation processing, volume balancing processing and howling suppression processing;
and performing audio frame conversion processing on the first processed voice data to obtain the first voice data, wherein the audio frame conversion processing comprises at least one of sampling rate conversion processing, sound channel conversion processing, sampling depth conversion processing and audio frame length conversion processing.
The embodiment of the application provides an audio data processing device. By adopting the device, the collected voice data of the user is subjected to voice preprocessing, so that the processed voice data can represent the essential characteristics of voice, and a better audio processing effect is achieved.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The obtaining module 310 is specifically configured to perform an audio frame conversion process on the first processed voice data to obtain a first audio frame sequence, where the first audio frame sequence includes at least one first audio frame;
Sampling the first audio frame sequence to obtain a first sampling point data set of each first audio frame, wherein the first sampling point data set comprises M sampling point data, and M is an integer greater than 1;
Writing a first set of sample point data for each first audio frame into a ring buffer;
Reading at least one second sampling point data set from the ring buffer, wherein each second sampling point data set comprises N sampling point data, and N is an integer greater than 1;
First speech data is generated from at least one second set of sample point data, wherein the first speech data comprises at least one second audio frame, each second audio frame corresponding to one of the second set of sample point data.
The embodiment of the application provides an audio data processing device. With the above arrangement, it is taken into account that both the APP voice system and the audio engine process digital audio signals, but the audio signal parameters handled by the two systems may differ. Therefore, to allow the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of the audio processing.
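A minimal sketch of this ring-buffer frame-length adaptation is given below; the frame sizes (480 samples in, 1024 samples out) and the class name are assumptions used only to make the example concrete.

from collections import deque

class FrameAdapter:
    # Ring-buffer style adapter: frames of M samples are written in,
    # frames of N samples are read out once enough data has accumulated.
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def write_frame(self, samples):
        self.buf.extend(samples)

    def read_frame(self, n):
        if len(self.buf) < n:           # not enough samples buffered yet
            return None
        return [self.buf.popleft() for _ in range(n)]

# Usage sketch: 480-sample frames (10 ms at 48 kHz) in, 1024-sample blocks out.
adapter = FrameAdapter(capacity=48000)
for _ in range(10):                     # dummy frames standing in for captured voice
    adapter.write_frame([0.0] * 480)
    block = adapter.read_frame(1024)
    if block is not None:
        print("deliver", len(block), "samples to the audio engine")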
Optionally, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 11, the scene association information includes location information, and the first adjustment data includes audio attribute parameters;
the obtaining module 310 is specifically configured to determine a spatial type of the first virtual object in the virtual scene according to the location information;
if the space type belongs to a first space type, determining that the audio attribute parameters for the first voice data comprise reverberation adjustment parameters, wherein the first space type represents a virtual space which is smaller than or equal to a space size threshold in the virtual scene;
If the spatial type belongs to a second spatial type, determining that the audio attribute parameters for the first voice data comprise echo adjustment parameters, wherein the second spatial type represents a virtual space in the virtual scene that is greater than a spatial size threshold.
The embodiment of the application provides an audio data processing device. By adopting the device, the voice effect perceived by the user in a real environment can be simulated by combining the position information of the virtual object in the virtual scene. This brings the result closer to how sound propagates in a real environment and provides the user with an immersive voice effect.
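For illustration, a sketch of this position-based selection follows; the threshold value and parameter names are assumptions, since the embodiment only prescribes reverberation for small spaces and echo for large ones.

SPACE_SIZE_THRESHOLD = 50.0            # assumed size threshold of the virtual space

def adjustment_for_position(space_size):
    # Small enclosed space (first space type): add reverberation.
    if space_size <= SPACE_SIZE_THRESHOLD:
        return {"reverb_wet": 0.4, "reverb_decay_s": 1.2}
    # Large or open space (second space type): add echo.
    return {"echo_delay_ms": 350, "echo_feedback": 0.3}

print(adjustment_for_position(20.0))   # e.g. a room or cave -> reverberation parameters
print(adjustment_for_position(400.0))  # e.g. a valley or plain -> echo parameters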
Optionally, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 11, the scene-related information includes environmental information, and the first adjustment data includes accompanying audio data;
The obtaining module 310 is specifically configured to determine that the accompanying audio data for the first voice data includes audio data recorded based on the indoor environment if the environment information indicates that the first virtual object is in the indoor environment;
if the environment information indicates that the first virtual object is in the outdoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the outdoor environment;
if the environment information indicates that the first virtual object is in a weather environment, determining that the accompanying audio data for the first voice data includes audio data recorded based on the weather environment.
The embodiment of the application provides an audio data processing device. By adopting the device, the voice effect perceived by the user in a real environment can be simulated by combining the environment information of the virtual object in the virtual scene. This brings the result closer to how sound propagates in a real environment and provides the user with an immersive voice effect.
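A sketch of this environment-based selection is shown below; the environment labels and asset file names are hypothetical, the embodiment only requires that the accompanying audio matches the environment in which the first virtual object is located.

AMBIENCE_LIBRARY = {
    "indoor":  "ambience_room_tone.wav",    # audio recorded based on an indoor environment
    "outdoor": "ambience_field_birds.wav",  # audio recorded based on an outdoor environment
    "rain":    "ambience_rain_loop.wav",    # audio recorded based on a weather environment
    "wind":    "ambience_wind_loop.wav",
}

def accompanying_audio_for(environment):
    return AMBIENCE_LIBRARY.get(environment)  # None -> no accompanying audio is added

print(accompanying_audio_for("rain"))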
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing apparatus 30 provided by the embodiment of the present application, the scene association information includes emotion information, and the first adjustment data includes audio attribute parameters;
The obtaining module 310 is specifically configured to determine that the audio attribute parameters for the first voice data include at least one of a time domain adjustment parameter, a first frequency domain adjustment parameter, and an up-tone adjustment parameter if the emotion information indicates that the first virtual object is in the first emotion state, where the time domain adjustment parameter is used for enhancing the onset (sound head) of the voice, and the first frequency domain adjustment parameter is used for enhancing the high-frequency components of the voice;
if the emotion information indicates that the first virtual object is in the second emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a speech adjustment parameter and a period adjustment parameter, wherein the speech adjustment parameter is used for slowing down the speech rate, and the period adjustment parameter is used for periodically changing the intonation;
If the emotion information indicates that the first virtual object is in a third emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of second frequency domain adjustment parameters and tone reduction adjustment parameters, wherein the second frequency domain adjustment parameters are used for enhancing low frequency components of the voice data.
The embodiment of the application provides an audio data processing device. By adopting the device, the emotion information of the virtual object in the virtual scene is combined, so that the emotion of the user in a real environment can be simulated and the character and state of the virtual object can be reflected. This brings the result closer to how sound is conveyed in a real environment and provides the user with an immersive voice effect.
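The following sketch shows one possible mapping from emotion states to audio attribute parameters; the state names and numerical values are assumptions that merely follow the qualitative description above (stronger onset and brighter highs for the first state, slower rate and periodic intonation for the second, fuller lows and lower pitch for the third).

EMOTION_PRESETS = {
    "excited": {"onset_gain_db": 3.0,            # time-domain: enhance the voice onset
                "high_shelf_gain_db": 4.0,       # first frequency-domain: enhance highs
                "pitch_shift_semitones": 1.0},   # up-tone adjustment
    "calm":    {"speech_rate": 0.9,              # slow down the speech rate
                "intonation_period_s": 2.0},     # periodically change the intonation
    "low":     {"low_shelf_gain_db": 4.0,        # second frequency-domain: enhance lows
                "pitch_shift_semitones": -1.0},  # tone-reduction adjustment
}

def adjustment_for_emotion(state):
    return EMOTION_PRESETS.get(state, {})        # unknown state -> leave the voice unchanged

print(adjustment_for_emotion("excited"))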
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing apparatus 30 provided in the embodiment of the present application, the scene association information includes role information, and the first adjustment data includes audio attribute parameters;
the obtaining module 310 is specifically configured to determine that the audio attribute parameters for the first voice data include a pitch adjustment parameter, a timbre adjustment parameter, and a formant adjustment parameter if the character information indicates that the first virtual object belongs to the target character type.
The embodiment of the application provides an audio data processing device. By adopting the device, a specific voice-changing effect that matches the character information can be simulated by combining the character information of the virtual object in the virtual scene. This adds interest and suits particular use scenarios (e.g., optimizing a particular voice).
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application, the first adjustment data includes the audio attribute parameters and the accompanying audio data;
The processing module 320 is specifically configured to adjust the first voice data by using the audio attribute parameter to obtain target voice data;
Performing voice post-processing on the target voice data to obtain second processed voice data, wherein the voice post-processing comprises at least one of voice enhancement processing, frequency band gain processing and tuning processing;
And superposing the accompanying audio data and the second processed voice data to obtain first audio data.
The embodiment of the application provides an audio data processing device. By adopting the device, voice design is integrated into the APP sound effect design, broadening the ways in which a sound effect designer can design voice. On this basis, an ecosystem for end-to-end voice processing can be developed, so that other professional processing algorithms and middleware can join the ecosystem and jointly improve the user's voice experience.
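A minimal sketch of this three-step chain (adjust, post-process, superpose) is given below, assuming all signals are floating-point sample arrays at the same sampling rate; the concrete operations are simple stand-ins for whatever effects the designer configures.

import numpy as np

def build_first_audio(voice, attr_params, accompaniment):
    # 1) adjust the first voice data with the audio attribute parameters (here a gain)
    target = voice * attr_params.get("gain", 1.0)
    # 2) voice post-processing (a crude overall gain stands in for enhancement/tuning)
    processed = np.clip(target * attr_params.get("post_gain", 1.0), -1.0, 1.0)
    # 3) superpose the accompanying audio (background music / ambience) on the voice
    n = min(len(processed), len(accompaniment))
    return processed[:n] + 0.3 * accompaniment[:n]   # 0.3 is an assumed ambience level

voice = np.random.uniform(-0.5, 0.5, 48000)          # dummy 1 s of voice at 48 kHz
ambience = np.random.uniform(-0.2, 0.2, 48000)       # dummy accompanying audio data
first_audio = build_first_audio(voice, {"gain": 1.2, "post_gain": 1.0}, ambience)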
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The playing module 330 is specifically configured to play the first audio data through the audio output device in response to an audio playing operation.
Or alternatively
The playing module 330 is specifically configured to send the first audio data to the second terminal, so that the second terminal plays the first audio data through the audio output device.
The embodiment of the application provides an audio data processing device. By adopting the device, the local terminal is supported to play the first audio data, and the remote terminal is also supported to play the first audio data. Thus, flexibility and diversity of voice data playback is increased.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The playing module 330 is specifically configured to obtain a voice scene type;
If the voice scene type belongs to the first scene type, sending the first audio data to the second terminal, where the first audio data is mono audio data and adopts a first sampling rate;
and if the voice scene type belongs to the second scene type, sending the first audio data to the second terminal, where the first audio data is stereo audio data and adopts a second sampling rate higher than the first sampling rate.
The embodiment of the application provides an audio data processing device. By adopting the device, source coding and channel coding can be applied to the audio data based on the voice scene type, which saves network transmission bandwidth and improves robustness under weak network conditions.
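As an illustration, the following sketch picks the channel layout and sampling rate from the voice scene type before encoding and sending; the scene names and numeric settings are assumptions.

def encoding_config(voice_scene_type):
    if voice_scene_type == "team_chat":      # first scene type: intelligibility and bandwidth first
        return {"channels": 1, "sample_rate_hz": 16000, "bitrate_kbps": 24}
    if voice_scene_type == "show_room":      # second scene type: fidelity first
        return {"channels": 2, "sample_rate_hz": 48000, "bitrate_kbps": 128}
    return {"channels": 1, "sample_rate_hz": 16000, "bitrate_kbps": 24}   # default

print(encoding_config("show_room"))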
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The playing module 330 is specifically configured to perform frame-splitting processing on the first audio data to obtain a second audio frame sequence, where the second audio frame sequence includes at least one third audio frame;
sampling the second audio frame sequence to obtain a third sampling point data set of each third audio frame, wherein the third sampling point data set comprises P sampling point data, and P is an integer greater than 1;
Writing a third sample point data set of each third audio frame into the ring buffer;
Reading at least one fourth sampling point data set from the ring buffer, wherein each fourth sampling point data set comprises Q sampling point data, and Q is an integer greater than 1;
and performing audio frame conversion processing on at least one fourth sampling point data set, and sending the processed first audio data to the second terminal, wherein the first audio data comprises at least one fourth audio frame, and each fourth audio frame corresponds to one fourth sampling point data set.
The embodiment of the application provides an audio data processing device. With the above arrangement, it is taken into account that both the APP voice system and the audio engine process digital audio signals, but the audio signal parameters handled by the two systems may differ. Therefore, to allow the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of the audio processing.
Optionally, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application, the audio data processing device further includes a receiving module 340;
The receiving module 340 is configured to receive second audio data sent by a second terminal, where the second audio data is obtained by processing second voice data by the second terminal using second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object for a virtual scene, and the second virtual object is a virtual object controlled by the second object;
The playing module 330 is further configured to play the second audio data.
The embodiment of the application provides an audio data processing device. By adopting the device, the audio data transmitted from the far end can be played directly. Because the audio data also changes with the virtual scene, the voice heard by the user carries a stronger sense of atmosphere, thereby improving the user experience.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The playing module 330 is specifically configured to perform frame-splitting processing on the second audio data to obtain a third audio frame sequence, where the third audio frame sequence includes at least one fifth audio frame;
sampling the third audio frame sequence to obtain a fifth sampling point data set of each fifth audio frame, wherein the fifth sampling point data set comprises X sampling point data, and X is an integer greater than 1;
writing a fifth set of sample point data for each fifth audio frame into the ring buffer;
reading at least one sixth sampling point data set from the ring buffer, wherein each sixth sampling point data set comprises Y sampling point data, and Y is an integer greater than 1;
And performing audio frame conversion processing on at least one sixth sampling point data set, and playing the processed second audio data, wherein the second audio data comprises at least one sixth audio frame, and each sixth audio frame corresponds to one sixth sampling point data set.
The embodiment of the application provides an audio data processing device. With the above arrangement, it is taken into account that both the APP voice system and the audio engine process digital audio signals, but the audio signal parameters handled by the two systems may differ. Therefore, to allow the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of the audio processing.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The receiving module 340 is further configured to receive second audio data sent by the second terminal, where the second audio data is obtained by processing, by the second terminal, second voice data with second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object with respect to a virtual scene, and the second virtual object is a virtual object controlled by the second object;
The obtaining module 310 is further configured to obtain object association information according to the position information of the first virtual object and the second virtual object in the virtual scene in response to the second audio data;
the obtaining module 310 is further configured to obtain third adjustment data for the second audio data according to the object association information;
the processing module 320 is further configured to process the second audio data by using the third adjustment data to obtain third audio data;
The playing module 330 is further configured to play the third audio data.
The embodiment of the application provides an audio data processing device. By adopting the device, the audio data transmitted from the far end can be further optimized. The audio data thus not only changes with the virtual scene but also takes the relative position between the virtual objects into account, so the voice heard by the user carries a stronger sense of realism, improving the user experience.
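One way to derive the third adjustment data from the relative positions is sketched below: distance attenuation plus a simple left/right pan. The attenuation law, the maximum distance and the parameter names are assumptions for illustration only.

import math

def adjustment_for_relative_position(listener_pos, speaker_pos, max_distance=50.0):
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    distance = math.hypot(dx, dy)
    gain = max(0.0, 1.0 - distance / max_distance)   # farther away -> quieter
    pan = max(-1.0, min(1.0, dx / max_distance))     # -1 = full left, +1 = full right
    return {"gain": gain, "pan": pan}

print(adjustment_for_relative_position((0.0, 0.0), (10.0, 5.0)))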
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The playing module 330 is specifically configured to perform frame-splitting processing on the third audio data to obtain a fourth audio frame sequence, where the fourth audio frame sequence includes at least one seventh audio frame;
Sampling the fourth audio frame sequence to obtain a seventh sampling point data set of each seventh audio frame, wherein the seventh sampling point data set comprises S sampling point data, and S is an integer greater than 1;
writing a seventh set of sample point data for each seventh audio frame into the ring buffer;
reading at least one eighth sampling point data set from the ring buffer, wherein each eighth sampling point data set comprises R sampling point data, and R is an integer greater than 1;
And performing audio frame conversion processing on at least one eighth sampling point data set, and playing the processed third audio data, wherein the third audio data comprises at least one eighth audio frame, and each eighth audio frame corresponds to one eighth sampling point data set.
The embodiment of the application provides an audio data processing device. With the above arrangement, it is taken into account that both the APP voice system and the audio engine process digital audio signals, but the audio signal parameters handled by the two systems may differ. Therefore, to allow the two systems to exchange audio data, conversion and unification can be performed at the time of data exchange, thereby ensuring the feasibility and operability of the audio processing.
Alternatively, on the basis of the embodiment corresponding to fig. 11, in another embodiment of the audio data processing device 30 provided in the embodiment of the present application,
The receiving module 340 is specifically configured to receive, based on a data transmission policy, second audio data sent by the second terminal, where the data transmission policy includes at least one of a forward error correction policy, a packet loss compensation policy, an automatic retransmission policy, and a buffer anti-jitter policy, the forward error correction policy is used for correcting an error code occurring in a data transmission process, the packet loss compensation policy is used for compensating a missing audio frame, the automatic retransmission policy is used for requesting retransmission of the audio frame, and the buffer anti-jitter policy is used for adopting dynamic buffering in the data transmission process.
The embodiment of the application provides an audio data processing device. By adopting the device, better communication quality can be maintained at the receiving end when weak network conditions occur.
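For illustration, a possible receive-side configuration covering the four policies is sketched below; the field names and values are assumptions, not settings defined by the embodiment.

RECEIVE_POLICY = {
    "forward_error_correction": True,              # correct bit errors without retransmission
    "packet_loss_concealment": True,               # compensate missing audio frames
    "automatic_retransmission": True,              # request retransmission of lost frames
    "jitter_buffer_ms": {"min": 40, "max": 200},   # dynamic buffering against jitter
}

def on_packet_late(policy, current_buffer_ms):
    # grow the dynamic jitter buffer, up to the configured maximum
    return min(policy["jitter_buffer_ms"]["max"], current_buffer_ms + 20)

print(on_packet_late(RECEIVE_POLICY, 60))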
The embodiment of the application also provides a terminal, as shown in fig. 12; for details not described here, refer to the method part of the embodiment of the application. In the embodiment of the application, the terminal is described by taking a mobile phone as an example:
Fig. 12 is a block diagram of a part of the structure of a mobile phone related to the terminal provided by the embodiment of the present application. Referring to fig. 12, the mobile phone includes: a radio frequency (RF) circuit 410, a memory 420, an input unit 430, a display unit 440, a sensor 450, an audio circuit 460, a wireless fidelity (WiFi) module 470, a processor 480, and a power supply 490. Those skilled in the art will appreciate that the handset structure shown in fig. 12 does not constitute a limitation of the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 12:
The RF circuit 410 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, downlink information of the base station is received and then handed to the processor 480 for processing; in addition, uplink data is sent to the base station. In general, the RF circuit 410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 410 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 420 may be used to store software programs and modules, and the processor 480 performs various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 420. The memory 420 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. In addition, the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The input unit 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 430 may include a touch panel 431 and other input devices 432. The touch panel 431, also referred to as a touch screen, may collect touch operations by the user on or near it (e.g., operations performed by the user on or near the touch panel 431 with any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a predetermined program. Optionally, the touch panel 431 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 480, and it can also receive commands from the processor 480 and execute them. In addition, the touch panel 431 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 431, the input unit 430 may also include other input devices 432. Specifically, the other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and switch keys), a mouse, a joystick, and the like.
The display unit 440 may be used to display information input by a user or information provided to the user as well as various menus of the mobile phone. The display unit 440 may include a display panel 441, and optionally, the display panel 441 may be configured in the form of a Liquid Crystal Display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 431 may cover the display panel 441, and when the touch panel 431 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 480 to determine the type of the touch event, and then the processor 480 provides a corresponding visual output on the display panel 441 according to the type of the touch event. Although in fig. 12, the touch panel 431 and the display panel 441 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 431 and the display panel 441 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 441 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 441 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the attitude of the mobile phone (such as horizontal/vertical screen switching, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer and tapping). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may also be configured in the mobile phone, are not described in detail herein.
The audio circuit 460, the speaker 461, and the microphone 462 can provide an audio interface between the user and the handset. The audio circuit 460 may transmit the electrical signal converted from the received audio data to the speaker 461, where it is converted into a sound signal and output; on the other hand, the microphone 462 converts the collected sound signal into an electrical signal, which is received by the audio circuit 460 and converted into audio data; the audio data is processed by the processor 480 and then sent via the RF circuit 410 to, for example, another mobile phone, or output to the memory 420 for further processing.
WiFi belongs to a short-distance wireless transmission technology. Through the WiFi module 470, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 12 shows the WiFi module 470, it is understood that it is not an essential component of the mobile phone and can be omitted as required without changing the essence of the invention.
Processor 480 is the control center of the handset, connects the various parts of the entire handset using various interfaces and lines, and performs various functions and processes of the handset by running or executing software programs and/or modules stored in memory 420, and invoking data stored in memory 420. Optionally, the processor 480 may include one or more processing units; alternatively, the processor 480 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 480.
The handset further includes a power supply 490 (e.g., a battery) for powering the various components, optionally in logical communication with the processor 480 through a power management system that performs functions such as managing charge, discharge, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
The steps performed by the terminal in the above embodiments may be based on the terminal structure shown in fig. 12.
The embodiment of the application also provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the method described in each embodiment when executing the computer program.
Embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the methods described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the methods described in the foregoing embodiments.
It will be appreciated that in the specific embodiments of the present application, related data such as user voice data is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a server, a terminal, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media in which a computer program can be stored.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

1. An audio data processing method, applied to a first terminal, comprising:
Acquiring first voice data from a first object;
Acquiring scene association information of a first virtual object for a virtual scene in response to the first voice data, wherein the first virtual object is a virtual object controlled by the first object, and the scene association information comprises at least one of position information, environment information, emotion information and role information of the first virtual object in the virtual scene;
Acquiring first adjustment data for the first voice data according to the scene association information, wherein the first adjustment data comprises at least one of audio attribute parameters and accompanying audio data, the audio attribute parameters are used for playing effects on the voice data, and the accompanying audio data are used for providing at least one of background music and background sound effects;
processing the first voice data by adopting the first adjustment data to obtain first audio data;
and playing the first audio data.
2. The method of claim 1, wherein the acquiring the first voice data derived from the first object comprises:
Collecting original voice data from the first object through an audio input device;
Performing voice preprocessing on the original voice data to obtain first processed voice data, wherein the voice preprocessing comprises at least one of noise reduction processing, echo cancellation processing, volume balancing processing and howling suppression processing;
And performing audio frame conversion processing on the first processed voice data to obtain the first voice data, wherein the audio frame conversion processing comprises at least one of sampling rate conversion processing, sound channel conversion processing, sampling depth conversion processing and audio frame length conversion processing.
3. The method of claim 2, wherein performing audio frame conversion processing on the first processed voice data to obtain the first voice data comprises:
performing audio frame conversion processing on the first processed voice data to obtain a first audio frame sequence, wherein the first audio frame sequence comprises at least one first audio frame;
sampling the first audio frame sequence to obtain a first sampling point data set of each first audio frame, wherein the first sampling point data set comprises M sampling point data, and M is an integer greater than 1;
Writing the first set of sample point data for each first audio frame into a circular buffer;
Reading at least one second sampling point data set from the ring buffer, wherein each second sampling point data set comprises N sampling point data, and N is an integer greater than 1;
Generating the first voice data according to the at least one second sampling point data set, wherein the first voice data comprises at least one second audio frame, and each second audio frame corresponds to one second sampling point data set.
4. The method of claim 1, wherein the scene relating information includes the location information, and the first adjustment data includes the audio attribute parameter;
The obtaining, according to the scene association information, first adjustment data for the first voice data includes:
Determining the space type of the first virtual object in the virtual scene according to the position information;
If the space type belongs to a first space type, determining that the audio attribute parameters for the first voice data comprise reverberation adjustment parameters, wherein the first space type represents a virtual space less than or equal to a space size threshold in the virtual scene;
If the space type belongs to a second space type, determining that the audio attribute parameters for the first voice data comprise echo adjustment parameters, wherein the second space type represents a virtual space in the virtual scene that is greater than the space size threshold.
5. The method of claim 1, wherein the scene relating information includes the environmental information, and the first adjustment data includes the companion audio data;
The obtaining, according to the scene association information, first adjustment data for the first voice data includes:
If the environment information indicates that the first virtual object is in an indoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the indoor environment;
If the environment information indicates that the first virtual object is in an outdoor environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the outdoor environment;
If the environment information indicates that the first virtual object is in a weather environment, determining that the accompanying audio data for the first voice data comprises audio data recorded based on the weather environment.
6. The method of claim 1, wherein the scene relating information includes the mood information, and the first adjustment data includes the audio attribute parameter;
The obtaining, according to the scene association information, first adjustment data for the first voice data includes:
If the emotion information indicates that the first virtual object is in a first emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a time domain adjustment parameter, a first frequency domain adjustment parameter and an up-tone adjustment parameter, wherein the time domain adjustment parameter is used for enhancing the onset (sound head) of the voice, and the first frequency domain adjustment parameter is used for enhancing a voice high-frequency component;
If the emotion information indicates that the first virtual object is in a second emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of a speech adjustment parameter and a period adjustment parameter, wherein the speech adjustment parameter is used for slowing down the speech rate and the period adjustment parameter is used for periodically changing the intonation;
If the emotion information indicates that the first virtual object is in a third emotion state, determining that the audio attribute parameters for the first voice data comprise at least one of second frequency domain adjustment parameters and tone reduction adjustment parameters, wherein the second frequency domain adjustment parameters are used for enhancing low-frequency components of the voice data.
7. The method of claim 1, wherein the scene relating information includes the character information, and the first adjustment data includes the audio attribute parameter;
The obtaining, according to the scene association information, first adjustment data for the first voice data includes:
if the character information indicates that the first virtual object belongs to a target character type, determining that the audio attribute parameters for the first voice data comprise a pitch adjustment parameter, a timbre adjustment parameter and a formant adjustment parameter.
8. The method of claim 1, wherein the first adjustment data includes the audio attribute parameters and the companion audio data;
The processing the first voice data by using the first adjustment data to obtain first audio data includes:
Adjusting the first voice data by adopting the audio attribute parameters to obtain target voice data;
Performing voice post-processing on the target voice data to obtain second processed voice data, wherein the voice post-processing comprises at least one of voice enhancement processing, frequency band gain processing and tuning processing;
And superposing the accompanying audio data and the second processed voice data to obtain the first audio data.
9. The method of claim 1, wherein said enabling playback of said first audio data comprises:
playing the first audio data through an audio output device in response to an audio playing operation;
Or alternatively
The implementation of playing the first audio data includes:
and sending the first audio data to a second terminal so that the second terminal plays the first audio data through an audio output device.
10. The method of claim 9, wherein the transmitting the first audio data to the second terminal comprises:
Acquiring a voice scene type;
If the voice scene type belongs to a first scene type, sending the first audio data to a second terminal, wherein the first audio data is mono audio data, and the first audio data adopts a first sampling rate;
And if the voice scene type belongs to a second scene type, sending the first audio data to a second terminal, wherein the first audio data are audio data of a stereo channel, the first audio data adopt a second sampling rate, and the second sampling rate is higher than the first sampling rate.
11. The method of claim 9, wherein the transmitting the first audio data to the second terminal comprises:
Framing the first audio data to obtain a second audio frame sequence, wherein the second audio frame sequence comprises at least one third audio frame;
Sampling the second audio frame sequence to obtain a third sampling point data set of each third audio frame, wherein the third sampling point data set comprises P sampling point data, and P is an integer greater than 1;
Writing the third sample point data set of each third audio frame into a ring buffer;
reading at least one fourth sampling point data set from the ring buffer, wherein each fourth sampling point data set comprises Q sampling point data, and Q is an integer greater than 1;
And performing audio frame conversion processing on the at least one fourth sampling point data set, and sending the processed first audio data to the second terminal, wherein the first audio data comprises at least one fourth audio frame, and each fourth audio frame corresponds to one fourth sampling point data set.
12. The method according to claim 1, wherein the method further comprises:
receiving second audio data sent by a second terminal, wherein the second audio data is obtained by processing second voice data by the second terminal through second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object aiming at the virtual scene, and the second virtual object is a virtual object controlled by the second object;
And playing the second audio data.
13. The method of claim 12, wherein the playing the second audio data comprises:
Framing the second audio data to obtain a third audio frame sequence, wherein the third audio frame sequence comprises at least one fifth audio frame;
Sampling the third audio frame sequence to obtain a fifth sampling point data set of each fifth audio frame, wherein the fifth sampling point data set comprises X sampling point data, and X is an integer greater than 1;
Writing said fifth set of sample point data for said each fifth audio frame into a circular buffer;
Reading at least one sixth sampling point data set from the ring buffer, wherein each sixth sampling point data set comprises Y sampling point data, and Y is an integer greater than 1;
And performing audio frame conversion processing on the at least one sixth sampling point data set, and playing the processed second audio data, wherein the second audio data comprises at least one sixth audio frame, and each sixth audio frame corresponds to one sixth sampling point data set.
14. The method according to claim 1, wherein the method further comprises:
receiving second audio data sent by a second terminal, wherein the second audio data is obtained by processing second voice data by the second terminal through second adjustment data, the second voice data is derived from a second object, the second adjustment data is obtained according to scene association information of a second virtual object aiming at the virtual scene, and the second virtual object is a virtual object controlled by the second object;
Responding to the second audio data, and acquiring object association information according to the position information of the first virtual object and the second virtual object in the virtual scene;
Acquiring third adjustment data for the second audio data according to the object association information;
processing the second audio data by adopting the third adjustment data to obtain third audio data;
and playing the third audio data.
15. The method of claim 14, wherein the playing the third audio data comprises:
framing the third audio data to obtain a fourth audio frame sequence, wherein the fourth audio frame sequence comprises at least one seventh audio frame;
Sampling the fourth audio frame sequence to obtain a seventh sampling point data set of each seventh audio frame, wherein the seventh sampling point data set comprises S sampling point data, and S is an integer greater than 1;
Writing said seventh set of sample point data for said each seventh audio frame into a circular buffer;
Reading at least one eighth sampling point data set from the ring buffer, wherein each eighth sampling point data set comprises R sampling point data, and R is an integer greater than 1;
And performing audio frame conversion processing on the at least one eighth sampling point data set, and playing the processed third audio data, wherein the third audio data comprises at least one eighth audio frame, and each eighth audio frame corresponds to one eighth sampling point data set.
16. The method according to any one of claims 12 to 15, wherein the receiving the second audio data transmitted by the second terminal comprises:
And receiving the second audio data sent by the second terminal based on a data transmission strategy, wherein the data transmission strategy comprises at least one of a forward error correction strategy, a packet loss compensation strategy, an automatic retransmission strategy and a buffer anti-jitter strategy, the forward error correction strategy is used for correcting error codes occurring in a data transmission process, the packet loss compensation strategy is used for compensating missing audio frames, the automatic retransmission strategy is used for requesting retransmission of the audio frames, and the buffer anti-jitter strategy is used for adopting dynamic buffer in the data transmission process.
17. An audio data processing device, characterized by being applied to a first terminal, comprising:
an acquisition module for acquiring first voice data derived from a first object;
The obtaining module is further configured to obtain scene association information of a first virtual object for a virtual scene in response to the first voice data, where the first virtual object is a virtual object controlled by the first object, and the scene association information includes at least one of location information, environment information, emotion information, and role information of the first virtual object in the virtual scene;
The acquisition module is further configured to acquire first adjustment data for the first voice data according to the scene association information, where the first adjustment data includes at least one of an audio attribute parameter and accompanying audio data, the audio attribute parameter is used for playing effects on the voice data, and the accompanying audio data is used for providing at least one of background music and background sound effects;
The processing module is used for processing the first voice data by adopting the first adjustment data to obtain first audio data;
and the playing module is used for playing the first audio data.
18. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 16 when the computer program is executed.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 16.
20. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 16.
CN202211530099.7A 2022-11-30 2022-11-30 Audio data processing method, related device, equipment and storage medium Pending CN118113249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211530099.7A CN118113249A (en) 2022-11-30 2022-11-30 Audio data processing method, related device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211530099.7A CN118113249A (en) 2022-11-30 2022-11-30 Audio data processing method, related device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118113249A true CN118113249A (en) 2024-05-31

Family

ID=91209420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211530099.7A Pending CN118113249A (en) 2022-11-30 2022-11-30 Audio data processing method, related device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118113249A (en)

Similar Documents

Publication Publication Date Title
US10484811B1 (en) Methods and systems for providing a composite audio stream for an extended reality world
CN104321812B (en) Three dimensional sound compression during calling and air-launched
KR20180056752A (en) Adaptive Noise Suppression for UWB Music
US11688385B2 (en) Encoding reverberator parameters from virtual or physical scene geometry and desired reverberation characteristics and rendering using these
JP5085556B2 (en) Configure echo cancellation
US9686627B2 (en) Multidimensional virtual learning system and method
KR101633208B1 (en) Instant communication voice recognition method and terminal
WO2015085959A1 (en) Voice processing method and device
WO2019033438A1 (en) Audio signal adjustment method and device, storage medium, and terminal
WO2019033987A1 (en) Prompting method and apparatus, storage medium, and terminal
CN109671429B (en) Voice interaction method and device
WO2021244056A1 (en) Data processing method and apparatus, and readable medium
US20230364513A1 (en) Audio processing method and apparatus
CN113241085B (en) Echo cancellation method, device, equipment and readable storage medium
JP2012134923A (en) Apparatus, method and program for processing sound
US11082796B2 (en) Methods and systems for generating audio for an extended reality world
CN114792524A (en) Audio data processing method, apparatus, program product, computer device and medium
KR20230133864A (en) Systems and methods for handling speech audio stream interruptions
CN116709162B (en) Audio processing method and related equipment
US8737648B2 (en) Spatialized audio over headphones
CN118113249A (en) Audio data processing method, related device, equipment and storage medium
CN112489680B (en) Evaluation method and device of acoustic echo cancellation algorithm and terminal equipment
CN111314553A (en) Volume adjusting method, device, terminal and storage medium
CN117118956B (en) Audio processing method, device, electronic equipment and computer readable storage medium
WO2024139730A1 (en) Audio data processing method and apparatus, and device, computer-readable storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication