CN104869004B - Audio data processing method and device - Google Patents


Info

Publication number
CN104869004B
CN104869004B (application CN201510248919.7A)
Authority
CN
China
Prior art keywords
audio data
speaker
decoding
forwarded
speakers
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510248919.7A
Other languages
Chinese (zh)
Other versions
CN104869004A (en)
Inventor
张洪彬
郭启行
姜俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510248919.7A priority Critical patent/CN104869004B/en
Publication of CN104869004A publication Critical patent/CN104869004A/en
Application granted granted Critical
Publication of CN104869004B publication Critical patent/CN104869004B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Radio Relay Systems (AREA)

Abstract

The present invention proposes an audio data processing method and apparatus. The audio data processing method includes: receiving pre-decoding audio data sent by a terminal serving as a sender, and decoding the pre-decoding audio data to obtain decoded audio data; determining the current role of a terminal serving as a receiver, and, if the current role is a speaker, obtaining audio data to be forwarded from the pre-decoding audio data, or, if the current role is a listener, obtaining the audio data to be forwarded from the decoded audio data; and sending the audio data to be forwarded to the terminal serving as the receiver. The method combines the advantages of the Full-transcoding mode and the relay mode, reducing CPU consumption and network traffic consumption as much as possible.

Description

Audio data processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an audio data processing method and apparatus.
Background
In audio conference communication, a first prior art is the full-encoding and full-decoding (Full-transcoding) mode of a server, in which an independent audio decoder and an independent encoder are allocated in the server for each participating terminal; after independent decoding, audio mixing is performed according to the speaking conditions of the conference, and the result is independently encoded before being transmitted to each terminal. A second prior art adopts a relay mode: without decoding, characteristics of the audio data are extracted from a special mark field of the code stream to predict and switch the current speaker in the conference, and the speaker's data is forwarded to all participating terminals, which greatly reduces the CPU resource consumption of the background server.
However, in the first prior art, since encoding consumes substantial CPU resources, configuring an independent encoder for each terminal causes huge CPU overhead. In the second prior art, forwarding audio data without transcoding increases the downlink bandwidth consumption of each terminal, resulting in greater network traffic overhead.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide an audio data processing method, which can take advantages of both Full-transcoding mode and relay mode into consideration, and reduce CPU consumption and network traffic consumption as much as possible.
Another object of the present invention is to provide an audio data processing apparatus.
In order to achieve the above object, an embodiment of the first aspect of the present invention provides an audio data processing method, including: receiving audio data before decoding sent by a terminal serving as a sender, and decoding the audio data before decoding to obtain decoded audio data; determining the current role of the terminal as a receiver, if the current role is a speaker, acquiring audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, acquiring the audio data to be forwarded according to the decoded audio data; and sending the audio data to be forwarded to the terminal as the receiving party.
In the audio data processing method provided in the embodiment of the first aspect of the present invention, by acquiring the audio data before decoding and the audio data after decoding, and acquiring the audio data to be forwarded in different manners according to the difference of the current role, the advantages of the Full-transcoding mode and the relay mode can be taken into consideration, and the CPU consumption and the network traffic consumption are reduced as much as possible.
In order to achieve the above object, an audio data processing apparatus according to a second embodiment of the present invention includes: the receiving and decoding module is used for receiving the audio data before decoding sent by the terminal serving as the sender and decoding the audio data before decoding to obtain the decoded audio data; an obtaining module, configured to determine a current role of a terminal as a receiving party, and if the current role is a speaker, obtain audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, obtain audio data to be forwarded according to the decoded audio data; and the sending module is used for sending the audio data to be forwarded to the terminal serving as the receiving party.
The audio data processing apparatus provided in the embodiment of the second aspect of the present invention may take into account the advantages of the Full-transcoding mode and the relay mode by acquiring the audio data before decoding and the audio data after decoding, and acquiring the audio data to be forwarded in different manners according to the difference of the current role, so as to reduce the CPU consumption and the network traffic consumption as much as possible.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart illustrating an audio data processing method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an audio processing system according to an embodiment of the invention;
FIG. 3 is a flow chart of an audio data processing method according to another embodiment of the invention;
FIG. 4 is a flow chart illustrating one implementation of determining audio data to forward in an embodiment of the present invention;
FIG. 5 is a flowchart illustrating an audio data processing method according to another embodiment of the invention;
FIG. 6 is a schematic structural diagram of an audio data processing apparatus according to another embodiment of the present invention;
fig. 7 is a schematic structural diagram of an audio data processing apparatus according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only, serving to explain the present invention, and are not to be construed as limiting it. On the contrary, the embodiments of the invention include all changes, modifications, and equivalents coming within the spirit and scope of the appended claims.
Fig. 1 is a schematic flow chart of an audio data processing method according to an embodiment of the present invention. The method is executed by a server and includes:
s11: receiving audio data before decoding sent by a terminal serving as a sender, and decoding the audio data before decoding to obtain decoded audio data.
Referring to fig. 2, assume the terminals are denoted A, B, C, D, and E; each terminal can both send and receive audio data. The terminal serving as the sender is denoted 21.
In the server 22, a receiving module (Recv) and a decoding module (Dec) are respectively arranged corresponding to each sender, the receiving module is configured to receive audio data sent by the sender, the audio data is audio data before decoding, and the decoding module is configured to decode the received audio data to obtain decoded audio data. As shown in fig. 2, two types of data, namely, audio data before decoding and audio data after decoding, can be obtained corresponding to each sender through the receiving module and the decoding module, and in fig. 2, the two types of data are respectively represented by different filling manners.
S12: and determining the current role of the terminal as a receiver, and if the current role is a speaker, acquiring the audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, acquiring the audio data to be forwarded according to the decoded audio data.
Referring to fig. 2, within the server, this step may be performed by a forwarding decision module.
The current role may include: a speaker, or a listener.
It will be appreciated that a terminal's role as speaker or listener may change over time.
Like the relay mode, the forwarding decision module may determine the speaker without decoding according to the audio characteristics, and determine the remaining terminals that are not speakers as listeners. The audio characteristic is, for example, an energy value of the audio data.
For example, the forwarding decision module may receive the pre-decoding audio data from A, B, C, D, and E and detect the energy value of each stream; when the energy value of a terminal's pre-decoding audio data is greater than a preset value, that terminal is a speaker, otherwise it is a listener.
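As a sketch, the energy-based role decision described above might look like the following. The threshold, the packet dictionaries, and the `energy` field are all hypothetical stand-ins for the mark field the patent says can be read from the code stream without decoding:

```python
ENERGY_THRESHOLD = 1000  # the "preset value"; chosen arbitrarily for illustration

def classify_roles(packets):
    """Map each terminal id to 'speaker' or 'listener' by packet energy."""
    roles = {}
    for terminal_id, packet in packets.items():
        # 'energy' stands in for the mark field read from the encoded stream
        roles[terminal_id] = (
            "speaker" if packet["energy"] > ENERGY_THRESHOLD else "listener"
        )
    return roles

roles = classify_roles({
    "A": {"energy": 5200}, "B": {"energy": 4100}, "C": {"energy": 3000},
    "D": {"energy": 120},  "E": {"energy": 80},
})
# A, B, and C are classified as speakers; D and E as listeners
```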
According to the difference of the current role, different processing modes can be adopted to obtain the audio data to be forwarded. For example, for a speaker, audio data before decoding of other speakers may be packetized, and for a listener, decoded audio data of the speaker may be mixed and the mixed data may be encoded.
Optionally, referring to fig. 3, the obtaining the audio data to be forwarded according to the current role includes:
s31: if the current role is a speaker, acquiring audio data of other speakers except the speaker before decoding, and packaging the audio data before decoding to determine the audio data to be forwarded;
for example, referring to fig. 2, if a speaker includes: and A, B and C correspond to A, obtain the audio data before decoding of B and C, and determine the audio data before decoding of B and C as the audio data to be forwarded to A after packaging. And correspondingly B, acquiring the audio data before decoding of A and C, and determining the audio data before decoding of A and C as the audio data to be forwarded to B after packaging. And corresponding to the C, acquiring the audio data before decoding of the A and the B, and determining the audio data before decoding of the A and the B as the audio data to be forwarded to the C after packaging.
It will be appreciated that if there is only one speaker, there is no need to forward audio data to the corresponding recipient of that speaker.
S32: and if the current role is a listener, acquiring decoded audio data of a speaker, performing sound mixing processing on the decoded audio data, performing coding processing on the audio data subjected to the sound mixing processing to obtain coded audio data, and determining the coded audio data as audio data to be forwarded.
For example, referring to fig. 2, suppose the speakers are A, B, and C and the listeners are D and E. For D and E, the decoded audio data of A, B, and C is obtained and mixed, the mixed audio data is encoded, and the encoded audio data is sent to D and E respectively.
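A minimal illustration of the mixing step for listeners, assuming (for illustration only) that the decoded streams are sample-aligned 16-bit PCM represented as Python integer lists; a real implementation would operate on the frames produced by the Dec modules:

```python
def mix_pcm(decoded_streams):
    """Sum sample-aligned 16-bit PCM streams, clipping to the valid range."""
    mixed = [0] * len(decoded_streams[0])
    for stream in decoded_streams:
        for i, sample in enumerate(stream):
            mixed[i] = max(-32768, min(32767, mixed[i] + sample))
    return mixed

# The mix is computed once and encoded once, then sent to every listener.
mixed = mix_pcm([[100, -200, 30000], [50, -100, 5000], [25, 300, 4000]])
# mixed == [175, 0, 32767]  (the last sample clips at the 16-bit maximum)
```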
By obtaining the data to be forwarded in different ways according to the role, CPU consumption and network traffic overhead can be reduced as much as possible.
In addition, as shown in fig. 2, when there are multiple listeners, it is not necessary to provide an encoder for each listener; the same encoder may encode the mixed audio data for all of them. For example, referring to fig. 2, the audio data sent to both D and E is encoded by the same encoder (conf Encoder).
By using the same encoder for all listeners, the number of encoders is reduced relative to the Full-transcoding mode; since the encoders account for most of the CPU overhead, this reduces CPU overhead accordingly.
Further, a speaker may not have multi-channel decoding capability. In this case, a processing flow similar to Full-transcoding may be used: the decoded audio data of the other speakers is mixed and encoded to obtain the audio data to be forwarded.
For example, referring to fig. 4, if the current role is a speaker and the number of other speakers is multiple, the specific process of determining the audio data to be forwarded may include:
s41: determining whether the speaker has multi-path decoding capability.
The server may perform signaling interaction in a preset format with the speaker and learn through this interaction whether the speaker has multi-channel decoding capability. For example, the server sends a query signaling to A asking whether A has multi-channel decoding capability; after receiving the signaling, A checks its own capability and feeds back a signaling indicating that it has, or does not have, multi-channel decoding capability. The formats of the query and feedback signaling may be predefined; for example, a field with a specific value identifies the query signaling, and a field set to 1 or 0 indicates the presence or absence of multi-channel decoding capability.
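The query/feedback exchange could be sketched as follows. The message type value and field names are hypothetical, since the patent only requires that the formats be predefined:

```python
QUERY_CAPABILITY = 0x51  # assumed type value identifying the query signaling

def build_query():
    """Server -> terminal: ask whether multi-channel decoding is supported."""
    return {"type": QUERY_CAPABILITY}

def has_multi_decode(feedback):
    """Terminal -> server: a 1/0 field indicates the capability, as in the text."""
    return feedback.get("multi_decode") == 1
```

A terminal answering `{"multi_decode": 1}` would then be served packetized pre-decoding streams; one answering `{"multi_decode": 0}` would fall back to the mixed-and-encoded path.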
S42: when the multi-channel decoding capability is provided, the audio data before decoding of other speakers except the speaker is obtained, and the audio data before decoding is packaged and determined as the audio data to be forwarded.
The details of this step can be found in S31, and are not described herein.
S43: and when the multi-path decoding capability is not available, acquiring decoded audio data of other speakers except the speaker, mixing the decoded audio data, encoding the mixed audio data, and determining the encoded audio data as the audio data to be forwarded.
For example, if A is a speaker but does not have multi-channel decoding capability, the decoded audio data of B and C may be obtained and mixed, the mixed audio data encoded, and the encoded audio data determined as the audio data to be forwarded to A.
By detecting the multi-path decoding capability, the compatibility problem that the terminal with or without the multi-path decoding capability simultaneously accesses the conference can be well solved.
In another embodiment, the number of speakers may also be limited to achieve a better listening impression. Referring to fig. 5, the method may further include:
s51: the number of speakers is detected, and when the number exceeds a preset number, a preset number of speakers is selected from the detected speakers.
Then, when the audio data to be forwarded is obtained according to the current role, the speakers whose pre-decoding or decoded audio data is used are the selected preset number of speakers.
Optionally, the preset number may be 3 or 2.
Taking 3 as an example, if 4 speakers are detected, the 3 speakers whose audio data has the highest energy values may be selected according to the audio characteristics, for example the energy value of each audio stream.
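Selecting the preset number of speakers by energy might be sketched as follows; the energy dictionary is illustrative:

```python
def select_speakers(energies, preset_number=3):
    """Keep at most preset_number speakers, preferring higher energy values."""
    ranked = sorted(energies, key=energies.get, reverse=True)
    return set(ranked[:preset_number])

# Four speakers detected; the three with the highest energy are kept.
chosen = select_speakers({"A": 5200, "B": 4100, "F": 2500, "C": 3000})
# chosen == {"A", "B", "C"}
```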
S13: and sending the audio data to be forwarded to the terminal as the receiving party.
This step may be performed by a sending module within the server. Referring to fig. 2, a terminal as a receiving side is denoted by 23.
Specifically, when the speakers are A, B, and C, and the receivers all have multi-channel decoding capability, the pre-decoding audio data of B and C is packetized (B + C packet) and sent to A, the pre-decoding audio data of A and C is packetized (A + C packet) and sent to B, and the pre-decoding audio data of A and B is packetized (A + B packet) and sent to C.
For listeners D and E, the decoded audio data of A, B, and C is mixed to obtain mixed audio data (A + B + C PCM), the mixed data is encoded by the conf Encoder to obtain encoded audio data, and the encoded audio data is sent to D and E.
It should be noted that although multiple decoders (Decoder) are shown for receivers D and E in fig. 2, D and E each need only one decoder when listening; multiple decoders are used only when D and E are speakers.
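Putting S12 and the capability check of fig. 4 together, the per-receiver forwarding decision can be sketched as below. `encode_mix` stands in for the mixer plus conf Encoder, and all data values are placeholders, not part of the patent:

```python
def payload_for(receiver, roles, multi_decode, pre, encode_mix):
    """Decide what to forward to one receiving terminal (sketch of figs. 3-4)."""
    speakers = [t for t, r in roles.items() if r == "speaker"]
    if roles[receiver] == "listener":
        return encode_mix(speakers)              # one shared encoded mix
    others = [t for t in speakers if t != receiver]
    if not others:
        return None                              # sole speaker: nothing to forward
    if multi_decode.get(receiver, False):
        return b"".join(pre[t] for t in others)  # packetize pre-decoding streams
    return encode_mix(others)                    # Full-transcoding-style fallback

roles = {"A": "speaker", "B": "speaker", "C": "speaker",
         "D": "listener", "E": "listener"}
pre = {t: t.encode() for t in "ABC"}             # dummy pre-decoding payloads

def encode_mix(terminals):
    # placeholder for: mix decoded PCM of these terminals, then encode once
    return ("mix:" + "+".join(sorted(terminals))).encode()

payload_for("A", roles, {"A": True}, pre, encode_mix)  # b"BC" (B + C packet)
payload_for("D", roles, {}, pre, encode_mix)           # b"mix:A+B+C"
```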
In this embodiment, by obtaining both the pre-decoding audio data and the decoded audio data, and obtaining the audio data to be forwarded in different ways according to the current role, the advantages of the Full-transcoding mode and the relay mode are combined, and CPU consumption and network traffic consumption are reduced as much as possible. Specifically, whereas the prior art requires an encoder for each terminal, this embodiment may use only one encoder; since the encoder accounts for a large share of the CPU overhead, this embodiment reduces the CPU cost of the server in an operational-grade audio communication system and significantly improves the concurrent processing capability of a single CPU. Because most terminals are in the listener state, encoding the mixed audio for the listeners instead of directly forwarding multiple audio streams reduces the network traffic of most conference terminals. A balance between CPU consumption and network traffic is thus obtained through server-side optimization: the I/O network traffic between the server and the terminals (clients) is reduced while CPU consumption is effectively reduced.
Fig. 6 is a schematic structural diagram of an audio data processing apparatus according to another embodiment of the present invention, where the apparatus may be referred to as a server, and the apparatus 60 includes:
a receiving and decoding module 61, configured to receive audio data before decoding sent by a terminal serving as a sender, and decode the audio data before decoding to obtain decoded audio data;
referring to fig. 2, assuming that the terminals are denoted by a, B, C, D, and E, respectively, the terminals can transmit audio data and can also receive audio data, wherein the terminals are denoted by terminals 21 as the sender.
In the server 22, a receiving module (Recv) and a decoding module (Dec) are respectively arranged corresponding to each sender, the receiving module is configured to receive audio data sent by the sender, the audio data is audio data before decoding, and the decoding module is configured to decode the received audio data to obtain decoded audio data. As shown in fig. 2, two types of data, namely, audio data before decoding and audio data after decoding, can be obtained corresponding to each sender through the receiving module and the decoding module, and in fig. 2, the two types of data are respectively represented by different filling manners.
An obtaining module 62, configured to determine a current role of the terminal as a receiving party, and if the current role is a speaker, obtain audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, obtain audio data to be forwarded according to the decoded audio data;
the current role may include: a speaker, or a listener.
It will be appreciated that a terminal's role as speaker or listener may change over time.
According to the difference of the current role, different processing modes can be adopted to obtain the audio data to be forwarded. For example, for a speaker, audio data before decoding of other speakers may be packetized, and for a listener, decoded audio data of the speaker may be mixed and the mixed data may be encoded.
Optionally, referring to fig. 7, the obtaining module 62 includes:
a first unit 621 configured to determine a current role of a terminal as a receiving side;
the speaker may be determined without decoding based on the audio characteristics and the remaining terminals that are not speakers may be determined to be listeners. The audio characteristic is, for example, an energy value of the audio data.
For example, the forwarding decision module may receive the pre-decoding audio data from A, B, C, D, and E and detect the energy value of each stream; when the energy value of a terminal's pre-decoding audio data is greater than a preset value, that terminal is a speaker, otherwise it is a listener.
A second unit 622, configured to, if the current role is a speaker, obtain the pre-decoding audio data of the other speakers, packetize it, and determine it as the audio data to be forwarded; or, if the current role is a listener, obtain the decoded audio data of the speakers, mix it, encode the mixed audio data to obtain encoded audio data, and determine the encoded audio data as the audio data to be forwarded.
For example, referring to fig. 2, suppose the speakers are A, B, and C. For A, the pre-decoding audio data of B and C is obtained, packetized, and determined as the audio data to be forwarded to A. For B, the pre-decoding audio data of A and C is obtained, packetized, and determined as the audio data to be forwarded to B. For C, the pre-decoding audio data of A and B is obtained, packetized, and determined as the audio data to be forwarded to C.
It will be appreciated that if there is only one speaker, there is no need to forward audio data to the corresponding recipient of that speaker.
For example, referring to fig. 2, suppose the speakers are A, B, and C and the listeners are D and E. For D and E, the decoded audio data of A, B, and C is obtained and mixed, the mixed audio data is encoded, and the encoded audio data is sent to D and E respectively.
By obtaining the data to be forwarded in different ways according to the role, CPU consumption and network traffic overhead can be reduced as much as possible.
Optionally, when there are multiple listeners, the second unit 622 is configured to perform encoding processing on the audio data after the mixing processing, and includes:
and adopting the same encoder to encode the audio data after the audio mixing processing.
In addition, as shown in fig. 2, when there are multiple listeners, it is not necessary to provide an encoder for each listener; the same encoder may encode the mixed audio data for all of them. For example, referring to fig. 2, the audio data sent to both D and E is encoded by the same encoder (conf Encoder).
By using the same encoder for all listeners, the number of encoders is reduced relative to the Full-transcoding mode; since the encoders account for most of the CPU overhead, this reduces CPU overhead accordingly.
In another embodiment, referring to fig. 7, the apparatus 60 further comprises:
a selecting module 64, configured to detect the number of speakers, and select a preset number of speakers from the detected speakers when the number exceeds a preset number.
Then, when the audio data to be forwarded is obtained according to the current role, the speakers whose pre-decoding or decoded audio data is used are the selected preset number of speakers.
Optionally, the preset number may be 3 or 2.
Taking 3 as an example, if 4 speakers are detected, the 3 speakers whose audio data has the highest energy values may be selected according to the audio characteristics, for example the energy value of each audio stream.
By limiting the number of speakers, a more excellent listening effect can be obtained.
Optionally, referring to fig. 7, if the current role is a speaker and the number of other speakers is multiple, the obtaining module 62 further includes:
a third unit 623 for determining whether the speaker has multi-path decoding capability;
the third unit can perform signaling interaction with a preset format with the speaker, and acquire whether the speaker has multi-channel decoding capability through the signaling interaction. For example, the third unit sends an inquiry signaling to a, where the inquiry signaling is used to inquire whether a has multi-channel decoding capability, a obtains its own capability after receiving the signaling, and if so, feeds back a feedback signaling with multi-channel decoding capability to the server, otherwise, feeds back a feedback signaling without multi-channel decoding capability. The format of the specific inquiry signaling and feedback signaling may be predefined, for example, a field with a specific value indicates the inquiry signaling, and a field with 1 or 0 indicates the capability of multi-path decoding or not.
The second unit 622 is specifically configured to, when the third unit determines that the speaker has multi-channel decoding capability, obtain the pre-decoding audio data of the other speakers;
see S31 for details, which are not described herein.
A fourth unit 624, configured to, when the speaker does not have multi-channel decoding capability, obtain the decoded audio data of the other speakers, mix it, encode the mixed audio data, and determine the encoded audio data as the audio data to be forwarded.
For example, if A is a speaker but does not have multi-channel decoding capability, the decoded audio data of B and C may be obtained and mixed, the mixed audio data encoded, and the encoded audio data determined as the audio data to be forwarded to A.
By detecting the multi-path decoding capability, the compatibility problem that the terminal with or without the multi-path decoding capability simultaneously accesses the conference can be well solved.
A sending module 63, configured to send the audio data to be forwarded to the terminal serving as the receiving party.
When the speakers are A, B and C and the receivers have multi-path decoding capability, the audio data before decoding of B and the audio data before decoding of C are packed (B + C packet) and then sent to A, the audio data before decoding of A and the audio data before decoding of C are packed (A + C packet) and then sent to B, and the audio data before decoding of A and the audio data before decoding of B are packed (A + B packet) and then sent to C.
For listeners D and E, the decoded audio data of A, B, and C is mixed to obtain mixed audio data (A + B + C PCM), the mixed data is encoded by the conf Encoder to obtain encoded audio data, and the encoded audio data is sent to D and E.
It should be noted that although multiple decoders (Decoder) are shown for receivers D and E in fig. 2, D and E each need only one decoder when listening; multiple decoders are used only when D and E are speakers.
In this embodiment, by obtaining both the pre-decoding audio data and the decoded audio data, and obtaining the audio data to be forwarded in different ways according to the current role, the advantages of the Full-transcoding mode and the relay mode are combined, and CPU consumption and network traffic consumption are reduced as much as possible. Specifically, whereas the prior art requires an encoder for each terminal, this embodiment may use only one encoder; since the encoder accounts for a large share of the CPU overhead, this embodiment reduces the CPU cost of the server in an operational-grade audio communication system and significantly improves the concurrent processing capability of a single CPU. Because most terminals are in the listener state, encoding the mixed audio for the listeners instead of directly forwarding multiple audio streams reduces the network traffic of most conference terminals. A balance between CPU consumption and network traffic is thus obtained through server-side optimization: the I/O network traffic between the server and the terminals (clients) is reduced while CPU consumption is effectively reduced.
It should be noted that the terms "first," "second," and the like in the description of the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present invention, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps of the process. Alternate implementations are included within the scope of the preferred embodiments of the present invention, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A method of audio data processing, comprising:
receiving audio data before decoding sent by a terminal serving as a sender, and decoding the audio data before decoding to obtain decoded audio data;
determining the current role of the terminal as a receiver, if the current role is a speaker, acquiring audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, acquiring the audio data to be forwarded according to the decoded audio data;
sending the audio data to be forwarded to the terminal as a receiving party;
wherein the determining of the current role of the terminal as the receiving side includes:
detecting an energy value of each piece of audio data before decoding; if the energy value of the audio data before decoding from the terminal serving as the receiver is greater than a preset value, determining that the terminal serving as the receiver is a speaker, and otherwise determining that it is a listener.
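A minimal sketch of the energy-threshold test in this claim. The patent determines energy from the audio data before decoding (a cited reference estimates energy from codec parameters of a partially decoded stream); for simplicity this sketch computes energy on 16-bit PCM samples instead, and the threshold value, frame layout, and function names are all illustrative assumptions.

```python
import struct

def frame_energy(pcm_bytes):
    """Average squared amplitude of a 16-bit little-endian PCM frame."""
    samples = struct.unpack("<%dh" % (len(pcm_bytes) // 2), pcm_bytes)
    return sum(s * s for s in samples) / max(len(samples), 1)

def current_role(pcm_bytes, preset_value=1000.0):
    """Speaker if the frame energy exceeds the preset value, else listener."""
    return "speaker" if frame_energy(pcm_bytes) > preset_value else "listener"
```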
2. The method of claim 1, wherein obtaining audio data to be forwarded according to the current role comprises:
if the current role is a speaker, acquiring the audio data before decoding of speakers other than the receiver itself, and packaging the audio data before decoding to determine the audio data to be forwarded; or,
and if the current role is a listener, acquiring decoded audio data of a speaker, performing sound mixing processing on the decoded audio data, performing coding processing on the audio data subjected to the sound mixing processing to obtain coded audio data, and determining the coded audio data as audio data to be forwarded.
3. The method of claim 1, wherein before the obtaining audio data to be forwarded according to the current role, the method further comprises:
detecting the number of speakers, and when the number exceeds a preset number, selecting a preset number of speakers from the detected speakers so as to perform processing according to the selected speakers.
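The claim above caps the number of active speakers but does not say how the preset number of speakers is chosen; selecting those with the highest energy values is one plausible policy and is an assumption here, as are the names in this sketch.

```python
def select_speakers(energies, preset_number=3):
    """Keep at most `preset_number` speakers, preferring the highest energies.

    `energies` maps a terminal id to its current energy value; when the
    number of detected speakers does not exceed the cap, all are kept.
    """
    if len(energies) <= preset_number:
        return set(energies)
    ranked = sorted(energies, key=energies.get, reverse=True)
    return set(ranked[:preset_number])
```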
4. The method of claim 2, wherein the encoding process of the mixed audio data when there are a plurality of listeners comprises:
and adopting the same encoder to encode the audio data after the audio mixing processing.
5. The method according to claim 2, wherein if the current role is a speaker and the number of other speakers is plural, before the acquiring the audio data before decoding of the other speakers except for itself, the method further comprises:
determining whether the speaker has a multi-channel decoding capability, so as to acquire the audio data before decoding of speakers other than the speaker itself when the speaker has the multi-channel decoding capability; or,
when the speaker does not have the multi-channel decoding capability, acquiring the decoded audio data of speakers other than the speaker itself, mixing the decoded audio data, encoding the mixed audio data, and determining the encoded audio data as the audio data to be forwarded.
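The capability fallback of claim 5 can be sketched as follows; the function names, the `others` mapping, and the pluggable `mix`/`encode` callables are illustrative assumptions.

```python
def payload_for_speaker(receiver, others, can_multi_decode, mixer, encode):
    """Relay packed streams only if the receiving speaker can decode
    several streams at once; otherwise fall back to server-side mixing.

    `others` maps a speaker id to (encoded_bytes, decoded_pcm).
    """
    if can_multi_decode(receiver):
        # relay path: forward each other speaker's pre-decoding data
        return [enc for spk, (enc, _) in others.items() if spk != receiver]
    # fallback: mix the decoded data of the other speakers, encode once
    pcms = [pcm for spk, (_, pcm) in others.items() if spk != receiver]
    return encode(mixer(pcms))
```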
6. An audio data processing apparatus, comprising:
the receiving and decoding module is used for receiving the audio data before decoding sent by the terminal serving as the sender and decoding the audio data before decoding to obtain the decoded audio data;
an obtaining module, configured to determine a current role of a terminal as a receiving party, and if the current role is a speaker, obtain audio data to be forwarded according to the audio data before decoding, or if the current role is a listener, obtain audio data to be forwarded according to the decoded audio data;
a sending module, configured to send the audio data to be forwarded to the terminal serving as the receiving party;
wherein the obtaining module is specifically configured to detect an energy value of each piece of audio data before decoding, determine that the terminal serving as the receiver is a speaker when the energy value of the audio data before decoding from the terminal serving as the receiver is greater than a preset value, and otherwise determine that the terminal is a listener.
7. The apparatus of claim 6, wherein the obtaining module comprises:
a first unit for determining a current role of a terminal as a receiving party;
a second unit, configured to, if the current role is a speaker, obtain audio data before decoding of other speakers except the current role, and determine the audio data before decoding as audio data to be forwarded after packing the audio data before decoding; or, if the current role is a listener, acquiring decoded audio data of a speaker, performing audio mixing processing on the decoded audio data, performing encoding processing on the audio data subjected to the audio mixing processing to obtain encoded audio data, and determining the encoded audio data as audio data to be forwarded.
8. The apparatus of claim 6, further comprising:
a selection module, configured to detect the number of speakers and, when the number exceeds a preset number, select the preset number of speakers from the detected speakers.
9. The apparatus of claim 7, wherein when there are multiple listeners, the second unit is configured to perform encoding processing on the mixed audio data, and includes:
and adopting the same encoder to encode the audio data after the audio mixing processing.
10. The apparatus of claim 7, wherein if the current role is speaker and the number of other speakers is multiple, the obtaining module further comprises:
a third unit for determining whether the speaker has a multi-channel decoding capability;
the second unit is specifically configured to, when the third unit determines that the speaker has the multi-channel decoding capability, acquire the audio data before decoding of speakers other than the speaker itself;
and a fourth unit, configured to, when the speaker does not have the multi-channel decoding capability, acquire the decoded audio data of speakers other than the speaker itself, mix the decoded audio data, encode the mixed audio data, and determine the encoded audio data as the audio data to be forwarded.
CN201510248919.7A 2015-05-15 2015-05-15 Audio data processing method and device Active CN104869004B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510248919.7A CN104869004B (en) 2015-05-15 2015-05-15 Audio data processing method and device


Publications (2)

Publication Number Publication Date
CN104869004A CN104869004A (en) 2015-08-26
CN104869004B true CN104869004B (en) 2017-07-25

Family

ID=53914546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510248919.7A Active CN104869004B (en) 2015-05-15 2015-05-15 Audio data processing method and device

Country Status (1)

Country Link
CN (1) CN104869004B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112055166B (en) * 2020-09-17 2022-05-20 杭州海康威视数字技术股份有限公司 Audio data processing method, device, conference system and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101502089A (en) * 2006-07-28 2009-08-05 西门子公司 Method for carrying out an audio conference, audio conference device, and method for switching between encoders
CN102883133A (en) * 2012-10-17 2013-01-16 西安融科通信技术有限公司 System and method for realizing large-capacity conference service based on single server
CN104486518A (en) * 2014-12-03 2015-04-01 中国电子科技集团公司第三十研究所 Distributed voice mixing method for teleconference under bandwidth-constrained network environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7177432B2 (en) * 2001-05-07 2007-02-13 Harman International Industries, Incorporated Sound processing system with degraded signal optimization
US9208796B2 (en) * 2011-08-22 2015-12-08 Genband Us Llc Estimation of speech energy based on code excited linear prediction (CELP) parameters extracted from a partially-decoded CELP-encoded bit stream and applications of same


Also Published As

Publication number Publication date
CN104869004A (en) 2015-08-26


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant