CN102324235A

CN102324235A - Sound mixing encoding method, device and system

Info

Publication number: CN102324235A
Application number: CN201110205093A
Authority: CN
Inventors: 张清; 苗磊; 李伟; 许剑峰; 许丽净; 杜正中; 胡晨; 杨毅; 齐峰岩
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2007-10-19
Filing date: 2007-10-19
Publication date: 2012-01-18

Abstract

The invention discloses a terminal side encoding method comprising the following steps of: setting a sound mixing identifier for sound information according to a sound mixing strategy, and encoding the sound information according to the sound mixing identifier information to get core encoded data; if the sound mixing identifier information indicates that sound mixing is needed, calculating dynamic side information, and generating and outputting an audio encoded code stream containing the sound mixing identifier, the core encoded data and the dynamic side information; and if the sound mixing identifier indicates that sound mixing is not needed, generating and outputting the audio encoded code stream containing the sound mixing identifier and the core encoded data by the terminal. The invention further discloses a corresponding network side sound mixing encoding method, and a device and a system for sound mixing encoding. According to the scheme of the invention, the problem of signal overflow and introduced error in a sound mixing process can be solved, and the encoding efficiency cannot be decreased.

Description

Audio mixing encoding method, device and system

Technical field

The present invention relates to the field of multimedia communication technology, and in particular to an audio mixing encoding method, device and system.

Background technology

At present, real-time multimedia communication services are used more and more widely. To meet growing service demand, the technologies related to multimedia conference systems and the like have become very important.

In a multimedia conference, audio interaction is the most basic element. In a centralized conference, each terminal establishes a unicast-based connection with a multipoint control unit (Multi-point Controlling Unit, MCU), sends its audio code stream to the MCU in real time, and receives an audio code stream from the MCU. The input of the MCU is therefore a set of audio code streams encoded with various coding schemes, and its output is the audio code stream obtained after mixing according to a mixing strategy.

Fig. 1 is a schematic diagram of a multimedia conference system, in which the dashed box can be regarded as an MCU. The audio code streams input from terminal 1, terminal 2 and so on are decoded separately; the decoded audio signals are mixed by the mixing unit, the mixed audio is encoded separately for each terminal, and the result is output to the corresponding terminal. In the multimedia conference system shown in Fig. 1, M terminals participate in mixing. At a given time t, each terminal sends audio data to the MCU; the MCU first decodes the audio data, calculates mixing parameters for each signal, and finally mixes the decoded signals. The most common mixing algorithm sums the decoded data of all channels, encodes the sum with an encoder, and sends the result to each terminal.

This time-domain superposition mixing scheme often introduces noise. The audio signal that each terminal sends to the MCU has a limited range [min, max], where min is the lower limit of the range and max is the upper limit. When the signals of all channels are summed directly, the result may fall outside [min, max]. Because a digital audio signal has upper and lower quantization limits, the superposition is likely to overflow. The usual remedy is overflow detection followed by saturation: a result above the upper limit is set to the upper limit, and a result below the lower limit is set to the lower limit. This operation destroys the original time-domain characteristics of the speech signal and thereby introduces noise, which is why popping sounds and discontinuous speech occur in some systems.

As the number of terminals participating in mixing increases, overflow occurs more and more often, so this kind of time-domain superposition mixing scheme has an upper limit on the number of terminals, and that limit is very low. Experiments show that in many cases, when 4 terminals participate in mixing, the result already contains so much noise and interruption that the speech can no longer be made out.
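As an illustrative, non-limiting sketch of the overflow problem described above, the following Python fragment sums decoded channels sample by sample and applies saturation. The 16-bit range and the function name `mix_saturating` are assumptions made for illustration; the text only specifies a generic range [min, max].

```python
# Assumed quantization limits: signed 16-bit PCM (the text only says [min, max]).
INT16_MIN, INT16_MAX = -32768, 32767

def mix_saturating(channels):
    """Sum decoded channels sample by sample and clamp each sum to [min, max].

    Returns the mixed samples and the number of samples that had to be clamped.
    """
    mixed = []
    clipped = 0
    for samples in zip(*channels):
        total = sum(samples)
        if total > INT16_MAX:
            total = INT16_MAX   # result above the upper limit is set to the upper limit
            clipped += 1
        elif total < INT16_MIN:
            total = INT16_MIN   # result below the lower limit is set to the lower limit
            clipped += 1
        mixed.append(total)
    return mixed, clipped

mixed, clipped = mix_saturating([[20000, -20000, 100], [20000, -20000, 200]])
```

Each clamped sample no longer equals the true sum, which is the waveform distortion the background section attributes to saturation.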

Summary of the invention

In view of this, embodiments of the present invention propose an audio mixing encoding method that can overcome the noise problem of time-domain mixing encoding in the prior art. The audio mixing encoding method comprises the following steps:

An audio mixing flag is set for sound information according to an audio mixing strategy, and the sound information is encoded according to the flag information; the encoding result serves as the core encoded data;

If the audio mixing flag indicates that mixing is needed, dynamic side information is calculated, and an encoded audio code stream comprising the audio mixing flag, the core encoded data and the dynamic side information is generated and output; if the audio mixing flag indicates that mixing is not needed, an encoded audio code stream comprising the audio mixing flag and the core encoded data is generated and output;

The network side receives the encoded audio code streams from the terminals and judges, according to the audio mixing flag in each stream, whether that stream needs mixing; from the M' encoded audio code streams that need mixing, it selects N streams according to their dynamic side information, mixes the core encoded data of the selected N streams, and outputs the mixed encoded audio code stream, where N is less than or equal to M'.

An embodiment of the present invention also proposes a terminal-side encoding method, comprising the following steps:

An audio mixing identifier is set for sound information according to an audio mixing strategy, and the sound information is encoded according to the audio mixing identifier to obtain core encoded data;

If the audio mixing identifier indicates that mixing is needed, dynamic side information is calculated, and an encoded audio code stream comprising the audio mixing identifier, the core encoded data and the dynamic side information is generated and output; if the audio mixing identifier indicates that mixing is not needed, the terminal generates and outputs an encoded audio code stream comprising the audio mixing identifier and the core encoded data.

An embodiment of the present invention also proposes a network-side audio mixing encoding method, comprising the following steps:

M encoded audio code streams are received, and whether each stream needs mixing is judged according to the audio mixing identifier in it; from the M' encoded audio code streams that need mixing, N streams are selected according to their dynamic side information; the core encoded data of the selected N streams are mixed, and the mixed encoded audio code stream is output, where M, M' and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the present invention proposes a multimedia conference system comprising M terminals and a multipoint control unit, wherein:

Each terminal is configured to set an audio mixing flag for the collected sound information according to its local audio mixing strategy and to encode the sound information according to the flag information, the encoding result serving as the core encoded data; and to generate and output either an encoded audio code stream comprising the core encoded data, an audio mixing flag indicating that mixing is needed and dynamic side information, or an encoded audio code stream comprising the core encoded data and an audio mixing flag indicating that mixing is not needed;

The multipoint control unit is configured to receive the encoded audio code streams from the terminals; to judge, according to the value of the audio mixing flag in each stream, whether that stream needs mixing; to select, from the M' audio code streams that need mixing, N audio code streams according to their dynamic side information; to mix the core encoded data of the selected N audio code streams; and to output the mixed encoded audio code stream, where M, M' and N are positive integers, N is less than or equal to M', and M' is less than or equal to M.

An embodiment of the present invention proposes a multimedia conference terminal, comprising:

A sound collection module, configured to collect sound information;

An audio mixing strategy module, configured to set an audio mixing flag for the sound information collected by the sound collection module according to a preset audio mixing strategy;

A core encoding module, configured to encode the sound information and output core encoded data;

A framing module, configured to calculate dynamic side information according to the audio mixing flag set by the audio mixing strategy module and, according to the value of the audio mixing flag, to generate either an encoded audio data frame comprising the core encoded data, the audio mixing flag and the dynamic side information, or an encoded audio data frame comprising the core encoded data and the audio mixing flag;

An output module, configured to output the encoded audio data frames generated by the framing module as an encoded audio code stream.

An embodiment of the present invention proposes a multipoint control unit, comprising:

A selection unit, configured to receive encoded audio code streams from M terminals, to judge according to the value of the audio mixing flag in each stream whether that stream needs mixing, and to select N streams from the M' encoded audio code streams that need mixing according to their dynamic side information;

A mixing unit, configured to mix the core encoded data of the N encoded audio code streams selected by the selection unit and obtain M' mixed audio code streams;

A sending unit, configured to send the audio code streams from the mixing unit to the corresponding destination terminals.

As can be seen from the above technical scheme, the terminal side marks an audio mixing flag in the encoded code stream and adds the corresponding dynamic side information; the network side selects the encoded audio code streams that need mixing according to the audio mixing flag and the dynamic side information and mixes them, which can solve the noise problem of mixing encoding.

Description of drawings

Fig. 1 is a schematic diagram of a multimedia conference system of the prior art;

Fig. 2 is a schematic diagram of the multimedia conference system of an embodiment of the invention;

Fig. 3 is a structural diagram of the encoded data frame in the encoded audio code stream output by the terminal encoder unit of an embodiment of the invention;

Fig. 4 is a flow chart of terminal-side encoding in an embodiment of the invention;

Fig. 5 is a flow chart of MCU-side audio mixing encoding in an embodiment of the invention;

Fig. 6 is a block diagram of a multimedia conference terminal proposed by an embodiment of the invention;

Fig. 7 is a block diagram of a multipoint control unit proposed by an embodiment of the invention.

Embodiment

An embodiment of the present invention proposes an audio mixing encoding method based on an audio mixing flag. In addition to the core encoded code stream that carries speech, the data stream output by a terminal also contains an audio mixing flag and dynamic side information, where the dynamic side information carries the information required for mixing encoding. If the audio mixing flag is set to indicate that mixing is needed, the dynamic side information is set; if the flag is set to indicate that mixing is not needed, the dynamic side information is not set. The MCU uses the audio mixing flag to select the core encoded code streams that need mixing and mixes them.

To make the objects, technical schemes and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

Fig. 2 shows a schematic diagram of the multimedia conference system of an embodiment of the invention. The system comprises M terminals, namely terminal 1, terminal 2, ..., terminal M, and one MCU.

Taking terminal 1 as an example, the terminal comprises an encoder unit 201, which encodes the sound collected by the sound collection device of terminal 1, such as a microphone, and generates a core encoded code stream carrying the sound information. The encoder unit 201 also sets the audio mixing flag according to a locally configured audio mixing strategy. The audio mixing strategy determines whether the encoded sound output by the terminal needs mixing, and different strategies can be configured as needed. For example, different priorities can be assigned to different terminals, so that audio code streams from high-priority terminals are mixed preferentially; a sound energy threshold can also be set, so that the audio code stream of a terminal is mixed when the sound energy it collects exceeds the threshold. Several audio mixing strategies can be used at the same time.

If the audio mixing flag indicates that mixing is needed, the encoder unit 201 also generates dynamic side information and writes it into the audio code stream; if the flag indicates that mixing is not needed, the audio code stream output by the encoder unit 201 contains only the core encoded data and the audio mixing flag.
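By way of illustration only, the two example strategies (terminal priority and sound energy threshold) might be combined as in the following Python sketch. The function name `decide_mix_flag`, the default values and the choice to require both conditions at once are assumptions; the text does not specify how simultaneous strategies are combined.

```python
def decide_mix_flag(priority, frame_energy, min_priority=1, energy_threshold=100.0):
    """Return True if the terminal should mark its frame as needing mixing.

    priority         -- priority assigned to this terminal (higher is more important)
    frame_energy     -- sound energy of the collected frame
    min_priority     -- hypothetical lowest priority that is still mixed
    energy_threshold -- hypothetical energy threshold
    """
    high_priority = priority >= min_priority
    loud_enough = frame_energy > energy_threshold
    # Assumption: both strategies must agree; an OR combination is equally possible.
    return high_priority and loud_enough
```

A terminal with priority 2 and frame energy 500.0 would set the flag under these defaults, while a terminal with the same priority and frame energy 50.0 would not.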

Fig. 3 shows the structure of an encoded data frame in the encoded audio code stream output by the terminal encoder unit of an embodiment of the invention. Let the total length of a data frame be n bits. When the audio mixing flag indicates that mixing is needed, the frame is as shown in the upper diagram of Fig. 3 and comprises a t-bit audio mixing flag, m bits of dynamic side information, and n-m-t bits of core encoded data. The audio mixing flag is placed at the head of the frame so that the MCU can identify it easily. When the flag indicates that mixing is not needed, the frame is as shown in the lower diagram of Fig. 3 and comprises a t-bit audio mixing flag and n-t bits of core encoded data.

For G.711 narrowband enhancement layer (Low Band Enhance, LBE) coding, the parts in Fig. 3 may take the following values: t=1, n=80, m=9.

The side information comprises a frame energy (Frame Energy) and a voicing score (Voicing score). If the side information is 9 bits long, 6 bits carry the quantized frame energy and 3 bits carry the quantized voicing score.
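The two frame layouts of Fig. 3 can be sketched in Python as follows, using the example values t=1, n=80, m=9. Bits are modelled as strings of "0" and "1" for readability. The flag polarity (1 means mixing is needed), the placement of the side information directly after the flag, and the function names are assumptions made for illustration.

```python
N_BITS, T_BITS, M_BITS = 80, 1, 9  # n, t, m from the example values in the text

def core_length(mix_needed):
    """Bits left for core encoded data: n-m-t with side information, n-t without."""
    return N_BITS - T_BITS - (M_BITS if mix_needed else 0)

def build_frame(mix_needed, core_bits, side_bits=""):
    """Assemble one frame: flag at the head, then side information (if any), then core data."""
    expected_side = M_BITS if mix_needed else 0
    if len(side_bits) != expected_side:
        raise ValueError("side information length does not match the flag")
    if len(core_bits) != core_length(mix_needed):
        raise ValueError("core data length does not match the flag")
    flag = "1" if mix_needed else "0"  # assumed polarity
    return flag + side_bits + core_bits

def parse_frame(frame):
    """Split a frame back into (mix_needed, side_bits, core_bits), as an MCU might."""
    if len(frame) != N_BITS:
        raise ValueError("frame must be exactly n bits")
    mix_needed = frame[0] == "1"
    side_end = T_BITS + (M_BITS if mix_needed else 0)
    return mix_needed, frame[T_BITS:side_end], frame[side_end:]
```

With these values the core encoded data occupy 70 bits when mixing is needed and 79 bits otherwise, so a frame that is not mixed gains 9 bits for core encoding.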
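A minimal sketch of the 9-bit side information word, assuming the frame energy and the voicing score have already been quantized to a 6-bit index and a 3-bit index respectively. The quantizers themselves are not specified in the text and are not shown; placing the energy index in the high-order bits and the function names are assumptions.

```python
ENERGY_BITS, VOICING_BITS = 6, 3  # 6 + 3 = 9 bits of side information

def pack_side_info(energy_index, voicing_index):
    """Pack the two quantized indices into one 9-bit integer (energy in the high bits)."""
    if not 0 <= energy_index < (1 << ENERGY_BITS):
        raise ValueError("energy index must fit in 6 bits")
    if not 0 <= voicing_index < (1 << VOICING_BITS):
        raise ValueError("voicing index must fit in 3 bits")
    return (energy_index << VOICING_BITS) | voicing_index

def unpack_side_info(word):
    """Recover (energy_index, voicing_index) from a 9-bit side information word."""
    if not 0 <= word < (1 << (ENERGY_BITS + VOICING_BITS)):
        raise ValueError("side information word must fit in 9 bits")
    return word >> VOICING_BITS, word & ((1 << VOICING_BITS) - 1)
```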

The frame energy is calculated by formula (1):

$$\mathrm{Frame\_Energy} = \frac{\sum_{i=0}^{\mathrm{Frame\_Length}-1} S^{2}(i)}{\mathrm{Frame\_Length}} \qquad (1)$$

Frame_Length is the frame length, S(i) is the low-band signal obtained through a quadrature mirror filter (Quadrature Mirror Filter, QMF), and i is the index of the sample within the frame.
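Formula (1) is the mean of the squared samples of one frame. A minimal Python sketch follows, assuming the input list already holds the QMF low-band samples S(i) of a single frame; the QMF analysis itself is not shown and the function name is illustrative.

```python
def frame_energy(samples):
    """Mean squared value of the frame: sum of S(i)^2 divided by Frame_Length."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return sum(s * s for s in samples) / len(samples)

energy = frame_energy([1, -1, 2, -2])  # (1 + 1 + 4 + 4) / 4 = 2.5
```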

The voicing score is calculated by formula (2):

$$\mathrm{Voicing\_Score} = \frac{\mathrm{Zero\_Crossing\_Rate}}{\mathrm{Scale\_Factor}} \qquad (2)$$

Here, the zero-crossing rate (Zero_Crossing_Rate) is the number of times the time-domain waveform crosses zero within 10 ms. The scale factor (Scale_Factor) is a preset scaling constant whose value lies in [0, 1].

Depending on actual conditions, the dynamic side information can also be set to other quantities that can serve as a basis for the mixing decision; for example, it can be set to voice activity detection (VAD).
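Formula (2) can be sketched in Python as follows, assuming the input holds the time-domain samples of one 10 ms window. Counting a zero crossing as a sign change between consecutive samples, treating a sample equal to zero as non-negative, and rejecting a scale factor of 0 (which would make the division undefined) are assumptions made for illustration.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count

def voicing_score(samples, scale_factor):
    """Zero_Crossing_Rate divided by Scale_Factor, as in formula (2)."""
    if not 0 < scale_factor <= 1:
        raise ValueError("scale factor must lie in (0, 1]")
    return zero_crossings(samples) / scale_factor

score = voicing_score([1, -1, 1, 1, -2], 0.5)  # 3 crossings / 0.5 = 6.0
```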

After the audio code stream output by a terminal is sent to the MCU, it first enters the selection unit 202. The selection unit 202 identifies the audio mixing flag in the received encoded audio code stream and, according to its value, determines whether this stream needs mixing. If mixing is not needed, the selection unit 202 outputs the stream to the corresponding destination terminal. From all M' (M' less than or equal to M) encoded audio code streams that need mixing, the selection unit 202 selects N (N less than or equal to M') streams according to their dynamic side information. These streams are sent to their respective decoders; after decoding they are sent to the mixing unit 203 for mixing, which yields M' mixed audio code streams. Each of the M' audio code streams is then encoded by an encoder and sent to the corresponding terminal.

The terminal-side encoding process of an embodiment of the invention is shown in Fig. 4 and comprises the following steps:

Step 401: An audio mixing flag is set for the collected sound information according to the local audio mixing strategy; the sound information is then encoded, and the encoding result serves as the core encoded data.

Step 402: If the audio mixing flag is set to indicate that mixing is needed, dynamic side information is calculated; the frame energy and the voicing score can be calculated according to formulas (1) and (2) above and used as the dynamic side information.

Step 403: The encoded audio code stream is generated and output. Specifically, if the audio mixing flag is valid, an encoded audio data frame comprising the audio mixing flag, the core encoded data and the dynamic side information is generated; if the audio mixing flag is invalid, an encoded audio data frame comprising the audio mixing flag and the core encoded data is generated. The audio mixing flag is placed at the front of the data frame and is preferably 1 bit long.

The MCU-side audio mixing encoding process of an embodiment of the invention is shown in Fig. 5 and comprises the following steps:

Step 501: The MCU receives the encoded audio code streams from the terminals and judges, from the value of the audio mixing flag in each stream, whether that stream needs mixing. If mixing is not needed, step 502 is executed; if mixing is needed, step 503 is executed.

Step 502: The encoded audio code stream is sent directly to the corresponding destination terminal, and processing of this stream ends.

Step 503: For the encoded audio code streams received at the same time from M' terminals whose audio mixing flags indicate that mixing is needed, the MCU selects N streams according to the dynamic side information in these streams and discards the remaining M'-N streams, where N is less than or equal to M'.

For example, the selection can be based on the energy value in the side information: a stream whose energy is greater than a threshold T is mixed, and a stream whose energy is less than T is not mixed.

Step 504: The core encoded data of the selected N encoded audio code streams are decoded separately, and the decoded data are mixed to obtain M' mixed audio code streams.

Step 505: The M' mixed audio code streams are encoded separately, and the resulting M' encoded, mixed audio code streams are sent to the M' destination terminals respectively.
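The classification and selection performed by the MCU before decoding (forwarding streams that do not need mixing, then selecting N of the M' flagged streams by energy) can be sketched as below. Each received frame is modelled as a dictionary with hypothetical keys `terminal`, `mix_flag` and `energy`; the energy threshold rule is the example given in the text, and decoding, mixing and re-encoding are not shown.

```python
def classify_and_select(frames, threshold):
    """Split received frames into forwarded, selected-for-mixing and dropped terminals.

    frames    -- list of dicts with 'terminal', 'mix_flag' and, when flagged, 'energy'
    threshold -- energy threshold T used to pick the N streams to mix
    """
    forwarded, selected, dropped = [], [], []
    for frame in frames:
        if not frame["mix_flag"]:
            forwarded.append(frame["terminal"])   # no mixing needed: pass straight through
        elif frame["energy"] > threshold:
            selected.append(frame["terminal"])    # one of the N streams to be mixed
        else:
            dropped.append(frame["terminal"])     # one of the M'-N discarded streams
    return forwarded, selected, dropped

frames = [
    {"terminal": 1, "mix_flag": False},
    {"terminal": 2, "mix_flag": True, "energy": 50},
    {"terminal": 3, "mix_flag": True, "energy": 5},
]
forwarded, selected, dropped = classify_and_select(frames, threshold=10)
```

In this example M' is 2 and N is 1: terminal 1 is forwarded unchanged, terminal 2 is selected for mixing, and terminal 3 is discarded. Because only the N selected streams are decoded, the number of summed signals, and with it the risk of overflow, is reduced.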

Fig. 6 shows a multimedia conference terminal proposed by an embodiment of the invention, comprising:

A sound collection module 601, configured to collect sound information;

An audio mixing strategy module 602, configured to set an audio mixing flag for the sound information collected by the sound collection module 601 according to a preset audio mixing strategy;

A core encoding module 603, configured to encode the sound information and output core encoded data. If the audio mixing strategy module 602 sets the audio mixing flag to indicate that mixing is not needed, the core encoding module 603 need not reserve bits for the dynamic side information when encoding; if the flag is set to indicate that mixing is needed, the core encoding module 603 must take the bit allocation of the dynamic side information into account. For example, if an encoded data frame has n bits in total, the audio mixing flag has t bits and the dynamic side information has m bits, then the core encoded data produced by the core encoding module 603 are n-t bits long when the dynamic side information need not be considered, and n-m-t bits long when it must be considered.

A framing module 604, configured to calculate dynamic side information according to the audio mixing flag set by the audio mixing strategy module 602 and, according to the value of the audio mixing flag, to generate either an audio data frame comprising the core encoded data, the audio mixing flag and the dynamic side information, or an audio data frame comprising the core encoded data and the audio mixing flag;

An output module 605, configured to output the audio data frames generated by the framing module 604 as an encoded audio code stream.

Fig. 7 shows a multipoint control unit proposed by an embodiment of the invention, comprising:

A selection unit 701, configured to receive encoded audio code streams from M terminals, to judge according to the value of the audio mixing flag in each stream whether that stream needs mixing, and to select N streams from the M' encoded audio code streams that need mixing according to their dynamic side information;

A mixing unit 702, configured to mix the core encoded data of the N encoded audio code streams selected by the selection unit and obtain M' mixed audio code streams;

A sending unit 703, configured to send the audio code streams from the mixing unit to the corresponding destination terminals.

The selection unit 701 sends the encoded audio code streams that do not need mixing to the sending unit 703; the sending unit 703 then sends the encoded audio code streams from the selection unit to the corresponding destination terminals.

The multipoint control unit further comprises: a decoder 704, configured to decode the core encoded data of the encoded audio code streams selected by the selection unit 701 and to send the decoded data to the mixing unit 702;

An encoder 705, configured to encode the mixed audio code streams from the mixing unit 702 and to send the resulting encoded audio code streams to the sending unit 703.

The scheme of the embodiments marks an audio mixing flag in the encoded code stream, adds the corresponding dynamic side information, and allocates the bits of the side information dynamically according to the audio mixing flag. The MCU selects the encoded audio code streams that need mixing according to the audio mixing flag and the dynamic side information and mixes them, which addresses both signal overflow and the error introduced when large signals are mixed, and reduces the computational complexity of the MCU. When mixing is not performed, the bits of the code stream can be fully used to improve the quality of the core encoding. The scheme can be used in mixing systems and can also use the codecs of common coding and decoding systems; it facilitates intelligent control of the encoded code stream and enhances the interactivity of the MCU.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An audio mixing encoding method, characterized by comprising the following steps:

2. The method according to claim 1, characterized in that the dynamic side information comprises a frame energy, a voicing score and/or voice activity detection.

3. The method according to claim 2, characterized in that calculating the dynamic side information comprises: according to the formula

$$\mathrm{Frame\_Energy} = \frac{\sum_{i=0}^{\mathrm{Frame\_Length}-1} S^{2}(i)}{\mathrm{Frame\_Length}}$$

calculating the frame energy, where Frame_Energy is the frame energy, S(i) is the low-band signal obtained through a quadrature mirror filter (QMF), and i is the index of the sample within the frame.

4. The method according to claim 2, characterized in that calculating the dynamic side information comprises: according to the formula

$$\mathrm{Voicing\_Score} = \frac{\mathrm{Zero\_Crossing\_Rate}}{\mathrm{Scale\_Factor}}$$

calculating the voicing score, where Zero_Crossing_Rate is the number of times the time-domain waveform of the sound information crosses zero within a predetermined time, and Scale_Factor is a preset scaling constant whose value lies in [0, 1].

5. The method according to claim 1, characterized in that, if the result of judging according to the audio mixing flag whether the encoded audio code stream needs mixing is that mixing is not performed on the stream, the method further comprises: outputting the encoded audio code stream to the destination terminal.

6. The method according to any one of claims 1 to 5, characterized in that mixing the core encoded data of the selected N encoded audio code streams and outputting the mixed audio code stream comprises: decoding the core encoded data in the selected N audio code streams separately, and mixing the decoded N channels of data to obtain M' mixed audio code streams; encoding the M' mixed audio code streams separately, and sending the resulting M' encoded, mixed audio code streams to the M' destination terminals respectively.

7. A terminal-side encoding method, characterized by comprising the following steps:

8. A network-side audio mixing encoding method, characterized by comprising the following steps:

9. A multimedia conference system comprising M terminals and a multipoint control unit, characterized in that,

10. A multimedia conference terminal, characterized by comprising:

A sound collection module, configured to collect sound information;

11. A multipoint control unit, characterized by comprising:

12. The multipoint control unit according to claim 11, characterized in that the selection unit sends the encoded audio code streams that do not need mixing to the sending unit; the sending unit then sends the encoded audio code streams from the selection unit to the corresponding destination terminals.

13. The multipoint control unit according to claim 11 or 12, characterized in that the multipoint control unit further comprises: a decoder, configured to decode the core encoded data of the encoded audio code streams selected by the selection unit and to send the decoded data to the mixing unit;

An encoder, configured to encode the mixed audio code streams from the mixing unit and to send the resulting encoded audio code streams to the sending unit.