CN104167210A

CN104167210A - Lightweight multi-party conference audio mixing method and device

Info

Publication number: CN104167210A
Application number: CN201410414450.5A
Authority: CN
Inventors: 王田; 蔡奕侨; 钟必能; 陈永红; 田晖; 张国亮
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2014-08-21
Filing date: 2014-08-21
Publication date: 2014-11-26

Abstract

Provided are a lightweight multi-party conference audio mixing method and device. The method comprises the following steps: (1) the client encodes the voice PCM data with an AMR encoder and obtains the encoded data and its data length; the voice PCM data are divided into frames, the speech energy value of each frame is computed, each frame is determined to be a speech frame or a non-speech frame according to its speech energy value and data length, and the probability value of speech frames in the voice PCM data is obtained statistically; and (2) the server selects, according to the received speech probability values, the current voice streams of the two speakers with the highest speech probability values, determines from these two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the resulting voice packet. The method compensates for the weak computing capacity of portable devices such as mobile phones while greatly reducing the server's computation for audio mixing, and can be widely used in multimedia multi-party conference systems.

Description

A lightweight multi-party conference audio mixing method and device

Technical field

The present invention relates to the technical field of multi-party conference communication, and in particular to a lightweight multi-party conference audio mixing method and device.

Background art

In a multi-party video conference system, audio mixing is an important technology. Audio mixing combines the audio of multiple audio sources into a single output audio stream according to the audio superposition principle, so that the recipient perceives a multi-person conversation.

Audio mixing can be implemented in the media controller, i.e. on the server side, or in the terminal, i.e. on the client side.

When mixing is implemented directly on the server side, each client encodes its own PCM voice signal with an encoder and sends it to the server. The server first decodes the audio from the multiple sources, mixes it into a single stream according to the audio superposition principle, and then re-encodes that stream for output, so that the recipient perceives a multi-person conversation. However, because the server must decode multiple streams and then encode again, both the amount of computation and the time complexity are large, and so is the resulting delay. This limits the range of application of the scheme.
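The superposition principle mentioned above can be illustrated with a minimal sketch. It assumes 16-bit signed PCM samples and clamps the sum to the int16 range; the function name `mix_superposition` and the clamping strategy are illustrative assumptions, not details given in this document.

```python
INT16_MIN, INT16_MAX = -32768, 32767  # assumed sample format: 16-bit signed PCM


def mix_superposition(frame_a, frame_b):
    """Mix two equal-length PCM frames by adding them sample by sample.

    The sum is clamped to the int16 range so that loud passages saturate
    instead of wrapping around (an assumed overflow policy).
    """
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must contain the same number of samples")
    return [
        max(INT16_MIN, min(INT16_MAX, a + b))
        for a, b in zip(frame_a, frame_b)
    ]


mixed = mix_superposition([1000, -2000, 30000], [500, -500, 10000])
print(mixed)
```

The last sample in the usage line exceeds the int16 range after addition and is therefore clamped.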

When mixing is implemented directly in the terminal, each client encodes its PCM voice signal with an encoder and sends it to the server; the server forwards each terminal's audio to all terminals except its source, and each terminal mixes all the audio streams it receives. This scheme has two drawbacks. First, the computational load of mixing falls on each terminal, and mobile terminals with weak computing power cannot bear it. Second, the voice packets of every terminal must be forwarded to all terminals other than the source, which consumes network bandwidth.
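The bandwidth cost of terminal-side mixing can be made concrete by counting the streams the server sends out. The sketch below assumes every one of the n terminals is transmitting and that server-side mixing sends a single mixed stream to each terminal; the function names are hypothetical.

```python
def downlink_streams_terminal_mixing(n):
    """Each terminal's stream is forwarded to the other n - 1 terminals."""
    return n * (n - 1)


def downlink_streams_server_mixing(n):
    """Assumed: the server sends one mixed stream to each of the n terminals."""
    return n


for n in (3, 5, 10):
    print(n, downlink_streams_terminal_mixing(n), downlink_streams_server_mixing(n))
```

Under these assumptions the forwarded traffic grows quadratically with the number of terminals when mixing is done in the terminals, and only linearly when it is done on the server.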

There are also schemes that require no encoding or decoding: the terminal sends its voice packets directly to the server, which then performs the mixing. Because the terminal sends voice packets without encoding them, these schemes consume a great deal of network bandwidth.

Summary of the invention

The main purpose of the present invention is to address the practical application needs of multi-party conferencing while taking into account the characteristics of small portable devices such as mobile phones, by proposing a novel, simple, fast, real-time and lightweight multi-party conference audio mixing method and device.

The present invention adopts the following technical scheme:

A lightweight multi-party conference audio mixing method, characterized in that: 1) the client encodes the voice PCM data with an AMR encoder and obtains the encoded data and its data length, divides the voice PCM data into frames, calculates the speech energy value of each frame, and determines from that frame's speech energy value and data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data; 2) the server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, determines from the magnitudes of these two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the voice packet after mixing.

Preferably, the following are preset: the client captures one frame of the voice signal at regular intervals, each frame contains m sample values, and the energy of each sample value is r_i; the statistical window contains n consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer. Step 1) specifically comprises:

1.1) the client takes as input the voice PCM data and the output length after AMR encoding, and calculates the energy value of the current frame of voice PCM data;

1.2) judge whether the output length of the current frame after AMR encoding equals 31; if so, record the energy value of this frame as a speech energy reference value, judge the frame to be a speech frame, add it to the statistical window, and go to step 1.4); if not, record the energy value of this frame as a non-speech energy reference value and go to step 1.3);

1.3) judge whether the energy value of the current frame is greater than its relative energy reference value E_refer; if so, judge the frame to be a speech frame; if not, judge it to be a non-speech frame; add the frame to the statistical window and go to step 1.4);

1.4) judge whether the statistical window is full; if so, calculate the proportion of speech frames in the statistical window and express it as a speech probability value from 0 to 100; if not, move to the next frame and return to step 1.1).
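Steps 1.1) to 1.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the frame energy is assumed to be the sum of squared sample values (the document's own energy formula is not reproduced in this text), the window is assumed to be emptied after each probability value is produced, and the class and function names are hypothetical. The relative energy reference value is passed in by the caller.

```python
VOICE_NSIZE = 31  # AMR output length that step 1.2) treats as a speech frame


def frame_energy(samples):
    # Assumption: energy = sum of squared samples; the source formula is not shown.
    return sum(s * s for s in samples)


class SpeechProbabilityEstimator:
    def __init__(self, window_size):
        self.window_size = window_size  # n frames per statistical window
        self.window = []                # True for speech frame, False otherwise

    def push(self, samples, nsize, e_refer):
        """Process one frame; return 0..100 when the window fills, else None."""
        energy = frame_energy(samples)            # step 1.1)
        if nsize == VOICE_NSIZE:                  # step 1.2)
            is_speech = True
        else:                                     # step 1.3)
            is_speech = energy > e_refer
        self.window.append(is_speech)
        if len(self.window) < self.window_size:   # step 1.4): window not full
            return None
        probability = 100 * sum(self.window) // self.window_size
        self.window.clear()  # assumption: start a fresh window afterwards
        return probability


est = SpeechProbabilityEstimator(window_size=4)
frames = [([0] * 160, 31), ([0] * 160, 6), ([5] * 160, 6), ([0] * 160, 1)]
results = [est.push(samples, nsize, e_refer=10) for samples, nsize in frames]
print(results)
```

In the usage lines, two of the four frames count as speech (one by its encoded length, one by its energy), so the window reports a speech probability value of 50 on the fourth frame.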

Preferably, let E_noise denote the maximum of the non-speech energy reference values over the n consecutive frames preceding the current frame, and let E_voice denote the maximum of the speech energy reference values; the relative energy reference value E_refer of the current frame is then calculated by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.
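As a small worked sketch of this formula, the function below takes the recorded non-speech and speech energy reference values of the preceding frames and returns the threshold. The function name and the list-based inputs are illustrative assumptions; the document does not say what happens when one of the lists is empty, so this sketch simply lets `max` raise an error in that case.

```python
def energy_reference(noise_energies, voice_energies):
    """Return E_refer = E_noise + (E_voice - E_noise) / 10.

    noise_energies: non-speech energy reference values of the preceding frames
    voice_energies: speech energy reference values of the preceding frames
    """
    e_noise = max(noise_energies)  # raises ValueError if the list is empty
    e_voice = max(voice_energies)
    return e_noise + (e_voice - e_noise) / 10


print(energy_reference([100, 200], [1200, 900]))
```

With E_noise = 200 and E_voice = 1200 the threshold sits one tenth of the way from the noise level to the speech level, i.e. at 300.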

Preferably, step 2) is specifically as follows:

2.1) the server receives the speech probability values sent by the clients and selects the two voice streams F1 and F2 with the highest speech probability values, denoted P1 and P2 respectively, with P1 > P2;

2.2) judge whether P1 > 2·P2 holds; if so, output only the voice stream corresponding to P1; if not, mix the two voice streams and output the result.
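The server-side decision in steps 2.1) and 2.2) can be sketched as below. The function name, the dictionary keyed by client identifier, and the handling of fewer than two clients are assumptions made for illustration only.

```python
def select_streams(probabilities):
    """Return the client ids whose streams should be output.

    probabilities: dict mapping client id -> speech probability value (0..100).
    One id means "forward this stream alone"; two ids mean "mix these two".
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)[:2]
    if len(ranked) < 2:  # assumption: with fewer than two clients, forward as-is
        return [client for client, _ in ranked]
    (c1, p1), (c2, p2) = ranked       # step 2.1): the two highest values
    if p1 > 2 * p2:                   # step 2.2): one dominant speaker
        return [c1]
    return [c1, c2]                   # otherwise mix the two streams


print(select_streams({"a": 90, "b": 30, "c": 10}))
print(select_streams({"a": 60, "b": 50, "c": 10}))
```

In the first call one speaker dominates and only that stream is forwarded; in the second the two highest values are close, so both streams are selected for mixing.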

A lightweight multi-party conference audio mixing device, comprising a client and a server, characterized in that:

The client comprises: an AMR encoder for encoding the voice PCM data to obtain the encoded data and its data length; a speech energy calculation unit for calculating the speech energy value of each frame of the voice PCM data; a decision unit that determines from the speech energy value and data length whether a frame is a speech frame or a non-speech frame; and a statistics unit that computes the probability value of speech frames within the statistical window of the voice PCM data;

The server comprises: a receiving and selecting unit for receiving the speech probability values and selecting the voice streams of the two speakers whose current speech probability values are highest; an audio mixing unit that determines from the magnitudes of these two speech probability values whether to mix the at most two selected voice streams using the superposition principle; and a sending unit that forwards the voice packets.

As can be seen from the above description, the present invention has the following beneficial effects compared with the prior art:

1. A probability analysis approach is adopted: the client analyses its voice stream, and the server makes decisions using the speech probability values it receives. This makes full use of the resources of both server and client, letting them share the computational load. The algorithm is simple, easy to implement, and scales well;

2. The computational load on both server and client is small, and the response time is short. The client only needs to perform AMR encoding, calculate the energy value of each data frame, and judge whether each frame is voice, silence or noise. The server only needs to compare the speech probability values of the clients; most of the time it does not need to perform any mixing or encoding, and at most it needs to mix 2 voice streams.

3. The method has a wide range of application and suits lightweight devices such as PDAs and mobile phones.

Brief description of the drawings

Fig. 1 is a workflow diagram of the client of the present invention;

Fig. 2 is a workflow diagram of the server of the present invention.

Embodiment

The invention is further described below by way of an embodiment.

To address the practical application needs of multi-party conferencing while taking into account the characteristics of small portable devices such as mobile phones, a novel, simple, fast, real-time and lightweight multi-party conference audio mixing method is proposed. The basic idea of the scheme follows from the nature of conference speech: in most cases only one person is speaking, at most two people speak at the same time, and everyone else is listening. Therefore, the present invention selects at most the two voice streams with the highest speech probability, mixes them on the server, and sends the mixed voice to the clients. The clients thus do not need to perform any mixing, and the mixing computation on the server is also small.

In this lightweight multi-party conference audio mixing method, the following are preset: the client captures one frame of the voice signal every 20 ms, each frame contains 160 sample values, and the energy of each sample value is r_i; the statistical window contains 20 consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer. Let E_noise denote the maximum of the non-speech energy reference values over the 20 consecutive frames preceding the current frame, and let E_voice denote the maximum of the speech energy reference values; the relative energy reference value E_refer of the current frame is then calculated by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.

Here, if the current frame is one of the first 20 frames after the session starts, fewer preceding frames are available: for the 2nd frame, the energy value of the 1 preceding frame is used as the relative energy reference value; for the 3rd frame, the corresponding values of the preceding frames 1 and 2 are substituted into the formula; and so on.
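The preset values of this embodiment (20 ms frames, 160 samples per frame, 20 frames per window) imply a few derived quantities that are worth spelling out. The short sketch below only performs that arithmetic; the constant names are illustrative.

```python
FRAME_MS = 20            # one frame captured every 20 ms
SAMPLES_PER_FRAME = 160  # sample values per frame
WINDOW_FRAMES = 20       # frames per statistical window

sample_rate_hz = SAMPLES_PER_FRAME * 1000 // FRAME_MS  # samples per second
window_ms = WINDOW_FRAMES * FRAME_MS                   # duration of one window
probability_step = 100 / WINDOW_FRAMES                 # granularity of the 0..100 value

print(sample_rate_hz, window_ms, probability_step)
```

So the presets correspond to 8000 samples per second, each speech probability value summarises 400 ms of audio, and the value moves in steps of 5 because each of the 20 frames contributes one twentieth of the window.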

The method comprises the following steps:

1) The client encodes the voice PCM data with an AMR encoder and obtains the encoded data and its data length, divides the voice PCM data into frames, calculates the speech energy value of each frame, and determines from that frame's speech energy value and data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data. After the AMR encoder encodes the voice PCM data, the encoded data and its data length are obtained; the data length is denoted nsize. According to the output rules of AMR encoding, nsize takes only three values: nsize = 1 indicates the silence state, nsize = 6 indicates the noise state, and nsize = 31 indicates the voice state. This division is not accurate, however: when nsize is 31 the frame is essentially always voice, but when nsize is 6 the frame may also be voice. Therefore, when nsize is 6, the energy value of the PCM data must also be analysed. With reference to Fig. 1, the flow is as follows:

1.3) judge whether the energy value of this frame is greater than the relative energy reference value E_refer; if so, judge the frame to be a speech frame; if not, judge it to be a non-speech frame; add the frame to the statistical window and go to step 1.4).
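The relationship between the three nsize values and the final speech decision can be sketched as below. The enum and function names are hypothetical. Following step 1.3), the sketch applies the energy test to every frame whose nsize is not 31; rejecting any other length with an error is an assumption based on the statement that nsize takes only three values.

```python
from enum import Enum


class AmrState(Enum):
    SILENCE = 1   # nsize == 1
    NOISE = 6     # nsize == 6, may still turn out to be voice
    VOICE = 31    # nsize == 31


def classify_frame(nsize, energy, e_refer):
    """Return (state reported by the encoder, whether the frame counts as speech)."""
    state = AmrState(nsize)  # raises ValueError for any other length
    if state is AmrState.VOICE:
        return state, True
    # For non-voice lengths, fall back to the energy comparison of step 1.3).
    return state, energy > e_refer


print(classify_frame(31, 0, 100))
print(classify_frame(6, 500, 100))
print(classify_frame(6, 50, 100))
```

The second call shows the case the text warns about: the encoder reports the noise state, but the frame energy exceeds the reference value, so the frame is still counted as speech.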

2) The server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, determines from the magnitudes of these two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the voice packet after mixing. Specifically, with reference to Fig. 2, the flow is as follows:

The present invention also proposes a lightweight multi-party conference audio mixing device comprising a client and a server.

The client comprises: an AMR encoder for encoding the voice PCM data to obtain the encoded data and its data length; a speech energy calculation unit for calculating the speech energy value of each frame of the voice PCM data; a decision unit that determines from the speech energy value and data length whether a frame is a speech frame or a non-speech frame; and a statistics unit that computes the probability value of speech frames within the statistical window of the voice PCM data.

The device performs encoding on the client and distinguishes speech frames from non-speech frames using two factors, the speech energy value and the size of the AMR-encoded data, thereby computing the speech probability value. The server determines the voice streams of the current speakers (at most two) from the speech probability values, mixes the at most two selected streams using the superposition principle, and finally forwards the voice packet after mixing. The method compensates for the weak computing power of small portable devices such as mobile phones while greatly reducing the server's computation for audio mixing, and can be widely used in multimedia multi-party conference systems.

The above is only a specific embodiment of the present invention, but the design concept of the present invention is not limited thereto; any insubstantial modification of the present invention made using this concept shall be regarded as an act infringing the scope of protection of the present invention.

Claims

1. A lightweight multi-party conference audio mixing method, characterized in that: 1) the client encodes the voice PCM data with an AMR encoder and obtains the encoded data and its data length, divides the voice PCM data into frames, calculates the speech energy value of each frame, and determines from that frame's speech energy value and data length whether the frame is a speech frame or a non-speech frame, thereby computing the probability value of speech frames in the voice PCM data; 2) the server selects, from the received speech probability values, the voice streams of the two speakers whose current speech probability values are highest, determines from the magnitudes of these two speech probability values whether to mix the at most two selected voice streams using the superposition principle, and finally forwards the voice packet after mixing.

2. The lightweight multi-party conference audio mixing method as claimed in claim 1, characterized in that the following are preset: the client captures one frame of the voice signal at regular intervals, each frame contains m sample values, and the energy of each sample value is r_i; the statistical window contains n consecutive frames of the voice signal, and the relative energy reference value of the current frame is E_refer; step 1) specifically comprises:

1.4) judge whether the statistical window is full; if so, calculate the proportion of speech frames in the statistical window and express it as a speech probability value from 0 to 100; if not, move to the next frame and return to step 1.1).

3. The lightweight multi-party conference audio mixing method as claimed in claim 2, characterized in that: E_noise denotes the maximum of the non-speech energy reference values over the n consecutive frames preceding the current frame, E_voice denotes the maximum of the speech energy reference values, and the relative energy reference value E_refer of the current frame is calculated by the following formula:

E_refer = E_noise + (E_voice - E_noise) / 10.

4. The lightweight multi-party conference audio mixing method as claimed in claim 1, characterized in that step 2) is specifically as follows:

5. A lightweight multi-party conference audio mixing device, comprising a client and a server, characterized in that: