CN110677208B

CN110677208B - Sound mixing method and system for conference system

Info

Publication number: CN110677208B
Application number: CN201910860802.2A
Authority: CN
Inventors: 周建明; 康元勋; 冯万健
Original assignee: Xiamen Yealink Network Technology Co Ltd
Current assignee: Xiamen Yilian Communication Technology Co ltd
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2021-06-25
Anticipated expiration: 2039-09-11
Also published as: CN110677208A

Abstract

The invention provides a sound mixing method and system for a conference system. The method comprises: in response to detecting the voice signals of the members, calculating the voice energy of the voice signal at the voice input end of each member; dividing the members into a first set and a second set based on the voice energy and a preset first threshold value, wherein members whose voice energy is greater than or equal to the first threshold value are placed in the first set and members whose voice energy is less than the first threshold value are placed in the second set; in response to the difference between the maximum value of the voice energy in the second set and the minimum value of the voice energy in the first set being greater than or equal to a preset second threshold value, exchanging the member with the maximum voice energy in the second set with the member with the minimum voice energy in the first set, and updating the first set and the second set; and mixing the voice signals in the first set and outputting the result. Dynamic grouping ensures that meaningful voice signals are output and prevents excessive noise from degrading audio quality.

Description

Sound mixing method and system for conference system

Technical Field

The invention relates to the field of computer technology application, in particular to a sound mixing method and system for a conference system.

Background

MCU stands for Multipoint Control Unit. An MCU is required in order to implement a multipoint conference television system. The MCU is essentially a multimedia information switch: it performs multipoint calling and connection, implements functions such as video broadcasting, video selection, audio mixing and data broadcasting, and completes the tandem connection and switching of the signals of each terminal. The MCU differs from a conventional switch in that the switch performs point-to-point connection of signals, while the MCU performs multipoint-to-multipoint switching, tandem connection or broadcasting.

With the development of network communication technology, the research and application of multi-user voice systems have become a current hotspot, and such systems play an important role in both work and entertainment: for example, a network conference requires several people to speak by voice. One of the most important technologies of a multi-user voice system is multi-channel mixing, which mixes audio signals from multiple sources, each of which occupies one channel.

In a common sound mixing algorithm, once the number of mixing channels is configured, all configured audio streams are mixed. Under heavy background noise, when the number of mixed channels exceeds a certain number, the mixing quality is poor and the speech content is difficult to hear clearly.

Disclosure of Invention

The invention provides a sound mixing method and a sound mixing system for a conference system.

In one aspect, the present invention provides a mixing method for a conference system, including the steps of:

S1: in response to detecting the voice signals of the members, calculating the voice energy of the voice signal at the voice input end of each member;

S2: dividing the members into a first set and a second set based on the voice energy and a preset first threshold value, wherein members whose voice energy is greater than or equal to the first threshold value are placed in the first set, and members whose voice energy is less than the first threshold value are placed in the second set;

S3: in response to the difference between the maximum value of the voice energy in the second set and the minimum value of the voice energy in the first set being greater than or equal to a preset second threshold value, exchanging the member with the maximum voice energy in the second set with the member with the minimum voice energy in the first set, and updating the first set and the second set;

S4: mixing the voice signals in the updated first set and outputting the result.

In a specific embodiment, the speech energy in step S1 is represented by an RMS value, and the RMS value of the current speech frame is calculated by the following formula:

RMS_cur = sqrt((x₁² + x₂² + … + x_L²) / L)

where x₁, x₂, …, x_L are the L speech samples of the current speech frame. Representing the speech energy by the RMS value quantifies it as a number, which makes it easy to compare different speech signals.

In a particular embodiment, the RMS value of the current speech frame, RMS_cur, is smoothed with the RMS value of the historical speech frame, RMS_(i-1), to obtain the final speech energy RMS_i of the current speech frame. The specific calculation formula is: RMS_i = α·RMS_cur + (1-α)·RMS_(i-1), where α represents a smoothing factor and i is the sequence number of the current speech frame. Smoothing the current speech frame with the historical frame yields a relatively accurate speech energy and improves the accuracy of the grouping based on the speech signals.
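As a purely illustrative calculation of the smoothing step, assume a smoothing factor α = 0.3, a current-frame value RMS_cur = 1000 and a previous smoothed value RMS_(i-1) = 400. None of these numbers come from the text; they are chosen only to show the arithmetic.

```latex
\begin{aligned}
\mathrm{RMS}_i &= \alpha\,\mathrm{RMS}_{cur} + (1-\alpha)\,\mathrm{RMS}_{i-1} \\
               &= 0.3 \times 1000 + 0.7 \times 400 \\
               &= 300 + 280 = 580
\end{aligned}
```

With a small α the smoothed energy moves only part of the way toward a sudden jump in the current frame, which is why a short burst has a limited effect on the grouping.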

In a preferred embodiment, the first threshold value in step S2 is selected from the range of 50-80 dB. The voice signals are effectively divided into two sets by setting the first threshold value, so that final mixed sound output is facilitated.

In a specific embodiment, step S3 further includes:

in response to the difference between the maximum value of the speech energy in the second set and the minimum value of the speech energy in the first set being smaller than a second threshold value and the maximum value of the speech energy in the second set being smaller than the minimum value of the speech energy in the first set, keeping the members of the first set and the second set unchanged and updating the speech energy of each member of the first set and the second set;

if the number of consecutive frames meeting a condition is larger than a preset frame number threshold value, exchanging the member with the maximum voice energy in the second set with the member with the minimum voice energy in the first set, wherein the condition is that the difference between the maximum value of the voice energy in the second set and the minimum value of the voice energy in the first set is smaller than the second threshold value, and the maximum value of the voice energy in the second set is larger than the minimum value of the voice energy in the first set; and if the number of consecutive frames meeting the condition is smaller than the preset frame number threshold value, keeping the members of the first set and the second set unchanged and updating the voice energy of each member in the first set and the second set.

In a preferred embodiment, the second threshold value is selected from the range of 3-6 dB. With the second threshold value set, members whose energy difference is sufficiently large can be exchanged directly, which ensures the validity of the output voice signal.

In a preferred embodiment, the frame number threshold is selected from the range of 4-6 frames. Evaluating the voice energy over a number of consecutive voice frames ensures that the sets are updated only for genuine voice signals and avoids interference from noise.

In a specific embodiment, step S4 specifically includes: for each voice output end in the updated first set, outputting the mix of the voice signals in the updated first set other than that member's own voice signal; and for each voice output end in the updated second set, outputting the mix of all the voice signals in the updated first set. Outputting a mixed signal appropriate to each set ensures the validity and accuracy of the output voice signals.

According to a second aspect of the present invention, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a computer processor, is adapted to carry out the above-mentioned method.

According to a third aspect of the present invention, there is provided a mixing system for a conference system, the system comprising:

an energy calculation module: configured to calculate a speech energy of the speech signal at the speech input of each member in response to detecting the speech signal of each member;

a grouping module: configured to divide the members into a first set and a second set based on the voice energy and a preset first threshold value, wherein members whose voice energy is greater than or equal to the first threshold value are placed in the first set, and members whose voice energy is less than the first threshold value are placed in the second set;

a dynamic update module: configured to update the first set and the second set by swapping members of the maximum value of speech energy in the second set with members of the minimum value of speech energy in the first set in response to the difference between the maximum value of speech energy in the second set and the minimum value of speech energy in the first set being greater than a preset second threshold value;

a sound mixing output module: configured to mix the voice signals in the updated first set and output the result.

The method calculates the voice energy of the voice signals, divides the members into a first set and a second set according to a first threshold value, dynamically compares the maximum voice energy in the second set with the minimum voice energy in the first set to update the two sets, and mixes the voice signals in the first set, which has the larger voice energy, for output to the members of the corresponding set. Using threshold values to form a reasonable set of voice signals and updating that set dynamically ensures the validity and accuracy of the voice set; mixing only this limited number of audio signals prevents excessive noise from degrading audio quality.

Drawings

The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a flowchart of a mixing method for a conference system according to an embodiment of the present application;

FIG. 2 is a grouping diagram of the initial stage of a particular embodiment of the present application;

FIG. 3 is a diagram illustrating dynamic grouping during a call in accordance with an exemplary embodiment of the present application;

FIG. 4 is a schematic diagram of mixed-sound speech output according to an embodiment of the present application;

FIG. 5 is a block diagram of a mixing system for a conference system according to an embodiment of the present application;

FIG. 6 is a schematic illustration of conference system speech output for a particular embodiment of the present application;

FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

FIG. 1 illustrates a flowchart of a mixing method for a conference system according to an embodiment of the present application. As shown in FIG. 1, the method comprises the steps of:

S101: In response to detecting the voice signal of each member, the voice energy of the voice signal at the voice input end of each member is calculated. The voice signal at the voice input end is divided into frames and the voice energy of each frame is calculated. Representing the voice energy by a root mean square value quantifies the energy of the voice signal and makes it easy to compare the voice energy of different signals.

In a specific embodiment, the speech energy is represented by an RMS (root mean square) value, and the RMS value of the current speech frame is calculated by the following formula:

RMS_cur = sqrt((x₁² + x₂² + … + x_L²) / L)

where x₁, x₂, …, x_L are the L speech samples of the current speech frame. The RMS value of the current speech frame is smoothed with that of the historical frame to obtain the final speech energy, and the specific calculation formula is: RMS_i = α·RMS_cur + (1-α)·RMS_(i-1), where α represents a smoothing factor and i is the sequence number of the current speech frame. Smoothing the current speech frame with the historical frame yields a relatively accurate speech energy and improves the accuracy of the grouping based on the speech signals.
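The energy calculation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default α = 0.5, and the choice to return RMS_cur unchanged for the first frame (when no history exists) are all assumptions, since the text does not specify them.

```python
import math
from typing import Optional, Sequence


def frame_rms(samples: Sequence[float]) -> float:
    """Root mean square of the L samples in one speech frame."""
    if not samples:
        raise ValueError("a frame must contain at least one sample")
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def smoothed_rms(rms_cur: float, rms_prev: Optional[float], alpha: float = 0.5) -> float:
    """RMS_i = alpha * RMS_cur + (1 - alpha) * RMS_(i-1).

    Assumption: when there is no previous frame (rms_prev is None),
    the current value is used as-is.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rms_prev is None:
        return rms_cur
    return alpha * rms_cur + (1.0 - alpha) * rms_prev


# Track the smoothed energy of one member over successive frames.
frames = [[3.0, -4.0, 3.0, -4.0], [0.0, 0.0, 0.0, 0.0]]
energy: Optional[float] = None
history = []
for frame in frames:
    energy = smoothed_rms(frame_rms(frame), energy, alpha=0.5)
    history.append(energy)
```

In this example the second frame is silent, yet the smoothed energy falls only halfway toward zero, so a single quiet frame does not immediately change a member's ranking.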

S102: The members are divided into a first set and a second set based on the voice energy and a preset first threshold value, wherein members whose voice energy is greater than or equal to the first threshold value are placed in the first set, and members whose voice energy is less than the first threshold value are placed in the second set.

In a preferred embodiment, the first threshold value is selected from the range of 50-80 dB. Setting the first threshold value effectively divides the voice signals into two sets, which facilitates the final mixed output. It should be appreciated that the first threshold value may also be selected outside the range of 50-80 dB, depending on the application scenario of the actual conference system, so that a meaningful set of voice signals is selected for mixing.
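The text gives the first threshold in dB while the speech energy is computed as an RMS value, and it does not state how the two are related. One common convention, assumed here purely for illustration, is 20·log10 of the RMS amplitude relative to a reference amplitude. The function names, the `reference` parameter and the 60 dB default are hypothetical.

```python
import math


def rms_to_db(rms: float, reference: float = 1.0) -> float:
    """Convert an RMS amplitude to decibels relative to `reference`.

    Assumption: dB = 20 * log10(rms / reference). The source does not
    define its dB scale; `reference` is a hypothetical parameter.
    """
    if reference <= 0.0:
        raise ValueError("reference must be positive")
    if rms <= 0.0:
        return float("-inf")  # digital silence has no finite dB value
    return 20.0 * math.log10(rms / reference)


def meets_first_threshold(rms: float, first_threshold_db: float = 60.0) -> bool:
    """True if the member would fall into the first set for this frame."""
    return rms_to_db(rms) >= first_threshold_db


loud = meets_first_threshold(2000.0)   # about 66 dB under this convention
quiet = meets_first_threshold(100.0)   # about 40 dB under this convention
```

Under this assumed convention, an RMS of 1000 corresponds to about 60 dB, which falls inside the 50-80 dB range mentioned above.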

In a specific embodiment, FIG. 2 shows a grouping diagram of the initial stage. The grouping method is as follows:

Step 201: calculate the voice energy of each member. The voice energy of each member is calculated by the RMS calculation described above.

Step 202: sort all members according to voice energy. The members are sorted based on the voice energy calculated in step 201 to obtain the ranking of the members by voice energy.

Step 203: the X channels with larger energy are assigned to set A, and the Y channels with smaller energy are assigned to set B. That is, based on the preset first threshold value, the members are divided into set A and set B: the X channels whose voice energy is greater than or equal to the first threshold value form set A, and the Y channels whose voice energy is less than the first threshold value form set B.
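Steps 201 to 203 can be sketched as below. This is a minimal sketch under stated assumptions: the member names, the energy values and the 55 dB threshold are made up, and the energies are assumed to be already expressed on the same scale as the threshold.

```python
from typing import Dict, List, Tuple


def initial_grouping(
    energies: Dict[str, float], first_threshold: float
) -> Tuple[List[str], List[str]]:
    """Split members into set A (energy >= threshold) and set B (energy < threshold).

    Both sets are returned sorted by descending energy (step 202).
    """
    ranked = sorted(energies, key=lambda member: energies[member], reverse=True)
    set_a = [m for m in ranked if energies[m] >= first_threshold]
    set_b = [m for m in ranked if energies[m] < first_threshold]
    return set_a, set_b


# Hypothetical per-member energies for one frame.
energies = {"alice": 72.0, "bob": 41.5, "carol": 60.0, "dave": 55.0, "erin": 30.0}
set_a, set_b = initial_grouping(energies, first_threshold=55.0)
# set_a holds the X members at or above the threshold, set_b the remaining Y members.
```

Note that a member exactly at the threshold ("dave" here) lands in set A, matching the "greater than or equal to" wording of step S102.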

S103: In response to the difference between the maximum value of the voice energy in the second set and the minimum value of the voice energy in the first set being greater than or equal to a preset second threshold value, the member with the maximum voice energy in the second set is exchanged with the member with the minimum voice energy in the first set, and the first set and the second set are updated. A sufficiently large difference indicates that the strongest voice signal in the second set is far stronger than the weakest voice signal in the first set; exchanging the two members and updating the sets makes the output voice more meaningful.
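The immediate exchange in S103 can be illustrated with the sketch below. It covers only the "difference greater than or equal to the second threshold" branch; the function name, member names and energy values are assumptions made for the example.

```python
from typing import Dict, List


def swap_if_clearly_louder(
    set_a: List[str],
    set_b: List[str],
    energy: Dict[str, float],
    second_threshold: float,
) -> bool:
    """Exchange the weakest member of set A with the strongest member of set B
    when the latter exceeds the former by at least `second_threshold`.

    The lists are modified in place; returns True if an exchange happened.
    """
    if not set_a or not set_b:
        return False
    a_min = min(set_a, key=lambda m: energy[m])
    b_max = max(set_b, key=lambda m: energy[m])
    if energy[b_max] - energy[a_min] >= second_threshold:
        set_a[set_a.index(a_min)] = b_max
        set_b[set_b.index(b_max)] = a_min
        return True
    return False


# Energies have drifted since the initial grouping: "bob" is now 6 dB above "carol".
set_a = ["alice", "carol"]
set_b = ["bob", "dave"]
energy = {"alice": 70.0, "carol": 52.0, "bob": 58.0, "dave": 40.0}
swapped = swap_if_clearly_louder(set_a, set_b, energy, second_threshold=4.0)
```

After the call, "bob" has replaced "carol" in set A, and a second call with the same energies leaves the sets unchanged.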

In a specific embodiment, when the difference between the maximum value of the speech energy in the second set and the minimum value of the speech energy in the first set is smaller than the second threshold value, and the maximum value of the speech energy in the second set is smaller than the minimum value of the speech energy in the first set, the strongest voice signal in the second set is weaker than the weakest voice signal in the first set. The members of the first set and the second set are therefore kept, and the speech energy values in the first set and the second set are updated, to ensure the validity of the output speech signal.

In a specific embodiment, the condition is that the difference between the maximum value of the speech energy in the second set and the minimum value of the speech energy in the first set is smaller than the second threshold value, while the maximum value of the speech energy in the second set is larger than the minimum value of the speech energy in the first set. If the number of consecutive frames meeting this condition is greater than a preset frame number threshold value, the strongest voice signal in the second set has remained stronger than the weakest voice signal in the first set over those consecutive frames, so the member with the maximum speech energy in the second set is exchanged with the member with the minimum speech energy in the first set and the sets are updated. If the number of consecutive frames meeting the condition is smaller than the preset frame number threshold value, the increase in speech energy may be a sudden burst caused by noise, so the members of the first set and the second set are kept and the speech energy values in both sets are updated, which prevents noise-induced jumps in speech energy from degrading the mixed output.

In a preferred embodiment, the second threshold value is selected from the range of 3-6 dB. With the second threshold value set, members whose energy difference is sufficiently large can be exchanged directly, which ensures the validity of the output voice signal. It should be appreciated that the second threshold value may also be selected outside the range of 3-6 dB, depending on the application scenario of the actual conference system, so that the grouping by voice energy is more precise.

In a preferred embodiment, the frame number threshold is selected from the range of 4-6 frames. Evaluating the voice energy over a number of consecutive voice frames ensures that the sets are updated only for genuine voice signals and avoids interference from noise. It should be appreciated that the frame number threshold may also be selected outside the range of 4-6 frames, depending on the application scenario of the actual conference system, so as to eliminate the influence on the grouping of a sudden increase in speech energy caused by noise and to improve the accuracy of the dynamic grouping.

S104: The voice signals in the first set are mixed and output. Mixing only the first set, which has the larger voice energy, improves the mixing effect.

In a specific embodiment, for each voice output end in the first set, outputting a mixed sound signal of other voice signals in the first set except for the own voice signal; and outputting the mixed sound signals of all the voice signals in the first set for each voice output end in the second set.

In a specific embodiment, FIG. 4 shows a schematic diagram of mixed-sound speech output. Set A comprises members A1, A2, …, Ax. For the voice output of member A1, the output voice signal is the mix of A2, A3, …, Ax; for the voice output of member A2, the output voice signal is the mix of A1, A3, …, Ax. That is, the voice output of each member in set A is the mix of the voice signals of all other members in set A, excluding the member's own voice signal. For each member in set B, the output voice signal is the mix of the voice signals of all members in set A, namely A1, A2, …, Ax.

FIG. 3 is a diagram illustrating dynamic grouping during a call according to an embodiment of the present invention. The method specifically comprises the following steps:
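The per-member output rule of FIG. 4 can be sketched as follows. The text does not specify the mixing operation itself, so a plain sample-wise sum (with no clipping or normalisation) is assumed here; the function names and sample values are hypothetical.

```python
from typing import Dict, List, Sequence


def mix(signals: Sequence[Sequence[float]], frame_len: int) -> List[float]:
    """Sample-wise sum of equally long frames (assumed mixing operation)."""
    out = [0.0] * frame_len
    for signal in signals:
        for k in range(frame_len):
            out[k] += signal[k]
    return out


def outputs_for_members(
    frames: Dict[str, List[float]], set_a: List[str], set_b: List[str]
) -> Dict[str, List[float]]:
    """Build the output frame sent to each member.

    Members of set A hear every other member of set A (their own voice excluded);
    members of set B hear all members of set A.
    """
    frame_len = len(next(iter(frames.values())))
    out: Dict[str, List[float]] = {}
    for member in set_a:
        others = [frames[m] for m in set_a if m != member]
        out[member] = mix(others, frame_len)
    full_mix = mix([frames[m] for m in set_a], frame_len)
    for member in set_b:
        out[member] = list(full_mix)
    return out


# Two-sample frames with easily recognisable values.
frames = {"a1": [1.0, 2.0], "a2": [10.0, 20.0], "a3": [100.0, 200.0], "b1": [7.0, 7.0]}
outputs = outputs_for_members(frames, set_a=["a1", "a2", "a3"], set_b=["b1"])
```

The signal of "b1" never appears in any output, since members of set B are not mixed; a production mixer would also need to guard against overflow when summing many channels.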

Step 301: calculate the voice energy of each member. The voice energy of each member is represented by its RMS value, which is obtained by calculation.

Step 302: sort the members in each of the two sets by energy. The grouping rule is determined by the voice energy of each member and the preset threshold value, the members are divided into two sets, and the members within each set are sorted by voice energy.

Step 303: the member with the minimum energy in set A is Amin, with energy EAmin, and the member with the maximum energy in set B is Bmax, with energy EBmax.

Step 304: determine whether EBmax-EAmin > ET, where ET represents the second threshold value, typically set to 3-6 dB.

If EBmax-EAmin > ET, go to step 305: Amin and Bmax are exchanged, Amin_last and Bmax_last are reset to their initial values, and the grouping ends (step 312).

If EBmax-EAmin < ET, go to step 306: determine whether EBmax > EAmin. When EBmax > EAmin, go to step 307: determine whether Amin_last and Bmax_last are the initial values, or whether Amin is Amin_last and Bmax is Bmax_last. If yes, the consecutive frame count is incremented, T = T+1 (step 308), and step 309 checks whether the consecutive frame number T > Th, where Th is the preset consecutive frame number threshold. If T > Th, go to step 305: Amin and Bmax are exchanged, Amin_last and Bmax_last are reset to their initial values, and the grouping ends (step 312). If T > Th is not satisfied in step 309, Amin_last and Bmax_last are retained and the grouping ends (step 312).

If the determination in step 306 (EBmax > EAmin) is negative, or the determination in step 307 is negative, go to step 310: the consecutive frame number T is set to 0; then step 311: Amin_last and Bmax_last are updated, and the grouping ends (step 312).
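The per-frame decision flow of FIG. 3 can be sketched as a small state machine. This follows the steps as worded above and is not a definitive implementation: representing the "initial value" of Amin_last and Bmax_last as `None` and resetting T to 0 after an exchange are assumptions, and all names and energy values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class GroupingState:
    set_a: List[str]
    set_b: List[str]
    a_min_last: Optional[str] = None  # None stands for the "initial value"
    b_max_last: Optional[str] = None
    t: int = 0                        # consecutive frame count T


def update_grouping(state: GroupingState, energy: Dict[str, float], et: float, th: int) -> bool:
    """Run the decision flow for one frame; return True if Amin and Bmax were exchanged."""
    if not state.set_a or not state.set_b:
        return False
    a_min = min(state.set_a, key=lambda m: energy[m])  # step 303
    b_max = max(state.set_b, key=lambda m: energy[m])

    def exchange() -> None:                            # step 305
        state.set_a[state.set_a.index(a_min)] = b_max
        state.set_b[state.set_b.index(b_max)] = a_min
        state.a_min_last = state.b_max_last = None
        state.t = 0  # assumption: the counter restarts after an exchange

    if energy[b_max] - energy[a_min] > et:             # step 304
        exchange()
        return True
    if energy[b_max] > energy[a_min]:                  # step 306
        initial = state.a_min_last is None and state.b_max_last is None
        same_pair = (a_min, b_max) == (state.a_min_last, state.b_max_last)
        if initial or same_pair:                       # step 307
            state.t += 1                               # step 308
            if state.t > th:                           # step 309
                exchange()
                return True
            return False                               # retain Amin_last, Bmax_last
    state.t = 0                                        # step 310
    state.a_min_last, state.b_max_last = a_min, b_max  # step 311
    return False


# "bob" stays 2 dB above "alice": below ET = 4 dB, so only persistence triggers the exchange.
state = GroupingState(set_a=["alice"], set_b=["bob"])
energy = {"alice": 55.0, "bob": 57.0}
results = [update_grouping(state, energy, et=4.0, th=4) for _ in range(5)]
# With Th = 4 the exchange happens on the 5th consecutive frame (T = 5 > 4).
```

The counter gives the grouping hysteresis: a member who is only slightly louder must stay louder for more than Th frames before displacing a member of set A, whereas a member who is louder by more than ET is exchanged immediately.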

FIG. 5 is a block diagram of a mixing system for a conference system according to a specific embodiment of the present invention, which includes an energy calculation module 501, a grouping module 502, a dynamic update module 503, and a mixing output module 504 connected in sequence. The energy calculation module 501 is configured to calculate the voice energy of the voice signal at the voice input end of each member in response to detecting the voice signal of each member; the grouping module 502 is configured to divide the members into a first set and a second set based on the voice energy and a preset first threshold value, wherein members whose voice energy is greater than or equal to the first threshold value are placed in the first set, and members whose voice energy is less than the first threshold value are placed in the second set; the dynamic update module 503 is configured to, in response to the difference between the maximum value of the speech energy in the second set and the minimum value of the speech energy in the first set being greater than a preset second threshold value, exchange the member with the maximum speech energy in the second set with the member with the minimum speech energy in the first set, and update the first set and the second set; the mixing output module 504 is configured to mix the voice signals in the first set and output the result.

FIG. 6 is a schematic diagram of conference system voice output according to an embodiment of the present application. The voice energy of each of members 1 to N is obtained by the energy calculation module, and the members are grouped and sorted. The dynamic grouping module then ensures that the set of audio signals with larger energy is the one that is mixed, and finally the mixing module produces a separate output for each of members 1 to N, so that only meaningful audio signals are mixed.

In a specific embodiment, audio signals transmitted from different conference terminals need to be selected according to their content. The content of a received audio signal can be classified as voice or noise. Ideally, a mixing algorithm mixes only meaningful voice signals and no noise, but in practical applications this ideal is rarely achieved. Different scenarios are analyzed below.

In a specific embodiment, the strategies for mixing different conference scenes are as follows:

Scene 1: no one speaks in the conference, and the received signals are all background noise. VAD judges the signals to be noise, and the current grouping is kept unchanged;

Scene 2: only one person speaks at any given time in the conference, possibly with several people speaking in turn. Only the speaker is mixed, which prevents other interfering sound sources from being mixed in;

Scene 3: A speaks in the conference, then B interjects, A and B speak simultaneously, and after a period of time only B speaks. Both the speaker and the interjector are mixed, to ensure the validity and accuracy of the voice set;

Scene 4: A speaks in the conference, B interjects, B stops speaking after a period of time, and A continues speaking. Both the speaker and the interjector are mixed, to ensure the validity and accuracy of the voice set;

Scene 5: many people speak at the same time in the conference, and after a period of time only a few people are speaking. A limited number of audio signals, those of the speakers with the largest energy, are selected for mixing, so that excessive noise does not degrade audio quality;

Scene 6: several people in the conference take turns to speak briefly, while the others do not speak. Only the speakers are selected for mixing and the other members are not, so that excessive noise does not degrade audio quality.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in FIG. 7 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.

As shown in FIG. 7, the computer system 700 includes a Central Processing Unit (CPU) 701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU) 701, performs the above-described functions defined in the method of the present application. It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In this application, a computer readable signal medium may include a propagated data signal or a voice signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present application may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described as follows: a processor includes an energy calculation module, a grouping module, a dynamic update module, and a mixing output module. The names of these modules do not in any case constitute a limitation on the modules themselves.

As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist separately without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: in response to detecting the voice signals, calculate the voice energy of the voice signal at each voice input end; divide the voice signals into a first set and a second set based on the voice energy and a preset first threshold value, wherein voice signals whose voice energy is greater than or equal to the first threshold value are collected into the first set, and voice signals whose voice energy is less than the first threshold value are collected into the second set; in response to the difference between the maximum voice energy in the second set and the minimum voice energy in the first set being greater than or equal to a preset second threshold value, exchange the voice signal having the maximum voice energy in the second set with the voice signal having the minimum voice energy in the first set, and update the first set and the second set; and mix and synthesize the voice signals in the first set and output the result.
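As a non-limiting illustration, the grouping and set-update operations described above may be sketched as follows. The function names `partition` and `update_sets`, the member identifiers, and the numeric values are hypothetical and are not taken from the application; the sketch further assumes that the initial grouping is performed once and that the update is then evaluated against the energies of a later frame.

```python
from typing import Dict, Set, Tuple


def partition(energy: Dict[str, float],
              first_threshold: float) -> Tuple[Set[str], Set[str]]:
    """Initial grouping: energy >= threshold goes to the first set."""
    first = {m for m, e in energy.items() if e >= first_threshold}
    second = set(energy) - first
    return first, second


def update_sets(energy: Dict[str, float],
                first: Set[str],
                second: Set[str],
                second_threshold: float) -> Tuple[Set[str], Set[str]]:
    """Exchange the loudest member of the second set with the quietest
    member of the first set when the energy difference reaches the
    second threshold. Returns new sets; the inputs are not modified."""
    if not first or not second:
        return set(first), set(second)  # nothing to exchange
    loudest_second = max(second, key=lambda m: energy[m])
    quietest_first = min(first, key=lambda m: energy[m])
    if energy[loudest_second] - energy[quietest_first] >= second_threshold:
        first = (first - {quietest_first}) | {loudest_second}
        second = (second - {loudest_second}) | {quietest_first}
    return set(first), set(second)


# Hypothetical energies in dB for three members over two frames.
first, second = partition({"a": 70.0, "b": 65.0, "c": 40.0}, 60.0)
first, second = update_sets({"a": 70.0, "b": 50.0, "c": 58.0},
                            first, second, 4.0)
```

In this example member "c" becomes louder than member "b" by 8 dB in the later frame, which exceeds the assumed 4 dB second threshold, so the two members trade places.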

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. A mixing method for a conference system, comprising the steps of:

S1: in response to detecting the voice signal of each member, calculating the voice energy of the voice signal at the voice input end of each member;

S2: dividing all members into a first set and a second set based on the voice energy and a preset first threshold value, wherein the members whose voice energy is greater than or equal to the first threshold value are collected into the first set, and the members whose voice energy is less than the first threshold value are collected into the second set;

S3: in response to the dynamically analyzed difference between the maximum voice energy in the second set and the minimum voice energy in the first set being greater than or equal to a preset second threshold value, exchanging the member having the maximum voice energy in the second set with the member having the minimum voice energy in the first set, so as to update the first set and the second set, wherein the difference between the maximum voice energy in the second set and the minimum voice energy in the first set being greater than or equal to the preset second threshold value means that the maximum voice energy in the second set is far greater than the minimum voice energy in the first set;

S4: mixing and synthesizing the voice signals in the updated first set and outputting the result;

wherein the voice energy is the final voice energy RMS_i of the current speech frame, obtained by smoothing the RMS value RMS_cur of the current speech frame relative to the RMS value RMS_{i-1} of the historical speech frame, the specific calculation formula being: RMS_i = α·RMS_cur + (1 - α)·RMS_{i-1}, where α represents a smoothing factor and i is the sequence number of the current speech frame.
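As a non-limiting illustration, the smoothing recurrence recited above may be sketched as follows. The function name `smooth_energies` and the sample values are hypothetical; seeding the recurrence with the first frame's own RMS value is an assumption, since the claim does not state how the first frame is handled.

```python
from typing import Iterable, List, Optional


def smooth_energies(frame_rms: Iterable[float],
                    alpha: float,
                    initial: Optional[float] = None) -> List[float]:
    """Apply RMS_i = alpha * RMS_cur + (1 - alpha) * RMS_{i-1} per frame.

    If no initial value is given, the first frame's RMS is used unsmoothed
    (an assumption made for this sketch).
    """
    smoothed: List[float] = []
    prev = initial
    for rms_cur in frame_rms:
        if prev is None:
            prev = rms_cur
        else:
            prev = alpha * rms_cur + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


# Hypothetical per-frame RMS values with a smoothing factor of 0.5.
values = smooth_energies([40.0, 60.0, 60.0], alpha=0.5)
```

A larger α makes the smoothed energy follow the current frame more closely, while a smaller α gives more weight to the history and damps frame-to-frame fluctuations.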

2. The mixing method for a conference system according to claim 1, characterized in that the RMS value RMS_cur of the current speech frame is calculated by the following formula:

wherein x_1, x_2, ..., x_L denote the L speech data samples contained in the current speech frame.
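For reference, the conventional root-mean-square of the L samples of a frame is given below. This is the textbook definition and is offered only as an assumption: the formula of the claim is not reproduced in this text, and, because the thresholds of claims 3 and 5 are expressed in dB, the claimed formula may additionally convert the value to a logarithmic scale.

```latex
\mathrm{RMS}_{cur} = \sqrt{\frac{1}{L}\sum_{l=1}^{L} x_l^{2}}
```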

3. The mixing method for a conference system as claimed in claim 1, wherein the first threshold value in the step S2 is selected from a range of 50-80 dB.

4. The mixing method for a conference system according to claim 1, wherein the step S3 further comprises:

in response to the dynamically analyzed difference between the maximum voice energy in the second set and the minimum voice energy in the first set being less than the second threshold value, and the maximum voice energy in the second set being less than the minimum voice energy in the first set, keeping the members of the first set and the second set unchanged;

if the number of consecutive frames satisfying a condition is greater than a preset frame number threshold value, exchanging the member having the maximum voice energy in the second set with the member having the minimum voice energy in the first set, wherein the condition is that the dynamically analyzed difference between the maximum voice energy in the second set and the minimum voice energy in the first set is less than the second threshold value, and the maximum voice energy in the second set is greater than the minimum voice energy in the first set;

and if the number of consecutive frames satisfying the condition is less than the preset frame number threshold value, keeping the members of the first set and the second set unchanged.
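As a non-limiting illustration, the per-frame exchange decision of claims 1 and 4 may be sketched as follows. The class name `SwapDecider` and its fields are hypothetical. Two behaviors are assumptions of this sketch rather than features of the claims: the consecutive-frame counter is reset after an exchange, and the case in which the two energies are exactly equal is treated as "keep unchanged".

```python
from dataclasses import dataclass


@dataclass
class SwapDecider:
    second_threshold: float  # e.g. a value in the 3-6 dB range of claim 5
    frame_threshold: int     # e.g. a value in the 4-6 frame range of claim 6
    run_length: int = 0      # consecutive frames satisfying the condition

    def should_swap(self, max_second: float, min_first: float) -> bool:
        """Return True when the two members should be exchanged this frame."""
        diff = max_second - min_first
        if diff >= self.second_threshold:
            # Second-set member is far louder: exchange immediately.
            self.run_length = 0
            return True
        if diff > 0:
            # Louder, but by less than the threshold: count the frame.
            self.run_length += 1
            if self.run_length > self.frame_threshold:
                self.run_length = 0  # assumption: reset after an exchange
                return True
            return False
        # Second-set member is not louder: keep the sets and reset the count.
        self.run_length = 0
        return False


decider = SwapDecider(second_threshold=4.0, frame_threshold=4)
decisions = [decider.should_swap(52.0, 50.0) for _ in range(5)]
```

With these hypothetical values the member of the second set stays 2 dB louder for five consecutive frames, so the exchange is triggered on the fifth frame, once the count exceeds the frame number threshold of 4.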

5. The mixing method for a conference system according to claim 4, wherein the second threshold value is selected from a range of 3-6 dB.

6. The mixing method for a conference system according to claim 4, wherein the frame number threshold value is selected from a range of 4-6 frames.

7. The mixing method for a conference system according to claim 1, wherein the step S4 specifically includes: for each voice output end in the updated first set, outputting a mix of the voice signals in the updated first set other than that output end's own voice signal; and for each voice output end in the updated second set, outputting a mix of all the voice signals in the updated first set.
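As a non-limiting illustration, the per-output mixing of claim 7 may be sketched as follows. The function name `mix_outputs` and the sample data are hypothetical, and mixing is assumed here to be a plain sample-wise sum without gain control or clipping, which the claim does not specify.

```python
from typing import Dict, List, Set


def mix_outputs(frames: Dict[str, List[float]],
                first: Set[str]) -> Dict[str, List[float]]:
    """Build one output frame per member from the first-set signals.

    Members of the first set receive the sum of the other first-set
    signals; members of the second set receive the sum of all first-set
    signals. All input frames are assumed to have the same length.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    outputs: Dict[str, List[float]] = {}
    for member in frames:
        # Excluding `member` only has an effect when it is in the first set.
        sources = [m for m in first if m != member]
        outputs[member] = [sum(frames[m][i] for m in sources)
                           for i in range(length)]
    return outputs


# Hypothetical two-sample frames; "a" and "b" are in the first set.
out = mix_outputs({"a": [1.0, 2.0], "b": [0.5, -1.0], "c": [4.0, 4.0]},
                  first={"a", "b"})
```

Here member "a" hears only "b", member "b" hears only "a", and member "c", being in the second set, hears the sum of "a" and "b" while its own signal is left out of every mix.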

8. A computer-readable storage medium having one or more computer programs stored thereon, which when executed by a computer processor perform the method of any one of claims 1 to 7.

9. A mixing system for a conferencing system, the system comprising:

an energy calculation module: configured to calculate, in response to detecting the voice signal of each member, the voice energy of the voice signal at the voice input end of each member;

a grouping module: configured to divide all members into a first set and a second set based on the voice energy and a preset first threshold value, wherein the members whose voice energy is greater than or equal to the first threshold value are collected into the first set, and the members whose voice energy is less than the first threshold value are collected into the second set;

a dynamic update module: configured to dynamically analyze the voice energy and, in response to the difference between the maximum voice energy in the second set and the minimum voice energy in the first set being greater than or equal to a preset second threshold value, exchange the member having the maximum voice energy in the second set with the member having the minimum voice energy in the first set, thereby updating the first set and the second set, wherein the difference between the maximum voice energy in the second set and the minimum voice energy in the first set being greater than or equal to the preset second threshold value indicates that the maximum voice energy in the second set is much greater than the minimum voice energy in the first set;

a mixing output module: configured to mix and synthesize the voice signals in the updated first set and output the result;

wherein the voice energy is the final voice energy RMS_i of the current speech frame, obtained by smoothing the RMS value RMS_cur of the current speech frame relative to the RMS value RMS_{i-1} of the historical speech frame, the specific calculation formula being: RMS_i = α·RMS_cur + (1 - α)·RMS_{i-1}, where α represents a smoothing factor and i is the sequence number of the current speech frame.