CN111753769A - Terminal audio acquisition control method, electronic equipment and readable storage medium

Info

Publication number: CN111753769A
Application number: CN202010605095.5A
Authority: CN
Inventors: 陈强
Original assignee: Goertek Techology Co Ltd
Current assignee: Goertek Techology Co Ltd
Priority date: 2020-06-29
Filing date: 2020-06-29
Publication date: 2020-10-09

Abstract

The invention discloses a terminal audio acquisition control method, which comprises the steps of firstly judging whether a user speaks or not based on a face image of the user of a terminal, if so, sending a first current audio acquired by the terminal to a server, otherwise, not sending the first current audio acquired by the terminal to the server. The invention also discloses an electronic device and a readable storage medium, which have the same beneficial effects as the terminal audio acquisition control method.

Description

Terminal audio acquisition control method, electronic equipment and readable storage medium

Technical Field

The invention relates to the technical field of audio conferences, in particular to a terminal audio acquisition control method, electronic equipment and a readable storage medium.

Background

With the marked rise in demand for remote work, it is increasingly common for people to hold remote conferences through terminals such as mobile phones and sound boxes. However, when several terminals are used at the same time within a small area, their microphones are all turned on, and each microphone picks up the sound played by the speakers of the other terminals. This degrades or even defeats the echo suppression of the terminal, which may in turn cause loud noise or even howling. The prior-art remedy is for anyone who is not speaking to mute the microphone of their terminal manually. This reduces noise and howling to some extent, but users often forget to unmute the microphone when they speak again, so that others cannot hear them, and the result is poor.

Disclosure of Invention

The invention aims to provide a terminal audio acquisition control method, electronic equipment and a readable storage medium, which reduce the noise or squeal of a terminal, realize the automatic control of voice acquisition and avoid the situation that a user speaks but is not heard by others.

In order to solve the technical problem, the invention provides a terminal audio acquisition control method, which is applied to a terminal and comprises the following steps:

acquiring a face image of a user of the terminal;

judging whether the user speaks or not based on the face image;

if so, sending a first current audio acquired by the terminal to the server;

and if not, not sending the first current audio acquired by the terminal to the server.

Preferably, the determining whether the user speaks based on the face image includes:

identifying lips from the face image;

and judging whether the lips move multiple times within a first preset duration; if so, determining that the user is speaking, and otherwise determining that the user is not speaking.

Preferably, when acquiring the face image of the user of the terminal, the method further includes:

acquiring a second current audio acquired by the terminal, and storing the second current audio into a memory;

after determining that the user is speaking, the method further comprises:

acquiring from the memory the second current audio covering a second preset duration before the current time;

sending the first current audio collected by the terminal to the server, including:

and transmitting the second current audio and the first current audio which are acquired from the memory to the server.

Preferably, before acquiring the face image of the user of the terminal, the method further includes:

the server receives, from a conference system, the gateway MAC and the IP of each terminal connected to the conference system, each IP corresponding one-to-one to a terminal;

if there are terminals in the conference system whose gateway MAC is the same as that of other terminals, assigning the terminals with the same gateway MAC to a conference group;

and sending a dense mode instruction to the terminals in the conference group based on the IP, so that each terminal proceeds to the step of acquiring the face image of its user after receiving the dense mode instruction.

Preferably, the method further comprises the following steps:

and if there is a terminal whose gateway MAC differs from the gateway MACs of the other terminals in the conference system, sending a loose mode instruction to that terminal based on the IP, so that, after receiving the loose mode instruction, the terminal transmits the user's audio directly to the server when it acquires that audio.

Preferably, after assigning the terminals with the same gateway MAC to a conference group, the method further includes:

the server sorts all the terminals in the conference group based on the IP;

controlling each terminal in the conference group to send a detection audio signal in sequence;

receiving the detection audio signals that each terminal has received from the other terminals in the conference group;

determining the distance between the terminals based on the strength of the detection audio signals;

judging whether there is a terminal whose minimum distance to the other terminals in the conference group is greater than a distance threshold;

if so, removing that terminal from the conference group;

and sending a loose mode instruction based on the IP to the terminal removed from the conference group, so that, after receiving the loose mode instruction, the removed terminal transmits the user's audio directly to the server when it acquires that audio.

Preferably, the method further comprises the following steps:

and when receiving audio sent by a plurality of terminals in the dense mode in the conference group, the server determines a speaking terminal according to a preset speaking priority principle, controls the speaking terminal to send subsequent audio, and controls the other terminals in the conference group not to send subsequent audio.

Preferably, the method further comprises the following steps:

and controlling a display device to perform corresponding display according to the state of the terminal.

In order to solve the above technical problem, the present invention further provides an electronic device, including:

a memory for storing a computer program;

and the processor is used for realizing the steps of the terminal audio acquisition control method when the computer program is executed.

In order to solve the technical problem, the present invention further provides a readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the terminal audio collection control method are implemented.

The invention provides a terminal audio acquisition control method, which comprises the steps of firstly judging whether a user speaks or not based on a face image of the user of a terminal, if so, sending a first current audio acquired by the terminal to a server, otherwise, not sending the first current audio acquired by the terminal to the server.

The invention also provides electronic equipment and a readable storage medium, which have the same beneficial effects as the terminal audio acquisition control method.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the prior art and the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a process flow chart of a terminal audio acquisition control method according to the present invention;

Fig. 2 is a schematic diagram of a relationship between a plurality of terminals and a server according to the present invention;

Fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

The core of the invention is to provide a terminal audio acquisition control method, an electronic device and a readable storage medium, which reduce the noise or squeal of a terminal, realize the automatic control of voice acquisition and avoid the situation that a user speaks but is not heard by others.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to Fig. 1, Fig. 1 is a process flow chart of a terminal audio acquisition control method according to the present invention.

The method is applied to the terminal and comprises the following steps:

S11: acquiring a face image of a user of a terminal;

S12: judging whether the user speaks based on the face image, if so, entering S13, and otherwise, entering S14;

S13: sending a first current audio acquired by a terminal to a server;

S14: not sending the first current audio collected by the terminal to the server.

In the prior art, to reduce noise and howling, a user manually turns off the microphone of his or her terminal when not speaking, but often forgets to turn it back on when speaking later, so that others cannot hear the user. To solve this technical problem, the application provides a terminal audio acquisition control method in which the user's audio is sent to the server only when the user speaks and is otherwise not sent, so that audio picked up by the microphone of a terminal whose user is not speaking (for example, sound emitted by the speakers of other terminals, or environmental noise) does not reach the server.

It should be noted that the terminal audio acquisition control method provided by the application is particularly suitable for a scene in which a plurality of terminals work simultaneously in a small range.

Specifically, a face image corresponds uniquely to the user of the terminal, and certain features of the user's face differ between speaking and not speaking. Therefore, to detect accurately whether the user is speaking, the method first acquires a face image of the user of the terminal and then judges, based on the face image, whether the user is speaking. If the user is judged to be speaking, the first current audio collected by the sound collection device of the terminal is sent to the server, so that the server forwards it to the other terminals of the conference system to which the terminal is connected, where it is played. If the user is judged not to be speaking, the first current audio collected by the sound collection device is not sent to the server, and therefore does not reach the other terminals of the conference system. The sound collection device here may be a microphone. In this way, whether the terminal user's voice is uploaded is controlled automatically, and the user does not need to turn the microphone off or on manually.

In practical application, the face of a user can be shot through a camera of the terminal to obtain a face video of the user, then face images are periodically obtained from the face video, and the face images obtained in two adjacent periods are compared to judge whether the user speaks.
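
The comparison of face images from two adjacent periods can be sketched as follows. The text does not say how the images are compared, so the mean absolute pixel difference over a cropped mouth region, the function names, and the threshold value are all assumptions made for illustration only.

```python
from typing import Sequence

# Assumption: each frame is a cropped grayscale mouth region, given as rows
# of pixel values in the range 0-255. The crop itself is outside this sketch.
Frame = Sequence[Sequence[int]]


def mean_abs_diff(prev: Frame, curr: Frame) -> float:
    """Average absolute pixel difference between two equally sized frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def frames_differ(prev: Frame, curr: Frame, threshold: float = 8.0) -> bool:
    """Treat the lips as having moved when adjacent frames differ enough.

    The threshold is a hypothetical tuning value, not taken from the text.
    """
    return mean_abs_diff(prev, curr) > threshold
```

A result of `True` for a pair of adjacent frames would count as one lip movement in whatever speaking test is applied afterwards.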

In summary, according to the terminal audio collection control method provided by the invention, only when it is determined that the user speaks, the first current audio collected by the terminal is sent to the server, otherwise, the first current audio collected by the terminal is not sent to the server.

On the basis of the above-described embodiment:

as a preferred embodiment, the determining whether the user is speaking based on the face image includes:

identifying lips from the face image;

and judging whether the lips move multiple times within a first preset duration; if so, determining that the user is speaking, and otherwise determining that the user is not speaking.

Specifically, in this embodiment, after the face image of the user is obtained, the lips are identified from the face image and it is judged whether the lips move multiple times within a first preset duration. If so, the user is speaking, and the first current audio collected by the terminal is sent to the server; otherwise, the user is not speaking, and the first current audio is not sent. Whether the user of the terminal is speaking can thus be judged simply and accurately.

In addition, since S11-S14 are repeated periodically while the terminal is operating, once the user has been determined to be speaking, the user is determined to have finished speaking when the lips have not moved for a third preset duration. The third preset duration may be, for example, 1 s.

The first preset duration may be, but is not limited to, 500 ms, and "multiple times" may be, but is not limited to, 3 times. Other ways of determining whether the user of the terminal is speaking may of course be used, and the present application is not limited in this respect.
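
The two timing rules above (several lip movements within the first preset duration to start, no movement for the third preset duration to stop) can be sketched as a small state machine. The class name, the `update` interface, and the boolean `lips_moved` input are hypothetical; only the default values 500 ms, 3 times and 1 s come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LipActivityDetector:
    """Tracks lip movements over time and reports whether the user is speaking."""
    window_s: float = 0.5      # first preset duration (500 ms in the text)
    min_movements: int = 3     # "multiple times" (3 in the text)
    silence_s: float = 1.0     # third preset duration (1 s in the text)
    speaking: bool = False
    _movements: List[float] = field(default_factory=list)

    def update(self, now_s: float, lips_moved: bool) -> bool:
        """Feed one periodic observation and return the current state."""
        if lips_moved:
            self._movements.append(now_s)
        # Drop movements too old to matter for either rule.
        horizon = max(self.window_s, self.silence_s)
        self._movements = [t for t in self._movements if now_s - t <= horizon]
        recent = [t for t in self._movements if now_s - t <= self.window_s]
        if len(recent) >= self.min_movements:
            self.speaking = True
        elif self.speaking:
            last = self._movements[-1] if self._movements else None
            # End of speech: no lip movement for the silence duration.
            if last is None or now_s - last >= self.silence_s:
                self.speaking = False
        return self.speaking
```

The start rule and the stop rule use different durations on purpose, so that a short pause between words does not immediately cut off the upload.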

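As a preferred embodiment, when acquiring the face image of the user of the terminal, the method further includes:

acquiring a second current audio collected by the terminal, and storing the second current audio in a memory;

after determining that the user is speaking, the method further includes:

acquiring from the memory the second current audio covering a second preset duration before the current time;

sending the first current audio collected by the terminal to the server includes:

transmitting the second current audio acquired from the memory and the first current audio to the server.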

The audio of the user's speech is not transmitted to the server while the terminal is still determining whether the user is speaking. Although this period is short, users of other terminals may still miss some audio. To solve this problem, in the present application the microphone of the terminal stays on for as long as the terminal is in use, that is, the terminal collects audio continuously, and whether the collected audio is uploaded to the server depends on whether the user is speaking.

Specifically, while the face image of the user of the terminal is being acquired, the second current audio collected by the terminal is also acquired and stored in the memory. Later, when the user is determined to be speaking, the second current audio covering a second preset duration before the current time is read from the memory and sent to the server together with the first current audio, so that the other terminals hear the user's complete audio. The second preset duration may be, but is not limited to, 1 s, and the memory may be, but is not limited to, a cache.

With this approach, once the user is determined to be no longer speaking, the terminal returns to the state in which the microphone records and the audio is saved to the memory but is not sent to the server.

Therefore, the user accessing other terminals in the conference system can hear the complete audio of the terminal by the mode, so that the sound loss is avoided, and the reliability of the voice communication is improved.
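
A minimal sketch of this buffering scheme, assuming audio arrives in fixed-length chunks: a bounded queue keeps roughly the last second of unsent audio, and the queue is flushed ahead of the live audio when speaking is first detected. The class name, the chunk length and the `process` interface are hypothetical.

```python
from collections import deque
from typing import Deque, List


class PreRollGate:
    """Holds back recent audio and releases it once the user starts speaking."""

    def __init__(self, chunk_ms: int = 20, preroll_ms: int = 1000) -> None:
        # The bounded deque keeps only the newest preroll_ms of audio
        # (the "second preset duration", 1 s in the text).
        self._buffer: Deque[bytes] = deque(maxlen=preroll_ms // chunk_ms)
        self._was_speaking = False

    def process(self, chunk: bytes, speaking: bool) -> List[bytes]:
        """Return the chunks to upload for this step (possibly none)."""
        out: List[bytes] = []
        if speaking:
            if not self._was_speaking:
                out.extend(self._buffer)   # buffered "second current audio"
                self._buffer.clear()
            out.append(chunk)              # live "first current audio"
        else:
            self._buffer.append(chunk)     # record locally, do not upload
        self._was_speaking = speaking
        return out
```

Because the queue has a fixed maximum length, memory use stays constant no matter how long the user remains silent.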

As a preferred embodiment, before acquiring the face image of the user of the terminal, the method further includes:

the server receives, from the conference system, the gateway MAC and the IP of each terminal connected to the conference system, each IP corresponding one-to-one to a terminal;

if there are terminals in the conference system whose gateway MAC is the same as that of other terminals, assigning the terminals with the same gateway MAC to a conference group;

and sending a dense mode instruction to the terminals in the conference group based on the IP, so that each terminal proceeds to the step of acquiring the face image of its user after receiving the dense mode instruction.

Referring to Fig. 2, Fig. 2 is a schematic diagram illustrating a relationship between a plurality of terminals and a server according to the present invention.

Considering that there are usually a plurality of terminals accessing the conference system, and the positions of the terminals may be far apart or close to each other, in order to ensure the communication quality and reduce noise and howling, in the present application, the voice control method provided by the present application is adopted for the terminals close to each other.

In particular, the present application also takes into account that terminals normally connect to a gateway for voice communication. After a terminal connects to the gateway, the gateway allocates it an IP, the IP corresponds one-to-one to the terminal, and one gateway may serve a plurality of terminals. Further, when terminals are close together, that is, within a small physical range, they are likely to be connected to the same gateway. On this basis, whether a terminal connected to the conference system is in the dense mode or the loose mode is judged from its gateway MAC.

Specifically, after terminals connect to the conference system, the conference system acquires, for each of them, the gateway MAC of the gateway to which the terminal is connected and the terminal's IP. The gateway MAC corresponds one-to-one to the gateway and so identifies it, and the IP corresponds one-to-one to the terminal. If there are terminals in the conference system whose gateway MAC is the same as that of other terminals, those terminals are very likely close to each other, within a small physical range. In that case the server assigns the terminals with the same gateway MAC to a conference group and sends a dense mode instruction to the terminals in the conference group based on the IP; after receiving the dense mode instruction, each terminal proceeds to the step of acquiring the face image of its user.

Therefore, the terminal in the dense mode accessed into the conference system can be simply and reliably determined through the method, so that the subsequent server can control each terminal in the dense mode to enter the step of acquiring the face image of the user of the terminal, noise and squeal are reduced, and the call quality is ensured.

As a preferred embodiment, further comprising:

if there is a terminal whose gateway MAC differs from the gateway MACs of the other terminals in the conference system, a loose mode instruction is sent to that terminal based on the IP, so that, after receiving the loose mode instruction, the terminal transmits the user's audio directly to the server when it acquires that audio.

Specifically, if the server determines that the gateway MAC of the gateway to which a terminal in the conference system is connected differs from the gateway MACs of the other terminals in the conference system, the terminal is probably located relatively far from the other terminals and is therefore in the loose mode. The server sends a loose mode instruction to the terminal, and after receiving it the terminal transmits the user's audio directly to the server when it acquires that audio, which simplifies the voice acquisition and uploading process.
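
The gateway-based split into dense and loose mode can be sketched as a grouping step on the server. The function names, the `(ip, gateway_mac)` tuple layout and the mode strings are hypothetical choices for illustration; the text does not define a message format.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_gateway(terminals: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group terminal IPs by the MAC of the gateway they are connected to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ip, gateway_mac in terminals:
        # MAC addresses are compared case-insensitively.
        groups[gateway_mac.lower()].append(ip)
    return dict(groups)


def assign_modes(terminals: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each terminal IP to 'dense' (shared gateway) or 'loose' (alone)."""
    modes: Dict[str, str] = {}
    for ips in group_by_gateway(terminals).values():
        mode = "dense" if len(ips) > 1 else "loose"
        for ip in ips:
            modes[ip] = mode
    return modes
```

Each list with more than one IP in the result of `group_by_gateway` corresponds to one conference group in the sense used above.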

As a preferred embodiment, after assigning terminals with the same gateway MAC to a conference group, the method further includes:

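the server sorts the terminals in the conference group based on the IP;

controlling each terminal in the conference group to send a detection audio signal in sequence;

receiving the detection audio signals that each terminal has received from the other terminals in the conference group;

determining the distance between the terminals based on the strength of the detection audio signals;

judging whether there is a terminal whose minimum distance to the other terminals in the conference group is greater than a distance threshold;

if so, removing that terminal from the conference group;

and sending a loose mode instruction based on the IP to the terminal removed from the conference group, so that, after receiving the loose mode instruction, the removed terminal transmits the user's audio directly to the server when it acquires that audio.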

In order to further improve the accuracy of determining the mode of the plurality of terminals accessing the conference system, in this embodiment, the mode of the terminal is further determined based on the distance between the terminals in the conference group.

Specifically, the server sorts the terminals in the conference group based on the IP, so that the terminals in the conference group can then send, one after another in that order, a detection audio signal of a preset duration (which may be, but is not limited to, 100 ms). If two terminals in the conference group are close to each other, the detection audio signal that each receives from the other will be strong; if they are far apart, the received signal will be weak, or may not be received at all (in which case the received signal strength can be taken as 0).

Each terminal in the conference group therefore receives the detection audio signals sent by the other terminals in the group and uploads them to the server. After receiving them, the server determines the distance between the terminals based on the strength of the detection audio signals, and then judges whether there is a terminal whose minimum distance to the other terminals in the conference group is greater than the distance threshold. If so, that terminal is far from the other terminals in the group: the server removes it from the conference group and, based on the IP, sends it a loose mode instruction, after which the removed terminal transmits the user's audio directly to the server whenever it acquires that audio. The distance threshold may be, but is not limited to, 10 m.
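
The pruning step can be sketched as follows. The text does not state how signal strength is converted to distance, so the inverse-square model in `strength_to_distance`, its reference values, and all function names are assumptions; only the rule "minimum distance to the others greater than the threshold" and the 10 m example come from the text.

```python
import math
from typing import Dict, List, Tuple


def strength_to_distance(strength: float,
                         ref_strength: float = 1.0,
                         ref_distance_m: float = 1.0) -> float:
    """Hypothetical free-field model: strength falls with distance squared."""
    if strength <= 0.0:
        return math.inf  # signal not received at all
    return ref_distance_m * math.sqrt(ref_strength / strength)


def split_group(strengths: Dict[Tuple[str, str], float],
                members: List[str],
                threshold_m: float = 10.0) -> Tuple[List[str], List[str]]:
    """Return (dense_members, loose_members) for one conference group.

    strengths[(rx, tx)] is the strength at which terminal rx received the
    detection audio signal of terminal tx; missing entries count as 0.
    """
    dense: List[str] = []
    loose: List[str] = []
    for ip in members:
        distances = [
            strength_to_distance(strengths.get((ip, other), 0.0))
            for other in members if other != ip
        ]
        nearest = min(distances, default=math.inf)
        (loose if nearest > threshold_m else dense).append(ip)
    return dense, loose
```

This sketch makes a single pass over the group; whether the check should be repeated after a terminal has been removed is not addressed in the text and is left open here.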

It should be noted that, in practical applications, the terminal may calculate the distance between the terminal itself and the other terminal after receiving the detection audio signal sent by the other terminal, and upload the distance to the server, so that the server directly performs subsequent determination based on the distance. In addition, in order to further improve the accuracy of obtaining the distance, in practical applications, each terminal may be controlled to sequentially send the detection audio signals in sequence, and the sending process is repeated in a round-robin manner, for example, the sending process is repeated twice, so that a subsequent server or terminal calculates the distance after averaging the multiple detection audio signals after receiving the detection audio signals, or calculates the distance before averaging. In addition, after obtaining the distance between each terminal in the conference group, the server may further generate a relative distance table between all terminals in the conference group, and send the relative distance table to each terminal in the conference group.
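
The two averaging orders mentioned above generally give different results when the strength-to-distance mapping is nonlinear. The sketch below shows both, reusing an assumed inverse-square mapping; the mapping and the function names are hypothetical and are not specified by the text.

```python
import math
from typing import Sequence


def to_distance(strength: float) -> float:
    """Assumed inverse-square mapping, with 1.0 strength meaning 1 m."""
    return math.inf if strength <= 0.0 else math.sqrt(1.0 / strength)


def distance_from_mean_strength(samples: Sequence[float]) -> float:
    """Average the received strengths first, then convert to a distance."""
    return to_distance(sum(samples) / len(samples))


def mean_of_distances(samples: Sequence[float]) -> float:
    """Convert each received strength to a distance, then average."""
    return sum(to_distance(s) for s in samples) / len(samples)
```

With this mapping, a single round in which the signal was not received (strength 0) makes `mean_of_distances` infinite, while `distance_from_mean_strength` still yields a finite value; which behaviour is preferable depends on the deployment.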

In addition, in practical applications, after sorting the terminals in the conference group, the server may itself control the terminals to send the detection audio signal one after another in that order, or may send each terminal its sequence number within the conference group so that each terminal starts sending its detection audio signal once the preceding terminal has finished. The application does not specifically limit how the terminals are controlled to send the detection audio signal in sequence; this is determined according to actual conditions.
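
One way to derive sequence numbers is sketched below, assuming IPv4 addresses and back-to-back time slots equal to the detection signal duration (100 ms in the example above). The function name and the slot scheme are hypothetical.

```python
import ipaddress
from typing import Dict, List, Tuple


def probe_schedule(ips: List[str], slot_ms: int = 100) -> Dict[str, Tuple[int, int]]:
    """Map each IP to (sequence number, start offset in ms).

    Sorting uses the numeric value of the address, so 192.168.1.2 comes
    before 192.168.1.10, which a plain string sort would get wrong.
    """
    ordered = sorted(ips, key=ipaddress.ip_address)
    return {ip: (seq, seq * slot_ms) for seq, ip in enumerate(ordered)}
```

A terminal that knows its own sequence number and the slot length can start its detection signal at the right time without a separate trigger from the server.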

Therefore, the embodiment further determines whether each terminal in the conference group is in a small physical range through the distance between each terminal in the conference group, so that the accuracy of determining the terminal in the dense mode accessed into the conference system is improved, and in addition, the terminal in the loose mode can be controlled to directly transmit the audio of the user to the server when the audio of the user is acquired, so that the voice acquisition and uploading process is simplified. Therefore, the embodiment realizes different voice acquisition and uploading processes of the terminals in different modes, and realizes the balance of noise reduction and voice acquisition and uploading simplification.

As a preferred embodiment, further comprising:

when receiving audio sent by a plurality of terminals in a dense mode in a conference group, a server determines a speaking terminal according to a preset speaking priority principle, controls the speaking terminal to send subsequent audio, and controls other terminals in the conference group not to send the subsequent audio.

It is considered that a plurality of terminals may simultaneously transmit audio to the server at the same time, that is, the speakers of the plurality of terminals may emit sounds, which may be collected by the microphones of other terminals, thereby generating noise or howling.

To solve the above technical problem, in this embodiment, when the server detects that several people in the same location are speaking, that is, when it receives audio sent by a plurality of terminals in the dense mode in a conference group, the server determines a speaking terminal according to a preset speaking priority principle, controls the speaking terminal to send subsequent audio, and controls the other terminals in the conference group not to send subsequent audio. The preset speaking priority principle may allocate speaking time according to a preset conference role or according to speaking frequency, for example: the conference host takes precedence, the participant who has spoken more takes precedence, or the participant who has spoken less takes precedence.

In this way, only one of the terminals in the dense mode in the conference group speaks at any one time, which reduces noise and howling and ensures reliable multi-terminal communication.
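
A speaking priority principle of this kind can be sketched as a sort key. The `Candidate` fields, the tie-breaking by IP, and the `prefer_less_spoken` switch are hypothetical; the text only names host priority and priority by amount of speaking as examples.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    ip: str
    is_host: bool
    spoken_s: float  # accumulated speaking time so far, in seconds


def pick_speaker(candidates: Iterable[Candidate],
                 prefer_less_spoken: bool = True) -> Optional[Candidate]:
    """Choose the one terminal in a group that may keep sending audio."""
    pool = list(candidates)
    if not pool:
        return None

    def key(c: Candidate):
        time_rank = c.spoken_s if prefer_less_spoken else -c.spoken_s
        # Host first, then speaking time, then IP as a stable tie-breaker.
        return (not c.is_host, time_rank, c.ip)

    return min(pool, key=key)
```

The server would then tell the chosen terminal to continue sending and the remaining terminals in the group to stop.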

As a preferred embodiment, further comprising:

controlling a display device to perform corresponding display according to the state of the terminal.

In order to facilitate the user to know the state of the terminal in time, the processor can control the display device to display correspondingly according to the state of the terminal.

Taking a display lamp as the display device as an example:

while the terminal is muted, the display lamp is steadily lit in a first color, such as red;

while the terminal is speaking, the display lamp is steadily lit in a second color, such as blue;

while the terminal is networking, the display lamp flashes rapidly; when networking succeeds, the display lamp pulses in a breathing pattern in two colors and stops after 3 seconds;

while the terminal is performing audio detection, the display lamp runs in a marquee (chasing-light) pattern; after the detection is finished, the display lamp is fully lit for 3 seconds and then turns off.

The display of the display device in different states of the terminal may be set according to actual conditions, and the present application is not particularly limited thereto.
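
The example states and lamp patterns can be written as a simple lookup table. The enum and its member names are hypothetical, and the pattern strings merely restate the examples above; an actual terminal would map them to its own lamp driver.

```python
from enum import Enum


class TerminalState(Enum):
    MUTED = "muted"
    SPEAKING = "speaking"
    NETWORKING = "networking"
    NETWORKED = "networked"
    DETECTING = "detecting"
    DETECTION_DONE = "detection_done"


# Lamp behaviour per state, following the examples in the text.
LAMP_PATTERNS = {
    TerminalState.MUTED: "steady, first color (e.g. red)",
    TerminalState.SPEAKING: "steady, second color (e.g. blue)",
    TerminalState.NETWORKING: "fast flashing",
    TerminalState.NETWORKED: "two-color breathing for 3 s, then stop",
    TerminalState.DETECTING: "marquee",
    TerminalState.DETECTION_DONE: "fully lit for 3 s, then off",
}


def lamp_pattern(state: TerminalState) -> str:
    """Look up the lamp pattern for a terminal state."""
    return LAMP_PATTERNS[state]
```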

Referring to Fig. 3, Fig. 3 is a schematic structural diagram of an electronic device provided in the present invention. The electronic device includes:

a memory for storing a computer program;

and the processor is used for realizing the steps of the terminal audio acquisition control method when executing the computer program.

Specifically, the electronic device herein may be a terminal, or include a terminal and a server, where the terminal may be a mobile phone, a sound box, or the like.

In addition to the memory and the processor, the terminal comprises a display screen with touch and display functions, a loudspeaker for playing sound, a microphone for recording sound, a camera for acquiring images, a display lamp for showing reminders or providing light effects, and a wireless module for establishing communication and transmitting data, where the wireless module may be a WiFi module, an operator network module (3G/4G/5G), or the like. For the introduction of the electronic device provided by the present invention, please refer to the above method embodiments, which are not described herein again.

The invention also provides a readable storage medium, wherein a computer program is stored on the readable storage medium, and when being executed by a processor, the computer program realizes the steps of the terminal audio acquisition control method.

For the introduction of the readable storage medium provided by the present invention, please refer to the above method embodiments, which are not repeated herein.

It is to be noted that, in the present specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A terminal audio acquisition control method is applied to a terminal and is characterized by comprising the following steps:

acquiring a face image of a user of the terminal;

judging whether the user speaks or not based on the face image;

if so, sending a first current audio acquired by the terminal to the server;

and if not, not sending the first current audio acquired by the terminal to the server.

2. The terminal audio collection control method of claim 1, wherein determining whether the user is speaking based on the face image comprises:

identifying lips from the face image;

and judging whether the lips move multiple times within a first preset duration; if so, determining that the user is speaking, and otherwise determining that the user is not speaking.

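3. The terminal audio collection control method according to claim 1, further comprising, when obtaining the face image of the user of the terminal:

acquiring a second current audio collected by the terminal, and storing the second current audio in a memory;

after determining that the user is speaking, the method further comprises:

acquiring from the memory the second current audio covering a second preset duration before the current time;

sending the first current audio collected by the terminal to the server comprises:

transmitting the second current audio acquired from the memory and the first current audio to the server.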

4. The terminal audio acquisition control method according to any one of claims 1 to 3, further comprising, before acquiring the face image of the user of the terminal:

the server receives, from a conference system, the gateway MAC and the IP of each terminal connected to the conference system, each IP corresponding one-to-one to a terminal;

if there are terminals in the conference system whose gateway MAC is the same as that of other terminals, assigning the terminals with the same gateway MAC to a conference group;

and sending a dense mode instruction to the terminals in the conference group based on the IP, so that each terminal proceeds to the step of acquiring the face image of its user after receiving the dense mode instruction.

5. The terminal audio acquisition control method of claim 4, further comprising:

if there is a terminal whose gateway MAC differs from the gateway MACs of the other terminals in the conference system, sending a loose mode instruction to that terminal based on the IP, so that, after receiving the loose mode instruction, the terminal transmits the user's audio directly to the server when it acquires that audio.

6. The terminal audio collection control method according to claim 4, wherein after assigning terminals with the same gateway MAC to a conference group, the method further comprises:

the server sorts all the terminals in the conference group based on the IP;

controlling each terminal in the conference group to send a detection audio signal in sequence;

receiving the detection audio signals that each terminal has received from the other terminals in the conference group;

determining the distance between the terminals based on the strength of the detection audio signals;

judging whether there is a terminal whose minimum distance to the other terminals in the conference group is greater than a distance threshold;

if so, removing that terminal from the conference group;

and sending a loose mode instruction based on the IP to the terminal removed from the conference group, so that, after receiving the loose mode instruction, the removed terminal transmits the user's audio directly to the server when it acquires that audio.

7. The terminal audio acquisition control method of claim 4, further comprising:

when receiving audio sent by a plurality of terminals in the dense mode in the conference group, the server determines a speaking terminal according to a preset speaking priority principle, controls the speaking terminal to send subsequent audio, and controls the other terminals in the conference group not to send subsequent audio.

8. The terminal audio acquisition control method of claim 7, further comprising:

controlling a display device to perform corresponding display according to the state of the terminal.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the terminal audio acquisition control method according to any one of claims 1 to 8 when executing the computer program.

10. A readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, realizes the steps of the terminal audio acquisition control method according to any one of claims 1 to 8.