CN110767228A - Sound acquisition method, device, equipment and system - Google Patents

Sound acquisition method, device, equipment and system

Info

Publication number
CN110767228A
Authority
CN
China
Prior art keywords
sound
lip
person
features
acquisition
Prior art date
Legal status
Granted
Application number
CN201810826055.6A
Other languages
Chinese (zh)
Other versions
CN110767228B (English)
Inventor
齐昕 (Qi Xin)
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810826055.6A priority Critical patent/CN110767228B/en
Publication of CN110767228A publication Critical patent/CN110767228A/en
Application granted granted Critical
Publication of CN110767228B publication Critical patent/CN110767228B/en
Current legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/24: Speech recognition using non-acoustical features
    • G10L 15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L 15/26: Speech to text systems
    • G10L 15/28: Constructional details of speech recognition systems
    • G10L 15/30: Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G10L 15/34: Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiment of the invention provides a sound acquisition method, a device, equipment and a system, wherein the method comprises the following steps: first, lip images of a person are analyzed, and the sound collected by the sound collection device is acquired only when it is determined that the person's lips are moving. It can be understood that if a person's lips are moving, the person is very likely opening their mouth to speak; acquiring the sound collected by the sound collection device only in this case reduces the probability of acquiring nothing but noise.

Description

Sound acquisition method, device, equipment and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, an apparatus, a device, and a system for acquiring a sound.
Background
In various fields such as smart homes and vehicle-mounted devices, voice recognition is generally required to conveniently control home devices, vehicle-mounted devices and the like. An existing recognition scheme generally works as follows: the voice recognition device collects sound in the environment, analyzes the collected sound, and obtains a control instruction or other interaction information issued by the user.
However, in the above scheme, the speech recognition device cannot distinguish between noise and the speech of the user, and if the collected sound only includes noise in the environment, the speech recognition device may also analyze the noise, which wastes device resources.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method, an apparatus, a device, and a system for acquiring a sound, so as to reduce the probability of acquiring only noise.
In order to achieve the above object, an embodiment of the present invention provides a sound acquiring method, including:
acquiring a lip image of a person acquired by image acquisition equipment;
judging whether the person has lip movement or not by analyzing the lip image;
and if so, acquiring the sound acquired by the sound acquisition equipment after the lip action of the person exists.
Optionally, the acquiring sound collected by the sound collecting device after the lip action of the person includes:
determining a direction of the person relative to a sound collection device;
generating acquisition parameters of the sound acquisition equipment according to the determined direction;
sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment;
and receiving the sound collected by the sound collection equipment according to the acquisition instruction.
Optionally, after acquiring the sound collected by the sound collection device after the lip action of the person, the method further includes:
and executing a first type of interaction task based on the acquired sound and the lip image.
Optionally, when it is determined that the person has a lip movement, the method further includes:
extracting the features of the lip images to obtain lip language features of the personnel;
after the acquiring the sound collected by the sound collecting device after the lip action of the person, the method further comprises:
carrying out feature extraction on the obtained sound to obtain the sound features of the personnel;
the executing of a first type of interaction task based on the acquired sound and the lip image comprises:
and inputting the lip language features and the voice features into a recognition network obtained by pre-training, and executing a first type of interaction task based on an output result.
Optionally, the inputting the lip language features and the voice features to a recognition network obtained by pre-training, and executing a first type of interaction task based on an output result includes:
sending the lip language features and the sound features to a cloud server so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result and obtain an interactive resource corresponding to the output result;
receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
Optionally, after acquiring the sound collected by the sound collection device after the lip action of the person, the method further includes:
matching the acquired sound with a plurality of sound models stored in advance;
and executing a second type of interaction task corresponding to the successfully matched sound model.
In order to achieve the above object, an embodiment of the present invention further provides a sound acquiring apparatus, including:
the first acquisition module is used for acquiring a lip image of a person acquired by the image acquisition equipment;
the judging module is used for judging whether the person has lip movement or not by analyzing the lip image; if so, triggering the second acquisition module;
and the second acquisition module is used for acquiring the sound acquired by the sound acquisition equipment after the lip action of the person exists.
Optionally, the second obtaining module is specifically configured to:
determining a direction of the person relative to a sound collection device;
generating acquisition parameters of the sound acquisition equipment according to the determined direction;
sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment;
and receiving the sound collected by the sound collection equipment according to the acquisition instruction.
Optionally, the apparatus further comprises:
and the first interaction module is used for executing a first type of interaction task based on the acquired sound and the lip image.
Optionally, the apparatus further comprises:
the first extraction module is used for extracting the features of the lip images to obtain the lip language features of the personnel under the condition that the lip action of the personnel is judged;
the second extraction module is used for extracting the characteristics of the acquired sound to obtain the sound characteristics of the personnel;
the first interaction module is specifically configured to: and inputting the lip language features and the voice features into a recognition network obtained by pre-training, and executing a first type of interaction task based on an output result.
Optionally, the first interaction module is specifically configured to:
sending the lip language features and the sound features to a cloud server so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result and obtain an interactive resource corresponding to the output result;
receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
Optionally, the apparatus further comprises:
the matching module is used for matching the acquired sound with a plurality of sound models stored in advance;
and the second interaction module is used for executing a second type of interaction task corresponding to the successfully matched sound model.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor and a memory;
a memory for storing a computer program;
and a processor for implementing any of the sound acquisition methods described above when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present invention further provides a sound acquiring system, including: an image pickup apparatus, a sound pickup apparatus, and a processing apparatus, wherein,
the image acquisition equipment is used for acquiring lip images of personnel and sending the lip images to the processing equipment;
the processing equipment is used for analyzing the lip image and judging whether the person has lip action; if yes, sending an acquisition instruction to the sound acquisition equipment;
and the sound acquisition equipment is used for sending the sound acquired after the acquisition instruction is received to the processing equipment.
Optionally, the system further includes: a cloud server;
the processing equipment is further used for extracting features of the lip images to obtain lip language features of the personnel under the condition that the lip action of the personnel is judged; extracting the characteristics of the acquired sound to obtain the sound characteristics of the personnel; sending the lip language features and the sound features to the cloud server;
the cloud server is further used for inputting the lip language features and the voice features to a recognition network obtained through pre-training to obtain an output result and obtain interactive resources corresponding to the output result; sending the interaction resource to the processing device;
the processing device is further configured to execute a first type of interaction task based on the interaction resource.
When the embodiment of the invention is applied to sound acquisition, the lip image of a person is analyzed, and the sound collected by the sound collection device is acquired only when it is determined that the person's lips are moving; it can be understood that if a person's lips are moving, the person is very likely opening their mouth to speak, and acquiring the sound collected by the sound collection device only in this case reduces the probability of acquiring nothing but noise.
Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a first flowchart of a sound acquiring method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an identification network according to an embodiment of the present invention;
fig. 3 is a second flowchart of a sound acquiring method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a sound capture device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a sound acquiring system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the above technical problems, embodiments of the present invention provide a sound acquisition method, apparatus, device, and system, where the method and apparatus may be applied to a voice recognition device, such as a vehicle-mounted voice recognition device, a home voice recognition device, and the like; or the method may also be applied to a sound collection device, such as a smart speaker, or may also be applied to other electronic devices, such as a robot, and the like, without limitation.
First, a sound acquiring method according to an embodiment of the present invention will be described in detail. Fig. 1 is a first flowchart of a sound obtaining method according to an embodiment of the present invention, including:
s101: and acquiring the lip image of the person acquired by the image acquisition equipment.
The electronic device executing this scheme (the execution body, hereinafter referred to as the electronic device) may be connected to the image acquisition device, or may have the image acquisition device built in; this is not specifically limited. The image acquisition device acquires a lip image of a person.
For example, assuming that the present solution is applied to the vehicle field, taking a car as an example, and assuming that there are four seats in the car, an image acquisition device may be disposed near each of the four seats to capture images of the person on that seat. Specifically, the image acquisition device may be disposed in front of the seat or at its left or right side; the specific position is not limited.
An image capture device may be positioned to be aimed at the lips of the person to obtain an image of the lips of the person. Alternatively, the image capturing device may capture a whole body image, a half body image, a head image, and the like of the person, and then segment the captured image to obtain a lip image of the person.
Alternatively, in other scenarios, only the lip image of the driver may be acquired. For example, in a bus, an image capturing device is provided only near a driver, a head image of the driver is captured, and then a lip image of the driver is segmented in the head image.
For another example, assuming that the scheme is applied to the field of smart homes, the image acquisition device may be set at a designated position, acquire the person image, and segment the person image to obtain the lip image, or the image acquisition device directly acquires the lip image of the person.
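As an illustration of the segmentation mentioned above, the following is a minimal sketch of cropping a lip image from a captured head or person image. It assumes the standard dlib 68-point face landmark model is available locally; the model file name and the crop margin are illustrative choices and are not specified by this scheme.

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Hypothetical local path to the standard dlib 68-point landmark model.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_lip_image(frame_bgr, margin=10):
    """Detect the face, locate the mouth landmarks, and return the lip crop."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model outline the mouth region.
    mouth = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                     dtype=np.int32)
    x, y, w, h = cv2.boundingRect(mouth)
    return frame_bgr[max(y - margin, 0):y + h + margin,
                     max(x - margin, 0):x + w + margin]
```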
S102: by analyzing the lip image, it is determined whether there is a lip movement for the person. If so, S103 is performed.
For example, the electronic device may store a lip movement model in advance, that is, a model of the presence of lip movement, match the lip image with the lip movement model, and if the matching is successful, indicate that the person has lip movement. Alternatively, the electronic device may store a model in which no lip movement exists, match the lip image with the model in which no lip movement exists, and if the matching is successful, it indicates that the person does not have a lip movement.
Or, the distance between the upper and lower lips in the lip image may be analyzed: if the distance is smaller than a preset threshold, the person has no lip motion; if the distance is not smaller than the preset threshold, the person has lip motion.
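The threshold test just described could look like the following minimal sketch, assuming per-frame lip landmark points are already available (for example from the cropping step above); the threshold value is illustrative and not fixed by this scheme.

```python
import numpy as np

def has_lip_motion(upper_lip_pts, lower_lip_pts, threshold=5.0):
    """Return True when the gap between the upper and lower lip (in pixels)
    is not smaller than the preset threshold."""
    gap = float(np.mean(lower_lip_pts[:, 1]) - np.mean(upper_lip_pts[:, 1]))
    return gap >= threshold
```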
S103: and acquiring the sound acquired by the sound acquisition equipment after the lip action of the person exists.
In one embodiment, when the determination result in S102 is yes, the electronic device sends an acquisition instruction to the sound collection device, and the sound collection device sends the sound collected after receiving the acquisition instruction to the electronic device.
For example, in one case, the acquisition instruction may be a start instruction. That is, the sound collection device is not yet started; when the determination result in S102 is yes, the electronic device sends a start instruction to the sound collection device, and after receiving the start instruction, the sound collection device starts collecting sound and sends the collected sound to the electronic device.
Taking the above car scene as an example, a sound collection device may be provided near each seat to collect sound for the person in each seat: it is assumed that a sound collection device 1 is provided near a seat 1, a sound collection device 2 is provided near the seat 2, a sound collection device 3 is provided near the seat 3, and a sound collection device 4 is provided near the seat 4; the 4 sound collection devices are in an off state, and after the electronic device judges that the lip action exists on the person on the seat 3, the electronic device sends a starting instruction to the sound collection device 3 to start the sound collection device 3 to collect sound.
In the embodiment, the sound collection equipment is started to collect the sound only under the condition that the lip action exists in the personnel, so that on one hand, the probability of only obtaining the noise is reduced, and on the other hand, the resource utilization rate of the sound collection equipment is improved.
As another example, the sound collection device may be in a working state all the time, and the sound collection device sends the collected sound to the electronic device only after receiving the acquisition instruction sent by the electronic device.
Still taking the scene of the car as an example, the 4 sound collection devices are always in an operating state, and after the electronic device determines that there is a lip motion of a person on the seat 3, the electronic device sends an acquisition instruction to the sound collection device 3, and the sound collection device 3 sends the sound collected after receiving the acquisition instruction to the electronic device.
In another embodiment, the sound collection device is always in a working state and sends collected sound to the electronic device in real time; in this case, the electronic apparatus determines the sound received in the case where the determination of S102 is yes as a valid sound, that is, determines the sound collected by the sound collecting apparatus after the presence of lip motion of the person as a valid sound, and subsequently reads only the valid sound for analysis processing.
As an embodiment, S103 may include: determining a direction of the person relative to a sound collection device; generating acquisition parameters of the sound acquisition equipment according to the determined direction; sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment; and receiving the sound collected by the sound collection equipment according to the acquisition instruction.
In this embodiment, the sound collection device can collect sounds in different directions. For example, the direction of the person relative to the sound collection device may be determined as the collection direction based on the lip image and/or the sound collected by the sound collection device.
Continuing with the above example, assuming that an image acquisition device is provided near each seat in the car, and the lip motion of the person on seat 3 is determined by analyzing the lip images acquired by each image acquisition device, the direction of seat 3 with respect to the sound collection device is determined as the collection direction. Alternatively, if the determination result in S102 is yes, the direction of the person with respect to the sound collection device may be located based on the sound collected by the sound collection device. Alternatively, the analysis result of the lip image may be combined with the direction located from the collected sound, so that the positioning is more accurate.
For example, the sound collection device may be rotatable, in which case, the generated collection parameters may include rotation parameters of the sound collection device, such as a rotation direction, a rotation angle, and the like. Still taking the scene of the car as an example, a sound collection device may be disposed in the car, and by rotating the sound collection device, sound collection may be performed for each person in each seat. Suppose that after the electronic device determines that the lip of the person on the seat 3 moves, the electronic device generates a rotation parameter rotating in the direction of the seat 3, and sends an acquisition instruction containing the rotation parameter to the sound collection device, so that the sound collection device rotates in the direction of the seat 3 to collect sound.
As another example, the sound collection device may be a microphone array. Still taking a car as an example, the sound collection device may be a 6-microphone uniform linear array, the array may be located in the center above the front window of the car, the array may be always in a start state to continuously collect sound in the car, and the sound collected within a certain time period may be buffered.
After determining the direction of the person relative to the sound collection device, the collection parameters of each microphone in the microphone array may be determined based on the direction; controlling the microphone array to perform sound collection based on the determined collection parameters.
It will be appreciated that the microphone array can collect sound directionally for different directions. In particular, by adjusting the microphone parameters so that the array collects sound in some directions and suppresses sound in other directions, i.e. by controlling the microphone array to perform directional beamforming, sound collection can be restricted to the collection direction only.
In this embodiment, when the person has a lip action, the sound collection device is controlled to collect sound in the direction of the person relative to the sound collection device; on one hand, this reduces the probability of acquiring noise, and on the other hand, because sound is collected only in the direction in which the person is located, the collected sound contains less noise.
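As one possible reading of the "collection parameters" for the 6-microphone uniform linear array described above, the following sketch derives per-microphone steering delays and applies simple delay-and-sum beamforming toward the determined direction. The microphone spacing, sample rate, and speed of sound are assumed values, and the wrap-around shift is a simplification.

```python
import numpy as np

def steering_delays(direction_deg, n_mics=6, spacing_m=0.04, fs=16000, c=343.0):
    """Per-microphone delays (in samples) that steer a uniform linear array
    toward direction_deg, measured from the array axis."""
    angle = np.deg2rad(direction_deg)
    mic_positions = np.arange(n_mics) * spacing_m
    delays_s = mic_positions * np.cos(angle) / c
    return np.round(delays_s * fs).astype(int)

def delay_and_sum(frames, delays):
    """frames: (n_mics, n_samples) buffered multichannel audio.
    Aligns the channels by compensating their delays and averages them."""
    out = np.zeros(frames.shape[1])
    for channel, d in zip(frames, delays):
        out += np.roll(channel, -d)  # wrap-around shift: acceptable for a sketch
    return out / len(frames)
```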
The electronic device can analyze and process the sound acquired in S103, and it can be understood that the sound acquired in S103 includes the voice sent by the user, so that the probability of analyzing only the noise is reduced, and the sound analysis efficiency is improved.
As an embodiment, after S103, the first type of interaction task may be further performed based on the acquired sound and the lip image.
The "first type of interactive task" refers to a task assigned to the electronic device by a person, and for the purpose of distinguishing from the following interactive task, the interactive task in the present embodiment is referred to as a first type of interactive task. The first type of interaction task may be a control instruction, such as "play a certain song", "broadcast weather", or may also be a conversation between a person and an electronic device, and is not particularly limited.
In the existing scheme, after the voice of a person is obtained, the voice is generally analyzed and then the corresponding interaction task is executed; in the present scheme, the voice and the lip image are combined, so the analysis result is more accurate and the first type of interaction task is executed more accurately. It can be understood that the lip language of the person can be analyzed based on the lip image, and the actual speech uttered by the person can be recognized by combining the sound with the lip language.
In one case, the noise reduction processing may be performed on the sound obtained in S103, and the analysis result is more accurate and the execution of the first type of interaction task is more accurate by combining the noise-reduced sound and the lip image. If the sound collection device is a microphone array, after the direction of a person relative to the sound collection device is determined, the determined direction can be used as a collection direction, and noise reduction processing can be performed on the sound in the non-collection direction.
For example, when the determination result in S102 is yes, feature extraction may be performed on the lip image to obtain lip language features of the person; and extracting the characteristics of the sound obtained in the step S103 to obtain the sound characteristics of the person; and then inputting the lip language features and the voice features into a recognition network obtained by pre-training, and executing a first type of interaction task based on an output result.
The recognition network may be as shown in fig. 2: the voice feature and the lip language feature are each input to a corresponding CNN (Convolutional Neural Network); the two CNNs are connected to a Bi-GRU (Bidirectional Gated Recurrent Unit), so the output results of the two CNNs are input to the Bi-GRU; the Bi-GRU is connected to an FC (fully connected) layer, so the output of the Bi-GRU is input to the FC layer; and the output of the FC layer is the output of the recognition network, which may specifically be the interaction information uttered by the person.
Specifically, the CNN corresponding to the sound feature, that is, the CNN on the left side in fig. 2, may be a two-layer 1D-CNN (one-dimensional CNN), with a convolution kernel of 5 and strides (step sizes) of 1 and 2 for the two layers, respectively. The CNN corresponding to the lip language feature, i.e. the CNN on the right side in fig. 2, may be two layers of STCNN (spatiotemporal convolutional neural network), with one spatial max-pooling layer behind each layer. The Bi-GRU may have five layers, and the hidden size (the number of hidden-layer units) of the Bi-GRU may be 1024.
The first type of interaction task is then executed based on the interaction information. For example, the interaction information may be a conversation between a person and the electronic device; assuming the interaction information uttered by the person is "I have an interview today", the interaction task performed by the electronic device may be to answer "wish you good luck". A number of fixed dialog templates may be stored in the electronic device in advance, so that the electronic device can hold a conversation with a person according to the dialog templates. The electronic device may be a smart speaker; in this case, the execution body may be the smart speaker itself, or a voice recognition device connected to the smart speaker. There are many possibilities for the specific interaction process, which are not listed one by one.
The CNNs, the Bi-GRU, and the FC layer in fig. 2 are all part of the recognition network. For example, each CNN may comprise 2 layers, the Bi-GRU may comprise 5 layers, and the sound feature may be an 80-dimensional Fbank feature. The recognition network can be trained end to end; the training framework may be based on fig. 2, with a loss function, such as CTC loss, connected after the FC layer. Through iterative training the loss of the loss function is minimized, and the iteration then ends, yielding the trained recognition network. Alternatively, a loss threshold may be set, and when the loss of the loss function falls below the threshold, the iteration ends and the trained recognition network is obtained. Alternatively, a number of iterations may be set, and after that number of iterations is reached, the iteration ends and the trained recognition network is obtained. The specific training process is not limited.
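A minimal PyTorch sketch consistent with the network described above (two 1D convolution layers with kernel 5 and strides 1 and 2 for the 80-dimensional Fbank features, two spatiotemporal convolution layers each followed by spatial max-pooling for the lip images, a five-layer Bi-GRU with hidden size 1024, and an FC layer trained with CTC loss) is given below. The channel counts, vocabulary size, and the way the two branches are temporally aligned are illustrative assumptions, not values given by this scheme.

```python
import torch
import torch.nn as nn

class AVRecognitionNet(nn.Module):
    def __init__(self, vocab_size=100):
        super().__init__()
        # Audio branch: two 1D convolutions over 80-dim Fbank frames,
        # kernel size 5, strides 1 and 2.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(80, 128, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        # Lip branch: two spatiotemporal (3D) convolutions, each followed by
        # spatial max-pooling, over grayscale lip-image sequences.
        self.lip_cnn = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        # Five-layer bidirectional GRU with hidden size 1024, then an FC layer.
        self.gru = nn.GRU(input_size=256 + 64, hidden_size=1024, num_layers=5,
                          bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 1024, vocab_size)

    def forward(self, fbank, lips):
        # fbank: (batch, 80, T_audio); lips: (batch, 1, T_video, H, W)
        a = self.audio_cnn(fbank).transpose(1, 2)                # (batch, T_a, 256)
        v = self.lip_cnn(lips).mean(dim=(3, 4)).transpose(1, 2)  # (batch, T_v, 64)
        t = min(a.size(1), v.size(1))                            # crude temporal alignment
        fused = torch.cat([a[:, :t], v[:, :t]], dim=-1)
        out, _ = self.gru(fused)
        return self.fc(out)  # per-frame logits, to be trained with nn.CTCLoss
```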
In one embodiment, the identification network may be located in the cloud server, or in another device connected to the cloud server; in this way, the obtained lip language features and the sound features can be sent to a cloud server, so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result, and an interactive resource corresponding to the output result is obtained; receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
The device (the electronic device) executing the scheme may communicate with the cloud server through a network such as 3G, 4G, WIFI, and a specific communication method is not limited.
The electronic device can be understood as a local device, and the storage space of a local device is limited; in this embodiment, the lip language features and the sound features are recognized by the cloud server, which saves the storage space of the local device. On the other hand, the electronic device does not transmit the raw sound and image data but the lip language features and sound features, whose data volume is smaller than that of the raw data, so this implementation improves transmission efficiency and occupies fewer transmission resources.
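A minimal sketch of shipping the extracted features (rather than raw audio and images) to the cloud server might look as follows; the endpoint URL and the JSON field names are hypothetical, since no concrete protocol is defined here.

```python
import requests

def request_interaction_resource(lip_features, sound_features,
                                 url="https://example-cloud-server/recognize"):
    """Send the extracted features (numpy arrays) and return the interaction
    resource chosen by the cloud server."""
    payload = {
        "lip_features": lip_features.tolist(),
        "sound_features": sound_features.tolist(),
    }
    resp = requests.post(url, json=payload, timeout=5)
    resp.raise_for_status()
    return resp.json()  # e.g. {"interaction_resource": ...}
```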
As described above, the output result of the recognition network may be the interaction information uttered by a person. Still taking a car as an example, assuming the interaction information is "play a certain song", the cloud server may search for the corresponding song and send the song as an interaction resource to the in-car sound box, and the sound box plays the received song.
Supposing that the interactive information is 'broadcast weather', the cloud server can search the current weather information, send the weather information to the sound box in the vehicle as the interactive resource, and the sound box broadcasts the received weather information. The equipment for executing the scheme can be a sound box, and can also be voice recognition equipment connected with the sound box.
Or, assuming the interaction information is "I have an interview today", the cloud server may search for corresponding answer content in a dialog template stored in the cloud server, for example "wish you good luck", and send the answer content as an interaction resource to the in-car sound box; the sound box then plays "wish you good luck". There are many possibilities for the specific interaction process, which are not listed one by one.
The cloud server can identify a long conversation, and the electronic equipment can execute a complex interaction task through the cloud server.
As an embodiment, after S103, the acquired sound may be matched with a plurality of sound models stored in advance; and executing a second type of interaction task corresponding to the successfully matched sound model.
In one case, the second type of interaction task may be simpler than the first type. For example, the second type of interaction task may be a simple control instruction, such as starting the air conditioner or starting the sound box; or, the person may utter the wake-up word of some device, so that the interaction task is to wake up the corresponding device, and so on, without specific limitation.
Some sound models may also be stored in the electronic device in advance, such as sound models of certain keywords, for example "turn on the air conditioner" or "turn on the sound box", or sound models of certain wake-up words; the details are not limited. These sound models are simple and occupy little storage space, and no interaction with the cloud server is needed, so the response is fast and the user experience is good.
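One possible local matching step, sketched below, represents each stored sound model as a template feature vector and compares the captured sound's features against it; the similarity measure and threshold are illustrative assumptions, as no particular matching algorithm is fixed here.

```python
import numpy as np

def match_sound(features, sound_models, threshold=0.8):
    """features: (n_frames, dim) features of the captured sound.
    sound_models: dict mapping a keyword (e.g. a wake-up word) to a stored
    template vector of the same dimension.  Returns the best keyword, or
    None if nothing matches well enough."""
    query = features.mean(axis=0)
    best_keyword, best_score = None, -1.0
    for keyword, template in sound_models.items():
        score = float(np.dot(query, template) /
                      (np.linalg.norm(query) * np.linalg.norm(template) + 1e-9))
        if score > best_score:
            best_keyword, best_score = keyword, score
    return best_keyword if best_score >= threshold else None
```

The returned keyword can then be looked up in a table that maps each keyword to its second type of interaction task, for example waking up the corresponding device.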
In one case, the electronic device may update its stored acoustic model. For example, the cloud server can push a new sound model to the electronic device, or the electronic device can also pull the new sound model to the cloud server periodically.
As can be understood by those skilled in the art, in the vehicle-mounted field, external noise is very high while a vehicle is running, and echo in the closed compartment can also affect voice recognition. On one hand, with this scheme, the sound collected by the sound collection device is acquired only after it is determined that the person's lips are moving, which reduces the probability of acquiring noise, reduces the waste of device resources, and improves the utilization rate of device resources; on the other hand, the interaction task is executed by combining the sound and the lip image, which gives better accuracy.
Fig. 3 is a schematic flow chart of a sound acquiring method according to an embodiment of the present invention, including:
s301: and acquiring the lip image of the person acquired by the image acquisition equipment.
For example, assuming that the present solution is applied to the vehicle field, taking a car as an example, and assuming that there are four seats in the car, an image acquisition device may be disposed near each of the four seats to capture images of the person on that seat. Specifically, the image acquisition device may be disposed in front of the seat or at its left or right side; the specific position is not limited.
An image capture device may be positioned to be aimed at the lips of the person to obtain an image of the lips of the person. Alternatively, the image capturing device may capture a whole body image, a half body image, a head image, and the like of the person, and then segment the captured image to obtain a lip image of the person.
S302: and analyzing the lip image to judge whether the person has lip action, and if so, executing S303-S308.
For example, the electronic device may store a lip movement model in advance, that is, a model of the presence of lip movement, match the lip image with the lip movement model, and if the matching is successful, indicate that the person has lip movement. Alternatively, the electronic device may store a model in which no lip movement exists, match the lip image with the model in which no lip movement exists, and if the matching is successful, it indicates that the person does not have a lip movement.
Or, the distance between the upper and lower lips in the lip image may be analyzed: if the distance is smaller than a preset threshold, the person has no lip motion; if the distance is not smaller than the preset threshold, the person has lip motion.
S303: and carrying out feature extraction on the lip image to obtain the lip language features of the person.
S304: the direction of the person relative to the sound collection device is determined as the collection direction.
The execution order of S303 and S304-S306 is not limited.
For example, the direction of the person relative to the sound collection device may be determined as the collection direction based on the lip image and/or the sound collected by the sound collection device.
Continuing with the above example, assuming that an image acquisition device is provided near each seat in the car, and the lip motion of the person on seat 3 is determined by analyzing the lip images acquired by each image acquisition device, the direction of seat 3 with respect to the sound collection device is determined as the collection direction. In one case, the sound collection device may always be in an activated state, so that when the determination result of S302 is yes, the direction of the person with respect to the sound collection device may be located according to the sound collected by the sound collection device. Alternatively, the analysis result of the lip image may be combined with the direction located from the collected sound, so that the positioning is more accurate.
S305: generating acquisition parameters of the sound acquisition equipment according to the determined acquisition direction; sending an acquisition instruction containing acquisition parameters to sound acquisition equipment; and receiving the sound collected by the sound collection equipment according to the acquisition instruction.
For example, the sound collection device may be a microphone array. Still taking a car as an example, the sound collection device may be a 6-microphone uniform linear array, the array may be located in the center above the front window of the car, the array may be always in a start state to continuously collect sound in the car, and the sound collected within a certain time period may be buffered.
After determining the direction of the person relative to the sound collection device, the collection parameters of each microphone in the microphone array may be determined based on the direction; controlling the microphone array to perform sound collection based on the determined collection parameters.
It will be appreciated that the microphone array can collect sound directionally for different directions. In particular, by adjusting the microphone parameters so that the array collects sound in some directions and suppresses sound in other directions, i.e. by controlling the microphone array to perform directional beamforming, sound collection can be restricted to the collection direction only.
In this embodiment, under the condition that the person has a lip action, the sound collection device is controlled to collect the sound of the person in the direction relative to the sound collection device, so that on one hand, the probability of acquiring the noise is reduced, and on the other hand, the sound collection is performed in the direction in which the person is located, so that the noise in the collected sound is less.
S306: and carrying out feature extraction on the received sound to obtain the sound features of the person.
It is understood that the sound received in S305 includes the sound of a person who has a lip motion, and therefore, the sound feature of the person can be obtained by performing feature extraction on the sound received in S305.
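The 80-dimensional Fbank features mentioned earlier could, for example, be extracted as in the following sketch, assuming 16 kHz single-channel audio; torchaudio's Kaldi-compatible routine is just one possible implementation, not necessarily the one used by this scheme.

```python
import torch
import torchaudio

def extract_fbank(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """waveform: (1, n_samples) mono audio.  Returns (n_frames, 80) Fbank features."""
    return torchaudio.compliance.kaldi.fbank(
        waveform, sample_frequency=sample_rate, num_mel_bins=80
    )
```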
S307: and sending the lip language characteristics and the sound characteristics to a cloud server.
And the cloud server inputs the lip language features and the voice features to a recognition network obtained by pre-training to obtain an output result, and obtains interactive resources corresponding to the output result.
S308: receiving an interactive resource sent by a cloud server; a first type of interaction task is performed based on the interaction resource.
The recognition network may be as shown in fig. 2: the voice feature and the lip language feature are each input to a corresponding CNN (Convolutional Neural Network); the two CNNs are connected to a Bi-GRU (Bidirectional Gated Recurrent Unit), so the output results of the two CNNs are input to the Bi-GRU; the Bi-GRU is connected to an FC (fully connected) layer, so the output of the Bi-GRU is input to the FC layer; and the output of the FC layer is the output of the recognition network, which may specifically be the interaction information uttered by the person.
Specifically, the CNN corresponding to the sound feature, that is, the CNN on the left side in fig. 2, may be a two-layer 1D-CNN (one-dimensional CNN), with a convolution kernel of 5 and strides (step sizes) of 1 and 2 for the two layers, respectively. The CNN corresponding to the lip language feature, i.e. the CNN on the right side in fig. 2, may be two layers of STCNN (spatiotemporal convolutional neural network), with one spatial max-pooling layer behind each layer. The Bi-GRU may have five layers, and the hidden size (the number of hidden-layer units) of the Bi-GRU may be 1024.
Still taking a car as an example, assuming that the interaction information is "play a certain song", the cloud server may search for the corresponding song, and send the song as an interaction resource to the in-car sound box; the loudspeaker plays the received song, i.e. performs the interactive task.
If the interactive information is 'broadcast weather', the cloud server can search the current weather information and send the weather information as an interactive resource to the in-vehicle sound box; the loudspeaker broadcasts the received weather information, i.e. performs an interactive task. Etc., and the specific interactive contents are not limited. The equipment for executing the scheme can be a sound box, and can also be voice recognition equipment connected with the sound box.
Or, assuming the interaction information is "I have an interview today", the cloud server may search for corresponding answer content in a dialog template stored in the cloud server, for example "wish you good luck", and send the answer content as an interaction resource to the in-car sound box; the sound box then plays "wish you good luck". There are many possibilities for the specific interaction process, which are not listed one by one.
The cloud server can identify a long conversation, and the electronic equipment can execute a complex interaction task through the cloud server.
The electronic device can communicate with the cloud server through networks such as 3G, 4G, or WiFi; the specific communication mode is not limited.
The electronic device can be understood as a local device, and the storage space of a local device is limited; in this embodiment, the lip language features and the sound features are recognized by the cloud server, which saves the storage space of the local device. On the other hand, the electronic device does not transmit the raw sound and image data but the lip language features and sound features, whose data volume is smaller than that of the raw data, so this implementation improves transmission efficiency and occupies fewer transmission resources.
By applying the embodiment of the invention, on one hand, the sound collected by the sound collection device is acquired only after it is determined that the person's lips are moving, which reduces the probability of acquiring noise, reduces the waste of device resources, and improves the utilization rate of device resources; on the other hand, the interaction task is executed by combining the sound and the lip image, which gives better accuracy.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a sound acquiring apparatus, as shown in fig. 4, including:
a first obtaining module 401, configured to obtain a lip image of a person collected by an image collecting device;
a determining module 402, configured to determine whether there is a lip motion of the person by analyzing the lip image; if so, trigger the second obtaining module;
a second obtaining module 403, configured to obtain a sound collected by the sound collecting device after the lip motion of the person exists.
As an implementation manner, the second obtaining module 403 is specifically configured to:
determining a direction of the person relative to a sound collection device;
generating acquisition parameters of the sound acquisition equipment according to the determined direction;
sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment;
and receiving the sound collected by the sound collection equipment according to the acquisition instruction.
As an embodiment, the apparatus further comprises:
and a first interaction module (not shown in the figure) for executing a first type of interaction task based on the acquired sound and the lip image.
As an embodiment, the apparatus further comprises: a first extraction module and a second extraction module (not shown in the figures), wherein,
the first extraction module is used for extracting the features of the lip images to obtain the lip language features of the personnel under the condition that the lip action of the personnel is judged;
the second extraction module is used for extracting the characteristics of the acquired sound to obtain the sound characteristics of the personnel;
the first interaction module is specifically configured to: and inputting the lip language features and the voice features into a recognition network obtained by pre-training, and executing a first type of interaction task based on an output result.
As an implementation manner, the first interaction module is specifically configured to:
sending the lip language features and the sound features to a cloud server so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result and obtain an interactive resource corresponding to the output result;
receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
As an embodiment, the apparatus further comprises: a matching module and a second interaction module (not shown), wherein,
the matching module is used for matching the acquired sound with a plurality of sound models stored in advance;
and the second interaction module is used for executing a second type of interaction task corresponding to the successfully matched sound model.
When the embodiment shown in fig. 4 of the invention is applied to sound acquisition, the lip images of a person are analyzed, and the sound collected by the sound collection device is acquired only when it is determined that the person's lips are moving; it can be understood that if a person's lips are moving, the person is very likely opening their mouth to speak, and acquiring the sound collected by the sound collection device only in this case reduces the probability of acquiring nothing but noise.
An embodiment of the present invention further provides an electronic device, as shown in fig. 5, including a processor 501 and a memory 502;
a memory 502 for storing a computer program;
the processor 501 is configured to implement any of the above-described sound acquisition methods when executing the program stored in the memory 502.
The electronic device can be a voice recognition device, such as a vehicle-mounted voice recognition device or a home voice recognition device; or it can be a sound collection device, such as a smart speaker, or another electronic device, such as a robot, without specific limitation.
The Memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
An embodiment of the present invention further provides a sound acquiring system, as shown in fig. 6, including: an image pickup apparatus, a sound pickup apparatus, and a processing apparatus, wherein,
the image acquisition equipment is used for acquiring lip images of personnel and sending the lip images to the processing equipment;
the processing equipment is used for analyzing the lip image and judging whether the person has lip action; if yes, sending an acquisition instruction to the sound acquisition equipment;
and the sound acquisition equipment is used for sending the sound acquired after the acquisition instruction is received to the processing equipment.
For example, the obtaining instruction may be a start instruction, and the sound collection device is in a closed state before receiving the start instruction and is in an operating state after receiving the start instruction.
Or, the sound collection device may be in a working state all the time, and the sound collection device sends the collected sound to the processing device only after receiving the acquisition instruction sent by the processing device.
Or the sound acquisition equipment is always in a working state and transmits acquired sound to the processing equipment in real time; in this case, the processing device determines the sound received after determining that there is a lip motion of the person as a valid sound, and subsequently reads only the valid sound for analysis processing.
Alternatively, the acquisition instruction may include acquisition parameters of the sound acquisition device. The processing device determines the orientation of the person relative to the sound collection device; generating acquisition parameters of the sound acquisition equipment according to the determined direction; and sending an acquisition instruction containing the acquisition parameters to sound acquisition equipment. And the sound collection equipment collects sound according to the collection parameters and sends the collected sound to the processing equipment.
Alternatively, the sound collection device may be a microphone array; the processing equipment determines the direction of a person relative to the microphone array as a collecting direction, and based on the collecting direction, the processing equipment determines collecting parameters of each microphone in the microphone array and sends an acquisition instruction containing the collecting parameters to the microphone array. The microphone array collects sound according to the collection parameters.
As an embodiment, the system may further include: a cloud server;
the processing equipment is further used for extracting features of the lip images to obtain lip language features of the personnel under the condition that the lip action of the personnel is judged; extracting the characteristics of the acquired sound to obtain the sound characteristics of the personnel; sending the lip language features and the sound features to the cloud server;
the cloud server is further used for inputting the lip language features and the voice features to a recognition network obtained through pre-training to obtain an output result and obtain interactive resources corresponding to the output result; sending the interaction resource to the processing device;
the processing device is further configured to execute a first type of interaction task based on the interaction resource.
Some acoustic models may be stored in the processing device; the processing device may also match the person's voice to a pre-stored voice model; and executing a second type of interaction task corresponding to the successfully matched sound model. The cloud server may push a new acoustic model to the processing device, or the processing device may periodically pull a new acoustic model to the cloud server.
The processing device may communicate with the cloud server through a network such as 3G, 4G, WIFI, and the specific communication manner is not limited. The processing device may also apply any of the sound acquisition methods described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, and the system embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and in relation to the description, reference may be made to some of the description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (15)

1. A sound acquisition method, comprising:
acquiring a lip image of a person acquired by image acquisition equipment;
judging whether the person has lip action by analyzing the lip image;
and if so, acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs.
2. The method of claim 1, wherein the acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs comprises:
determining a direction of the person relative to the sound acquisition equipment;
generating acquisition parameters of the sound acquisition equipment according to the determined direction;
sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment;
and receiving the sound collected by the sound acquisition equipment according to the acquisition instruction.
3. The method of claim 1, further comprising, after the acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs:
and executing a first type of interaction task based on the acquired sound and the lip image.
4. The method of claim 3, further comprising, in the event that it is determined that the person has lip action:
extracting features from the lip image to obtain lip language features of the person;
after the acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs, the method further comprises:
performing feature extraction on the acquired sound to obtain sound features of the person;
the executing a first type of interaction task based on the acquired sound and the lip image comprises:
and inputting the lip language features and the sound features into a pre-trained recognition network, and executing the first type of interaction task based on an output result.
5. The method of claim 4, wherein the inputting the lip language features and the sound features into the pre-trained recognition network and executing the first type of interaction task based on the output result comprises:
sending the lip language features and the sound features to a cloud server so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result and obtain an interactive resource corresponding to the output result;
receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
6. The method of claim 1, further comprising, after the acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs:
matching the acquired sound with a plurality of sound models stored in advance;
and executing a second type of interaction task corresponding to the successfully matched sound model.
7. A sound acquisition apparatus, comprising:
the first acquisition module is used for acquiring a lip image of a person acquired by the image acquisition equipment;
the judging module is used for judging whether the person has lip action by analyzing the lip image, and if so, triggering a second acquisition module;
and the second acquisition module is used for acquiring the sound collected by the sound acquisition equipment after the person's lip action occurs.
8. The apparatus of claim 7, wherein the second acquisition module is specifically configured to:
determining a direction of the person relative to the sound acquisition equipment;
generating acquisition parameters of the sound acquisition equipment according to the determined direction;
sending an acquisition instruction containing the acquisition parameters to the sound acquisition equipment;
and receiving the sound collected by the sound acquisition equipment according to the acquisition instruction.
9. The apparatus of claim 7, further comprising:
and the first interaction module is used for executing a first type of interaction task based on the acquired sound and the lip image.
10. The apparatus of claim 9, further comprising:
the first extraction module is used for extracting features from the lip image to obtain lip language features of the person when it is determined that the person has lip action;
the second extraction module is used for extracting features from the acquired sound to obtain sound features of the person;
the first interaction module is specifically configured to: input the lip language features and the sound features into a pre-trained recognition network, and execute the first type of interaction task based on an output result.
11. The apparatus of claim 10, wherein the first interaction module is specifically configured to:
sending the lip language features and the sound features to a cloud server so that the cloud server inputs the lip language features and the sound features to a recognition network obtained through pre-training to obtain an output result and obtain an interactive resource corresponding to the output result;
receiving the interactive resources sent by the cloud server; and executing the first type of interaction tasks based on the interaction resources.
12. The apparatus of claim 7, further comprising:
the matching module is used for matching the acquired sound with a plurality of sound models stored in advance;
and the second interaction module is used for executing a second type of interaction task corresponding to the successfully matched sound model.
13. An electronic device, comprising a processor and a memory;
the memory is configured to store a computer program;
the processor is configured to implement the method steps of any one of claims 1 to 6 when executing the program stored in the memory.
14. A sound acquisition system, comprising: an image acquisition device, a sound acquisition device, and a processing device, wherein,
the image acquisition device is used for acquiring a lip image of a person and sending the lip image to the processing device;
the processing device is used for analyzing the lip image and judging whether the person has lip action, and if so, sending an acquisition instruction to the sound acquisition device;
and the sound acquisition device is used for sending, to the processing device, the sound acquired after the acquisition instruction is received.
15. The system of claim 14, further comprising: a cloud server;
the processing device is further configured to, when it determines that the person has lip action, extract features from the lip image to obtain lip language features of the person, extract features from the acquired sound to obtain sound features of the person, and send the lip language features and the sound features to the cloud server;
the cloud server is configured to input the lip language features and the sound features into a pre-trained recognition network to obtain an output result, obtain an interaction resource corresponding to the output result, and send the interaction resource to the processing device;
the processing device is further configured to execute a first type of interaction task based on the interaction resource.
CN201810826055.6A 2018-07-25 2018-07-25 Sound acquisition method, device, equipment and system Active CN110767228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810826055.6A CN110767228B (en) 2018-07-25 2018-07-25 Sound acquisition method, device, equipment and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810826055.6A CN110767228B (en) 2018-07-25 2018-07-25 Sound acquisition method, device, equipment and system

Publications (2)

Publication Number Publication Date
CN110767228A true CN110767228A (en) 2020-02-07
CN110767228B CN110767228B (en) 2022-06-03

Family

ID=69327247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810826055.6A Active CN110767228B (en) 2018-07-25 2018-07-25 Sound acquisition method, device, equipment and system

Country Status (1)

Country Link
CN (1) CN110767228B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
CN111914803A (en) * 2020-08-17 2020-11-10 华侨大学 Lip language keyword detection method, device, equipment and storage medium
CN113733846A (en) * 2021-08-17 2021-12-03 一汽奔腾轿车有限公司 Automobile air conditioning system based on lip language instruction
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5586215A (en) * 1992-05-26 1996-12-17 Ricoh Corporation Neural network acoustic and visual speech recognition system
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
CN102169642A (en) * 2011-04-06 2011-08-31 李一波 Interactive virtual teacher system having intelligent error correction function
US20140379351A1 (en) * 2013-06-24 2014-12-25 Sundeep Raniwala Speech detection based upon facial movements
US20150206535A1 (en) * 2012-08-10 2015-07-23 Honda Access Corp. Speech recognition method and speech recognition device
CN106782503A (en) * 2016-12-29 2017-05-31 天津大学 Automatic speech recognition method based on physiologic information in phonation
CN107123423A (en) * 2017-06-07 2017-09-01 微鲸科技有限公司 Voice pick device and multimedia equipment
CN108227903A (en) * 2016-12-21 2018-06-29 深圳市掌网科技股份有限公司 A kind of virtual reality language interactive system and method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5586215A (en) * 1992-05-26 1996-12-17 Ricoh Corporation Neural network acoustic and visual speech recognition system
US20110071830A1 (en) * 2009-09-22 2011-03-24 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
CN102169642A (en) * 2011-04-06 2011-08-31 李一波 Interactive virtual teacher system having intelligent error correction function
US20150206535A1 (en) * 2012-08-10 2015-07-23 Honda Access Corp. Speech recognition method and speech recognition device
US20140379351A1 (en) * 2013-06-24 2014-12-25 Sundeep Raniwala Speech detection based upon facial movements
CN108227903A (en) * 2016-12-21 2018-06-29 深圳市掌网科技股份有限公司 A kind of virtual reality language interactive system and method
CN106782503A (en) * 2016-12-29 2017-05-31 天津大学 Automatic speech recognition method based on physiologic information in phonation
CN107123423A (en) * 2017-06-07 2017-09-01 微鲸科技有限公司 Voice pick device and multimedia equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HOU, JEN-CHENG et al.: "Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks", 《IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE》 *
RONGFENG SU et al.: "Multimodal Learning Using 3D Audio-Visual Data for Audio-Visual Speech", 《2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP). PROCEEDINGS》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401250A (en) * 2020-03-17 2020-07-10 东北大学 Chinese lip language identification method and device based on hybrid convolutional neural network
EP4191579A4 (en) * 2020-08-14 2024-05-08 Huawei Technologies Co., Ltd. Electronic device and speech recognition method therefor, and medium
CN111914803A (en) * 2020-08-17 2020-11-10 华侨大学 Lip language keyword detection method, device, equipment and storage medium
CN111914803B (en) * 2020-08-17 2023-06-13 华侨大学 Lip language keyword detection method, device, equipment and storage medium
CN113733846A (en) * 2021-08-17 2021-12-03 一汽奔腾轿车有限公司 Automobile air conditioning system based on lip language instruction

Also Published As

Publication number Publication date
CN110767228B (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN110767228B (en) Sound acquisition method, device, equipment and system
CN109326289B (en) Wake-up-free voice interaction method, device, equipment and storage medium
EP3923273B1 (en) Voice recognition method and device, storage medium, and air conditioner
US9953634B1 (en) Passive training for automatic speech recognition
US20220139389A1 (en) Speech Interaction Method and Apparatus, Computer Readable Storage Medium and Electronic Device
CN103021409A (en) Voice activating photographing system
CN110473568B (en) Scene recognition method and device, storage medium and electronic equipment
CN111462756B (en) Voiceprint recognition method and device, electronic equipment and storage medium
CN110972112B (en) Subway running direction determining method, device, terminal and storage medium
US11355124B2 (en) Voice recognition method and voice recognition apparatus
CN110837758B (en) Keyword input method and device and electronic equipment
CN111540222A (en) Intelligent interaction method and device based on unmanned vehicle and unmanned vehicle
CN111722696B (en) Voice data processing method and device for low-power-consumption equipment
CN111009261B (en) Arrival reminding method, device, terminal and storage medium
US11762052B1 (en) Sound source localization
CN113643704A (en) Test method, upper computer, system and storage medium of vehicle-mounted machine voice system
CN116075888A (en) System and method for reducing latency in cloud services
US20210382586A1 (en) Vehicle having an intelligent user interface
CN110737422B (en) Sound signal acquisition method and device
CN112951219A (en) Noise rejection method and device
CN109617771B (en) Home control method and corresponding routing equipment
CN111477226A (en) Control method, intelligent device and storage medium
CN115312061A (en) Voice question-answer method and device in driving scene and vehicle-mounted terminal
CN115547352A (en) Electronic device, method, apparatus and medium for processing noise thereof
CN114420103A (en) Voice processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant