CN115762519A - Voice recognition method, device, equipment and storage medium - Google Patents

Voice recognition method, device, equipment and storage medium

Info

Publication number
CN115762519A
CN115762519A
Authority
CN
China
Prior art keywords
lip
voice
target
lip language
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211339220.8A
Other languages
Chinese (zh)
Inventor
瞿盛 (Qu Sheng)
安康 (An Kang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Goertek Technology Co Ltd
Original Assignee
Goertek Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Goertek Techology Co Ltd filed Critical Goertek Techology Co Ltd
Priority to CN202211339220.8A
Publication of CN115762519A
Legal status: Pending (current)

Landscapes

  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of voice recognition, and discloses a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium. The method comprises the following steps: acquiring lip language acquisition images fed back by positioning camera modules in a plurality of directions; determining a target lip language and the user's lip coordinates according to each lip language acquisition image; determining the voice pickup direction of a voice pickup array according to the user's lip coordinates; acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction; and recognizing the voice content of the collected voice according to the target lip language. In this way, the target lip language and the user's lip coordinates are determined from the lip language acquisition images fed back by positioning cameras in multiple directions, and the voice pickup direction is determined from the user's lip coordinates, achieving directional pickup of the collected voice; the voice content of the collected voice is then recognized based on the target lip language. Combining content from multiple modalities reduces noise interference from the environment, restores a clean signal, and improves both the listening experience and the accuracy of voice recognition.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.
Background
With the continuous development of artificial intelligence, voice interaction has gradually become one of the mainstream interaction technologies, and to improve the accuracy of voice recognition during interaction, noise interference needs to be reduced. When the environment is very noisy, the sound source is far away, or reverberation is severe, voice commands cannot be captured accurately even with additional microphone (mic) array hardware, so it is difficult to guarantee the accuracy of voice recognition with a mic array alone.
Disclosure of Invention
The invention mainly aims to provide a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium, so as to solve the technical problem in the prior art of improving the accuracy of voice recognition during voice interaction.
In order to achieve the above object, the present invention provides a speech recognition method, including:
acquiring lip language acquisition images fed back by positioning camera modules in a plurality of directions;
determining a target lip language and the user's lip coordinates according to each lip language acquisition image;
determining the voice pickup direction of a voice pickup array according to the user's lip coordinates;
acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction;
and recognizing the voice content of the collected voice according to the target lip language.
Optionally, the determining of the target lip language and the user's lip coordinates according to each lip language acquisition image includes:
combining images of the lip language acquisition images to obtain a target lip language image;
extracting the features of the target lip language image to determine the lip language features;
determining a target lip language according to the lip language features and a preset lip language identification model;
and determining the coordinates of the lips of the user according to the target lip language image.
Optionally, the determining lip coordinates of the user according to the target lip language image includes:
acquiring a comparison lip language image according to a preset image selection rule;
determining the three-dimensional position information of the face of the user according to the target lip language image and the target space grid;
and determining the lip coordinates of the user according to the three-dimensional face position information, the comparison lip language image and the target lip language image.
Optionally, the determining of the user's lip coordinates according to the face three-dimensional position information, the comparison lip language image and the target lip language image includes:
performing lip movement judgment on the user's lips according to the comparison lip language image and the target lip language image to determine whether there is effective lip movement;
when effective lip movement is determined from the comparison lip language image and the target lip language image, determining the relative lip position according to the target lip language image;
and determining the user's lip coordinates according to the relative lip position and the face three-dimensional position information.
Optionally, before determining the three-dimensional position information of the face of the user according to the target lip language image and the target spatial grid, the method further includes:
acquiring reference positions fed back by positioning camera modules in a plurality of directions;
calculating the space range of the target area according to each reference position;
and carrying out grid division on the target area according to the space range to obtain a target space grid.
Optionally, the recognizing the voice content of the collected voice according to the target lip language includes:
performing feature extraction on the collected voice to determine voice features;
determining a target audio according to the voice features and a preset voice recognition model;
performing frame-level matching on the target lip language and the target audio to obtain a frame-level matching result;
and determining the voice content according to the frame level matching result.
Optionally, after recognizing the voice content of the collected voice according to the target lip language, the method further includes:
adjusting the tracking direction of each positioning camera module according to the lip coordinates of the user to obtain each adjusted positioning camera module;
and acquiring new lip language acquisition images fed back by each adjusted positioning camera module, and performing voice recognition through the new lip language acquisition images.
In addition, to achieve the above object, the present invention further provides a speech recognition apparatus, including:
the acquisition module is used for acquiring lip language acquisition images fed back by the positioning camera modules in a plurality of directions;
the determining module is used for determining the target lip language and the user's lip coordinates according to each lip language acquisition image;
the determining module is further used for determining the voice pick-up direction of the voice pick-up array according to the user lip coordinates;
the acquisition module is further used for acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction;
and the recognition module is used for recognizing the voice content of the collected voice according to the target lip language.
In addition, to achieve the above object, the present invention further provides a voice recognition apparatus, including: a memory, a processor and a speech recognition program stored on the memory and executable on the processor, the speech recognition program being configured to implement a speech recognition method as described above.
Furthermore, to achieve the above object, the present invention further provides a storage medium having a speech recognition program stored thereon, the speech recognition program implementing the speech recognition method as described above when executed by a processor.
The method comprises the following steps: acquiring lip language acquisition images fed back by positioning camera modules in a plurality of directions; determining a target lip language and the user's lip coordinates according to each lip language acquisition image; determining the voice pickup direction of the voice pickup array according to the user's lip coordinates; acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction; and recognizing the voice content of the collected voice according to the target lip language. In this way, the target lip language and the user's lip coordinates are determined from the lip language acquisition images fed back by positioning cameras in multiple directions, and the voice pickup direction is determined from the user's lip coordinates, achieving directional pickup of the collected voice; the voice content of the collected voice is then recognized based on the target lip language. Combining content from multiple modalities reduces noise interference from the environment, fully guarantees far-field voice interaction and the noise reduction effect, makes wear-free voice interaction more complete with good robustness, restores a clean signal, and improves the listening experience and the accuracy of voice recognition.
Drawings
FIG. 1 is a schematic diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a first embodiment of a speech recognition method according to the present invention;
FIG. 3 is a diagram of a speech recognition system according to an embodiment of the speech recognition method of the present invention;
FIG. 4 is a flowchart illustrating a second embodiment of a speech recognition method according to the present invention;
FIG. 5 is a schematic overall flowchart of a speech recognition method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a positioning process of an embodiment of a speech recognition method according to the present invention;
FIG. 7 is a block diagram of a first embodiment of a speech recognition device according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a speech recognition device in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the voice recognition device may include: a processor 1001 such as a Central Processing Unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 implements connection and communication among these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless-Fidelity (Wi-Fi) interface). The memory 1005 may be a Random Access Memory (RAM) or a Non-Volatile Memory (NVM) such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 is not intended to be limiting of speech recognition devices and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a storage medium, may include therein an operating system, a network communication module, a user interface module, and a voice recognition program.
In the voice recognition apparatus shown in fig. 1, the network interface 1004 is mainly used for data communication with a network server; the user interface 1003 is mainly used for data interaction with a user; the processor 1001 and the memory 1005 of the voice recognition apparatus of the present invention may be provided in the voice recognition apparatus, which calls the voice recognition program stored in the memory 1005 through the processor 1001 and executes the voice recognition method provided by the embodiment of the present invention.
An embodiment of the present invention provides a speech recognition method, and referring to fig. 2, fig. 2 is a flowchart illustrating a first embodiment of a speech recognition method according to the present invention.
The voice recognition method comprises the following steps:
step S10: and lip language acquisition images fed back by the positioning camera modules in a plurality of directions are acquired.
It should be noted that the execution body of this embodiment is the controller in a speech recognition system. The system comprises a voice pickup array, the controller, and multiple positioning camera modules. Each positioning camera module is a cooperative camera module with UWB (Ultra Wide Band); the number of positioning camera modules is not less than 3, so that they can acquire images of the user from different viewing angles, and each module can be regarded as a miniature mobile UWB base station. As shown in fig. 3, when a user (i.e., the speaker) speaks, the positioning camera modules capture images of the user's lips; the controller acquires the lip language acquisition images fed back by the positioning camera modules in multiple directions, determines the target lip language and the user's lip coordinates according to each image, determines the voice pickup direction of the voice pickup array according to the user's lip coordinates, acquires the collected voice fed back by the voice pickup array according to the voice pickup direction, and finally recognizes the voice content of the collected voice based on the target lip language.
It can be understood that each positioning camera module is a cooperative camera module with UWB, equivalent to a miniature mobile UWB base station. To ensure that lip acquisition images of the user can be obtained from different viewing angles, the number of positioning camera modules is not less than 3, and they are located at different viewing angles and orientations. The cooperation of multiple positioning camera modules provides spatially located feedback of lip movement not only when the user is stationary but also while the user moves, so multi-view lip-movement imaging and sound pickup can be satisfied more flexibly and accurately. It also enables dynamic calibration of the spatial region position, making the detection range variable rather than limited to a fixed space.
In specific implementation, when the user starts voice interaction, the positioning camera modules in all directions start image acquisition to obtain lip language acquisition images containing the user's lips. The images fed back from multiple directions include not only frontal views of the user's lips but also views from non-specific postures and positions.
It should be noted that the controller acquires lip language acquisition images fed back by the positioning camera modules in multiple directions in real time.
Step S20: determining the target lip language and the user's lip coordinates according to each lip language acquisition image.
The target lip language refers to the utterance content determined from the user's current lip movements, and the user's lip coordinates are three-dimensional coordinates determined based on the relative position of the user's face within the target space grid.
It can be understood that the controller performs image stitching, correction, and view-angle conversion on the lip language acquisition images from all directions to obtain a wide-angle stitched image, and then performs lip feature extraction and calculation on this stitched image, so as to determine the target lip language and the user's lip coordinates.
Step S30: determining the voice pickup direction of the voice pickup array according to the user's lip coordinates.
It should be noted that the voice pickup array is a microphone array that picks up the user's voice. After determining the user's lip coordinates, the controller uses them as negative-feedback input for the main lobe direction of the beam formed by the voice pickup array; this main lobe direction is the voice pickup direction.
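As an illustration of this step, below is a minimal sketch of steering a delay-and-sum beamformer's main lobe toward the user's lip coordinates. The array geometry, sample rate, and names such as `mic_positions` are illustrative assumptions, not details given in this disclosure.
```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in room-temperature air

def steering_delays(mic_positions: np.ndarray, lip_xyz: np.ndarray,
                    sample_rate: int = 16000) -> np.ndarray:
    """Per-microphone delays (in samples) that align the array on lip_xyz."""
    dists = np.linalg.norm(mic_positions - lip_xyz, axis=1)  # meters
    delays = (dists - dists.min()) / SPEED_OF_SOUND          # seconds
    return delays * sample_rate                              # samples

def delay_and_sum(frames: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """Average the channels after integer-sample alignment.

    frames: (num_mics, num_samples) array holding one block of mic signals.
    """
    num_samples = frames.shape[1]
    out = np.zeros(num_samples)
    for ch, d in zip(frames, np.round(delays).astype(int)):
        out[d:] += ch[:num_samples - d] if d > 0 else ch
    return out / len(frames)

# Example: 4-mic square array; lips located 1.5 m in front of it.
mics = np.array([[0, 0, 0], [0.05, 0, 0], [0, 0.05, 0], [0.05, 0.05, 0]])
lips = np.array([0.025, 0.025, 1.5])
print(steering_delays(mics, lips))
```
Feeding updated lip coordinates into `steering_delays` on every frame would realize the negative-feedback steering described above.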
Step S40: acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction.
It should be noted that the voice pickup direction of the voice pickup array is precisely aimed at the user for voice pickup, which ensures that the collected voice is a high-quality voice signal acquired directly from the sound source position; the collected voice is then sent to the controller.
Step S50: recognizing the voice content of the collected voice according to the target lip language.
It should be noted that, after the controller obtains the target lip language and the collected voice, the controller performs audio-visual cooperative recognition based on the target lip language and the collected voice, so as to obtain an interactive content when performing voice interaction on the user, where the interactive content during voice interaction is the voice content.
It can be understood that, in order to ensure the accuracy of the speech recognition process, further, the recognizing of the voice content of the collected voice according to the target lip language includes: performing feature extraction on the collected voice to determine voice features; determining a target audio according to the voice features and a preset voice recognition model; performing frame-level matching on the target lip language and the target audio to obtain a frame-level matching result; and determining the voice content according to the frame-level matching result.
In a specific implementation, the preset speech recognition model is obtained by training a large amount of sample speech and speech content corresponding to the sample speech, and the preset speech recognition model can acquire corresponding speech audio content according to input speech characteristics.
It should be noted that, after the controller acquires the collected voice, it performs feature extraction on it to obtain the voice features, which include but are not limited to the short-time zero-crossing rate, short-time energy, short-time autocorrelation function, short-time average amplitude, spectral difference amplitude, spectral centroid, spectral width, and mel-frequency cepstrum coefficients of the collected voice.
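For illustration, here is a hedged sketch of extracting a few of the frame-level features named above, assuming a 16 kHz mono signal and librosa for the mel-frequency cepstrum coefficients; the frame and hop sizes are assumptions.
```python
import numpy as np
import librosa  # assumed available; the MFCCs could also be computed by hand

def short_time_features(wav: np.ndarray, sr: int = 16000,
                        frame: int = 400, hop: int = 160):
    """Return short-time energy, zero-crossing rate, and MFCCs per frame."""
    frames = librosa.util.frame(wav, frame_length=frame, hop_length=hop)
    energy = (frames ** 2).sum(axis=0)                   # short-time energy
    # Fraction of sign changes per frame: a simple zero-crossing rate.
    zcr = (np.abs(np.diff(np.sign(frames), axis=0)) > 0).mean(axis=0)
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                                n_fft=frame, hop_length=hop)
    return energy, zcr, mfcc
```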
It can be understood that after the voice features are obtained, they are input into the preset voice recognition model for content recognition, so as to determine the target audio content corresponding to the collected voice; the target audio content approximates the interactive content spoken by the user during voice interaction.
In the specific implementation, the number of image frames and voice frames obtained per second is the same, so the target lip language and the target audio each correspond to specific frames and frame-level information. Frame-level matching is performed on the target lip language and the target audio based on this attached frame-level information, yielding a frame-level matching result in which target lip language and target audio with the same content are matched, and the voice content is determined from these matches. For example, if the target lip language is determined to be "me" from the first through third frames of lip language acquisition images between 12:00:00 and 12:00:05, and the target audio is determined to be "me" from the first through third frames of collected voice between 12:00:00 and 12:00:05, then the two match with the same content, and the voice content is "me".
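Below is a minimal sketch of this frame-level matching, assuming both modalities emit one label per frame at the same frame rate; the per-frame agreement rule and the label format are illustrative assumptions.
```python
def frame_level_match(lip_frames: list[str], audio_frames: list[str]) -> str:
    """Keep the content that both modalities agree on, frame by frame."""
    assert len(lip_frames) == len(audio_frames)  # same frames per second
    agreed = [lip for lip, audio in zip(lip_frames, audio_frames)
              if lip and lip == audio]           # non-empty agreements only
    # Collapse consecutive duplicates left over from per-frame labeling.
    collapsed = [c for i, c in enumerate(agreed)
                 if i == 0 or c != agreed[i - 1]]
    return "".join(collapsed)

# The "me" example above: frames 1-3 agree on "me" in both modalities.
print(frame_level_match(["me", "me", "me"], ["me", "me", "me"]))  # -> me
```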
It should be noted that, in order to achieve dynamic tracking of the user's lips by the positioning camera modules and thereby improve the accuracy of voice recognition, after recognizing the voice content of the collected voice according to the target lip language, the method further includes: adjusting the tracking direction of each positioning camera module according to the user's lip coordinates to obtain the adjusted positioning camera modules; and acquiring new lip language acquisition images fed back by the adjusted positioning camera modules, and performing voice recognition on the new lip language acquisition images.
It can be understood that the controller uses the user's lip coordinates as negative-feedback input for each positioning camera module, adjusting each module's tracking direction in real time and dynamically calibrating the spatial region position, so that the detection range is variable and not limited to a fixed space. Each adjusted positioning camera module then acquires images in real time and sends new lip language acquisition images to the controller, which performs voice recognition on them.
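As a sketch of this adjustment, assuming each module's position is known in the same coordinate frame as the user's lip coordinates, the pan and tilt angles that aim a module at the lips could be computed as follows; the pose representation is an illustrative assumption.
```python
import numpy as np

def pan_tilt_toward(camera_xyz: np.ndarray, lip_xyz: np.ndarray):
    """Pan (azimuth) and tilt (elevation) in degrees that aim at lip_xyz."""
    v = lip_xyz - camera_xyz
    pan = np.degrees(np.arctan2(v[1], v[0]))
    tilt = np.degrees(np.arctan2(v[2], np.hypot(v[0], v[1])))
    return pan, tilt

# Two modules at different positions track the same lip coordinates.
for cam in (np.array([0.0, 0.0, 2.0]), np.array([3.0, 0.0, 2.0])):
    print(pan_tilt_toward(cam, np.array([1.5, 1.0, 1.6])))
```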
In this embodiment, lip language acquisition images fed back by positioning camera modules in a plurality of directions are acquired; a target lip language and the user's lip coordinates are determined according to each lip language acquisition image; the voice pickup direction of the voice pickup array is determined according to the user's lip coordinates; the collected voice fed back by the voice pickup array is acquired according to the voice pickup direction; and the voice content of the collected voice is recognized according to the target lip language. In this way, the target lip language and the user's lip coordinates are determined from the lip language acquisition images fed back by positioning cameras in multiple directions, and the voice pickup direction is determined from the user's lip coordinates, achieving directional pickup of the collected voice; the voice content of the collected voice is then recognized based on the target lip language. Combining content from multiple modalities reduces noise interference from the environment, fully guarantees far-field voice interaction and the noise reduction effect, makes wear-free voice interaction more complete with good robustness, restores a clean signal, and improves the listening experience and the accuracy of voice recognition.
Referring to fig. 4, fig. 4 is a flowchart illustrating a speech recognition method according to a second embodiment of the present invention.
Based on the first embodiment, the step S20 in the speech recognition method of this embodiment includes:
step S21: and carrying out image combination on each lip language acquisition image to obtain a target lip language image.
It should be noted that the lip language acquisition images acquired by the controller come from each direction, and they need to be stitched, corrected, and combined through view-angle conversion to obtain a wide-angle stitched image; this wide-angle stitched image is the target lip language image.
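A minimal sketch of this combination step with OpenCV's panorama stitcher is shown below; folding the correction and view-angle conversion into the stitcher is a simplifying assumption rather than the exact pipeline of this disclosure.
```python
import cv2

def combine_captures(images):
    """Stitch two or more overlapping captures into one wide-angle image."""
    stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
    status, pano = stitcher.stitch(images)
    return pano if status == cv2.Stitcher_OK else None  # None on failure
```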
Step S22: performing feature extraction on the target lip language image to determine lip language features.
It should be noted that, after acquiring the target lip language image, the controller detects the user's lips and the corresponding image block in it, and performs feature extraction on that image block to determine the lip language features, which include but are not limited to the outer lip height, inner lip height, and lip width of the user's lips, together with the first derivatives of these three over time.
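For illustration, the listed lip language features could be computed from 2-D lip contour landmarks as sketched below; the landmark format and the frame rate are illustrative assumptions.
```python
import numpy as np

def lip_geometry(outer: np.ndarray, inner: np.ndarray) -> np.ndarray:
    """outer/inner: (N, 2) outer and inner lip contour points, one frame."""
    outer_height = outer[:, 1].max() - outer[:, 1].min()
    inner_height = inner[:, 1].max() - inner[:, 1].min()
    lip_width = outer[:, 0].max() - outer[:, 0].min()
    return np.array([outer_height, inner_height, lip_width])

def lip_features(frames_outer, frames_inner, fps: float = 25.0) -> np.ndarray:
    """Per-frame geometry plus its first derivative over time (6-dim)."""
    geo = np.stack([lip_geometry(o, i)
                    for o, i in zip(frames_outer, frames_inner)])
    d_geo = np.gradient(geo, 1.0 / fps, axis=0)  # first time derivative
    return np.hstack([geo, d_geo])
```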
Step S23: determining the target lip language according to the lip language features and a preset lip language recognition model.
It should be noted that the preset lip language recognition model is a model trained on a large number of sample lip language features and their corresponding contents. Inputting the lip language features into the preset lip language recognition model yields the speaking content corresponding to those features; this speaking content is the target lip language.
Step S24: determining the user's lip coordinates according to the target lip language image.
It should be noted that the grid position of the user's face in the target space grid can be determined from the target lip language image, and the user's lip coordinates can be determined based on the relative position of the user's lips on the face together with that grid position.
It is understood that, in order to obtain accurate user lip coordinates, further, the determining the user lip coordinates according to the target lip language image includes: acquiring a comparison lip language image according to a preset image selection rule; determining the three-dimensional face position information of the user according to the target lip language image and the target space grid; and determining the lip coordinates of the user according to the three-dimensional face position information, the comparison lip language image and the target lip language image.
In a specific implementation, the preset image selection rule refers to selecting, within the unit time, the lip language image one frame before the target lip language image; that previous frame is the comparison lip language image. The target space grid is determined by networking calibrated on the multiple positioning camera modules.
It should be noted that the grid position of the user's face in the target space grid is extracted from the target lip language image so as to determine the three-dimensional spatial position information of the user's face; this three-dimensional spatial position information is the face three-dimensional position information.
It is understood that after the face three-dimensional position information is determined, the lip coordinates of the user can be determined based on the face three-dimensional position information, the comparison lip language image and the target lip language image.
In a specific implementation, in order to accurately determine the position based on the face three-dimensional position information, the comparison lip language image, and the target lip language image, further, the determining of the user's lip coordinates according to these includes: performing lip movement judgment on the user's lips according to the comparison lip language image and the target lip language image to determine whether there is effective lip movement; when effective lip movement is determined from the comparison lip language image and the target lip language image, determining the relative lip position according to the target lip language image; and determining the user's lip coordinates according to the relative lip position and the face three-dimensional position information.
The comparison lip language image and the target lip language image are compared to determine whether the user's lips have changed; a change indicates effective lip movement. In that case, the relative position of the user's lips on the face is calculated from the target lip language image; this is the relative lip position. From the relative lip position and the face three-dimensional position information, the three-dimensional coordinates of the user's lips within the networking range of the multiple positioning camera modules can be calculated; these three-dimensional coordinates are the user's lip coordinates.
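A minimal sketch of these two steps follows, assuming grayscale lip-region patches for the comparison and a world-coordinate offset for the lips' relative position on the face; the difference threshold is an illustrative assumption.
```python
import numpy as np

def is_valid_lip_motion(cmp_patch: np.ndarray, tgt_patch: np.ndarray,
                        thresh: float = 8.0) -> bool:
    """Effective lip movement: mean absolute pixel change exceeds thresh."""
    diff = np.abs(tgt_patch.astype(float) - cmp_patch.astype(float))
    return float(diff.mean()) > thresh

def user_lip_coordinates(face_xyz: np.ndarray,
                         lip_offset_on_face: np.ndarray) -> np.ndarray:
    """face_xyz: face position in the target space grid (world units);
    lip_offset_on_face: lips' relative position on the face, same units."""
    return face_xyz + lip_offset_on_face
```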
It can be understood that, in order to implement the dynamic calibration of the position of the spatial region and the accuracy of the dynamic tracking, further before determining the three-dimensional position information of the face of the user according to the target lip language image and the target spatial grid, the method further includes: acquiring reference positions fed back by positioning camera modules in a plurality of directions; calculating the space range of the target area according to each reference position; and carrying out grid division on the target area according to the space range to obtain a target space grid.
In the specific implementation, the UWB unit on each positioning camera module calibrates the module's position. After networking based on the reference positions fed back by each positioning camera module, the controller can determine the spatial range of the area in which the speech recognition system can currently perform voice acquisition; this area is the target area. The controller divides the target area into grids according to the spatial range, obtaining the target space grid.
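The grid construction could look like the following sketch, under the assumptions that the camera modules' reference positions bound the target area and that the grid uses a fixed cell size.
```python
import numpy as np

def build_target_grid(ref_positions: np.ndarray, cell: float = 0.1):
    """ref_positions: (M, 3) camera reference points, M >= 3.

    Returns one array of cell boundaries per axis (the target space grid).
    """
    lo = ref_positions.min(axis=0)  # spatial range of the target area
    hi = ref_positions.max(axis=0)
    return [np.arange(lo[d], hi[d] + cell, cell) for d in range(3)]

def grid_cell_of(point: np.ndarray, edges) -> tuple:
    """Index of the grid cell containing a 3-D point."""
    return tuple(int(np.searchsorted(e, point[d]) - 1)
                 for d, e in enumerate(edges))
```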
It should be noted that, as shown in figs. 5 and 6, the UWB unit in each positioning camera module calibrates the module's reference position and feeds it back to the controller, which calculates the spatial range of the target area and divides the spatial grid. The controller performs stitching, correction, and view-angle conversion on each lip language acquisition image to obtain the target lip language image, extracts the grid position of the user's face in the target space grid from it, and determines the face three-dimensional position information. The previous frame image and the target lip language image within the unit time are compared to determine whether there is effective lip movement. When the lip movement is effective, the relative position of the user's lips on the face is calculated from the target lip language image, and the user's lip coordinates are determined from the relative lip position and the face three-dimensional position information. The voice pickup direction of the voice pickup array and the tracking direction of each positioning camera module are then adjusted according to the user's lip coordinates, and each positioning camera module recalibrates its position after adjusting its tracking direction. The controller also extracts continuous lip language features from the target lip language image and inputs them into the preset lip language recognition model to obtain the target lip language, extracts voice features from the collected voice and inputs them into the preset voice recognition model to determine the target audio, and performs audio-visual cooperative recognition based on the target audio and the target lip language, finally obtaining the voice recognition result and thereby the voice content of the user's voice interaction.
In this embodiment, the target lip language image is obtained by combining the lip language acquisition images; feature extraction is performed on the target lip language image to determine the lip language features; the target lip language is determined according to the lip language features and the preset lip language recognition model; and the user's lip coordinates are determined according to the target lip language image. Because the target lip language image is a wide-angle combination of all the lip language acquisition images, determining the target lip language and the user's lip coordinates from it ensures the accuracy of the subsequent dynamic image tracking, voice acquisition, and voice recognition.
In addition, referring to fig. 7, an embodiment of the present invention further provides a speech recognition apparatus, where the speech recognition apparatus includes:
and the acquisition module 10 is used for acquiring lip language acquisition images fed back by the positioning camera modules in multiple directions.
And the determining module 20 is configured to determine coordinates of the target lip language and the user lip according to each lip language acquisition image.
The determining module 20 is further configured to determine a voice pickup direction of the voice pickup array according to the lip coordinates of the user.
The obtaining module 10 is further configured to obtain a collected voice fed back by the voice pickup array according to the voice pickup direction.
And the recognition module 30 is configured to recognize the voice content of the collected voice according to the target lip language.
In this embodiment, lip language acquisition images fed back by positioning camera modules in a plurality of directions are acquired; a target lip language and the user's lip coordinates are determined according to each lip language acquisition image; the voice pickup direction of the voice pickup array is determined according to the user's lip coordinates; the collected voice fed back by the voice pickup array is acquired according to the voice pickup direction; and the voice content of the collected voice is recognized according to the target lip language. In this way, the target lip language and the user's lip coordinates are determined from the lip language acquisition images fed back by positioning cameras in multiple directions, and the voice pickup direction is determined from the user's lip coordinates, achieving directional pickup of the collected voice; the voice content of the collected voice is then recognized based on the target lip language. Combining content from multiple modalities reduces noise interference from the environment, fully guarantees far-field voice interaction and the noise reduction effect, makes wear-free voice interaction more complete with good robustness, restores a clean signal, and improves the listening experience and the accuracy of voice recognition.
In an embodiment, the determining module 20 is further configured to perform image combination on each lip language acquisition image to obtain a target lip language image;
performing feature extraction on the target lip language image to determine lip language features;
determining a target lip language according to the lip language features and a preset lip language recognition model;
and determining the lip coordinates of the user according to the target lip language image.
In an embodiment, the determining module 20 is further configured to obtain a comparison lip language image according to a preset image selection rule;
determining the three-dimensional face position information of the user according to the target lip language image and the target space grid;
and determining the lip coordinates of the user according to the three-dimensional face position information, the comparison lip language image and the target lip language image.
In an embodiment, the determining module 20 is further configured to perform lip movement judgment on the user's lips according to the comparison lip language image and the target lip language image, and determine whether there is effective lip movement;
when effective lip movement is determined from the comparison lip language image and the target lip language image, determine the relative lip position according to the target lip language image;
and determine the user's lip coordinates according to the relative lip position and the face three-dimensional position information.
In an embodiment, the determining module 20 is further configured to obtain reference positions fed back by positioning camera modules in multiple orientations;
calculating the space range of the target area according to each reference position;
and carrying out grid division on the target area according to the space range to obtain a target space grid.
In an embodiment, the recognition module 30 is further configured to perform feature extraction on the collected voice, and determine a voice feature;
determining a target audio according to the voice features and a preset voice recognition model;
performing frame-level matching on the target lip language and the target audio to obtain a frame-level matching result;
and determining the voice content according to the frame level matching result.
In an embodiment, the recognition module 30 is further configured to adjust the tracking direction of each positioning camera module according to the user's lip coordinates, so as to obtain the adjusted positioning camera modules;
and acquiring new lip language acquisition images fed back by the adjusted positioning camera modules, and performing voice recognition through the new lip language acquisition images.
Since the present apparatus employs all technical solutions of all the above embodiments, at least all the beneficial effects brought by the technical solutions of the above embodiments are achieved, and are not described in detail herein.
Furthermore, an embodiment of the present invention further provides a storage medium, where a speech recognition program is stored, and the speech recognition program, when executed by a processor, implements the steps of the speech recognition method as described above.
Since the storage medium adopts all technical solutions of all the embodiments, at least all the beneficial effects brought by the technical solutions of the embodiments are achieved, and no further description is given here.
It should be noted that the above-mentioned work flows are only illustrative and do not limit the scope of the present invention, and in practical applications, those skilled in the art may select some or all of them according to actual needs to implement the purpose of the solution of the present embodiment, and the present invention is not limited herein.
In addition, the technical details that are not described in detail in this embodiment may refer to the speech recognition method provided in any embodiment of the present invention, and are not described herein again.
Furthermore, it should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (e.g. Read Only Memory (ROM)/RAM, magnetic disk, optical disk), and includes several instructions for enabling a terminal device (e.g. a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A speech recognition method, characterized in that the speech recognition method comprises:
acquiring lip language acquisition images fed back by positioning camera modules in a plurality of directions;
determining a target lip language and the user's lip coordinates according to each lip language acquisition image;
determining the voice pick-up direction of the voice pick-up array according to the user lip coordinates;
acquiring collected voice fed back by the voice pickup array according to the voice pickup direction;
and recognizing the voice content of the collected voice according to the target lip language.
2. The speech recognition method of claim 1, wherein the determining of the target lip language and the user's lip coordinates according to each lip language acquisition image comprises:
combining images of the lip language acquisition images to obtain a target lip language image;
extracting the features of the target lip language image to determine the lip language features;
determining a target lip language according to the lip language features and a preset lip language recognition model;
and determining the coordinates of the lips of the user according to the target lip language image.
3. The speech recognition method of claim 2, wherein said determining user lip coordinates from the target lip language image comprises:
acquiring a comparison lip language image according to a preset image selection rule;
determining the three-dimensional face position information of the user according to the target lip language image and the target space grid;
and determining the lip coordinates of the user according to the three-dimensional face position information, the comparison lip language image and the target lip language image.
4. The speech recognition method of claim 3, wherein the determining user lip coordinates from the three-dimensional face position information, the comparison lip language image and the target lip language image comprises:
performing lip movement judgment on the user's lips according to the comparison lip language image and the target lip language image to determine whether there is effective lip movement;
when effective lip movement is determined from the comparison lip language image and the target lip language image, determining the relative lip position according to the target lip language image;
and determining the lip coordinates of the user according to the relative lip position and the three-dimensional face position information.
5. The speech recognition method of claim 3, wherein before determining the three-dimensional position information of the face of the user according to the target lip language image and the target space grid, the method further comprises:
acquiring reference positions fed back by positioning camera modules in a plurality of directions;
calculating the space range of the target area according to each reference position;
and carrying out grid division on the target area according to the space range to obtain a target space grid.
6. The speech recognition method of claim 1, wherein the recognizing the voice content of the collected voice according to the target lip language comprises:
carrying out feature extraction on the collected voice to determine voice features;
determining a target audio according to the voice features and a preset voice recognition model;
performing frame-level matching on the target lip language and the target audio to obtain a frame-level matching result;
and determining the voice content according to the frame level matching result.
7. The speech recognition method according to any one of claims 1 to 6, wherein after recognizing the speech content of the captured speech according to the target lip language, the method further comprises:
adjusting the tracking direction of each positioning camera module according to the lip coordinates of the user to obtain each adjusted positioning camera module;
and acquiring new lip language acquisition images fed back by the adjusted positioning camera modules, and performing voice recognition through the new lip language acquisition images.
8. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises:
the acquisition module is used for acquiring lip language acquisition images fed back by the positioning camera modules in a plurality of directions;
the determining module is used for determining the target lip language and the user's lip coordinates according to each lip language acquisition image;
the determining module is further used for determining the voice pick-up direction of the voice pick-up array according to the user lip coordinates;
the acquisition module is further used for acquiring the collected voice fed back by the voice pickup array according to the voice pickup direction;
and the recognition module is used for recognizing the voice content of the collected voice according to the target lip language.
9. A speech recognition device, characterized in that the device comprises: a memory, a processor, and a speech recognition program stored on the memory and executable on the processor, the speech recognition program configured to implement the speech recognition method of any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium has stored thereon a speech recognition program which, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
CN202211339220.8A (filed 2022-10-28, priority date 2022-10-28) — Voice recognition method, device, equipment and storage medium — published as CN115762519A (pending)

Priority Applications (1)

Application Number: CN202211339220.8A — Priority/Filing Date: 2022-10-28 — Title: Voice recognition method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202211339220.8A — Priority/Filing Date: 2022-10-28 — Title: Voice recognition method, device, equipment and storage medium

Publications (1)

Publication Number: CN115762519A — Publication Date: 2023-03-07

Family

ID=85354212

Family Applications (1)

Application Number: CN202211339220.8A — Status: Pending — Title: Voice recognition method, device, equipment and storage medium — Priority/Filing Date: 2022-10-28

Country Status (1)

Country Link
CN (1) CN115762519A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721661A (en) * 2023-08-10 2023-09-08 深圳中检实验室技术有限公司 Man-machine interaction management system for intelligent safe biological cabinet
CN116721661B (en) * 2023-08-10 2023-10-31 深圳中检实验室技术有限公司 Man-machine interaction management system for intelligent safe biological cabinet


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination