CN111081222A - Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus - Google Patents


Info

Publication number
CN111081222A
Authority
CN
China
Prior art keywords
data
audio data
model
training
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911399924.2A
Other languages
Chinese (zh)
Inventor
刘洋 (Liu Yang)
唐大闰 (Tang Darun)
Current Assignee
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201911399924.2A priority Critical patent/CN111081222A/en
Publication of CN111081222A publication Critical patent/CN111081222A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a speech recognition method, a speech recognition apparatus, a storage medium, and an electronic apparatus. By acquiring target audio data and recognizing it with a first model trained on audio data into which noise data has been mixed, the invention addresses the problem in the related art that speech recognition is difficult to perform effectively in a real environment, achieving the technical effects of making the speech model better suited to the real environment and improving the accuracy of speech recognition in that environment.

Description

Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus
Technical Field
The present invention relates to the field of communications, and in particular, to a voice recognition method, apparatus, storage medium, and electronic apparatus.
Background
The quality of the training data plays a crucial role in a speech recognition model. Recognition accuracy can be guaranteed in a quiet environment, but in everyday real environments the model is disturbed by ambient noise, which greatly reduces accuracy. Existing labeled training data cannot effectively simulate the real environment, so speech recognition in the real environment is difficult to perform effectively.
No effective solution has yet been proposed in the related art for the problem that speech recognition is difficult to perform effectively in a real environment.
Disclosure of Invention
Embodiments of the present invention provide a speech recognition method, apparatus, storage medium, and electronic apparatus, so as to at least solve the problem in the related art that it is difficult to perform speech recognition effectively in a real environment.
According to an embodiment of the present invention, there is provided a speech recognition method including: acquiring target audio data; and recognizing the target audio data by using a first model and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
According to another embodiment of the present invention, there is provided a speech recognition apparatus including: an acquisition module configured to acquire target audio data; and a processing module configured to recognize the target audio data by using a first model and determine the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, target audio data is recognized by a first model trained on audio data into which noise data has been mixed: the mixed noise simulates data of the real environment and is added to the audio data used for training, and the recognized voice corresponding to the target audio data is determined. This solves the problem in the related art that speech recognition is difficult to perform effectively in a real environment, and achieves the technical effects of making the speech model better suited to the real environment and improving recognition accuracy in that environment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of speech recognition according to an embodiment of the present invention;
fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking the example of the operation on the mobile terminal, fig. 1 is a hardware structure block diagram of the mobile terminal of a speech recognition method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the speech recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a speech recognition method operating in a terminal or a server is provided. Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention; as shown in fig. 2, the flow includes the following steps:
Step S202, acquiring target audio data;
Step S204, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, in this embodiment, the audio data used for training may include, but is not limited to, pre-collected audio data, for example, voice data collected by a sound collection device; the collected audio data is labeled and used as training data from which the first model is obtained through machine learning. The training data may also include, but is not limited to, labeled audio data recorded in a quiet environment, which may be purchased or obtained free of charge and is likewise used as training data for the first model. The different audio data may be used as the same or different sets of training data from which the first model is obtained through machine learning training.
Alternatively, in the present embodiment, the first audio data is mixed with noise data. The noise data may include, but is not limited to, artificially synthesized noise data or noise data collected by a sound collection device from a real environment with heavy foot traffic and abundant noise, such as a restaurant, a transportation station, a school, or a hospital.
Through the above steps, the target audio data is recognized by a first model trained on audio data into which noise data has been mixed: the mixed noise simulates data of the real environment and is added to the audio data used for training, and the recognized voice corresponding to the target audio data is determined. This solves the problem in the related art that speech recognition is difficult to perform effectively in a real environment, and achieves the technical effects of making the speech model better suited to the real environment and improving recognition accuracy in that environment.
In an optional embodiment, before the target audio data is recognized using the first model and the recognized speech corresponding to the target audio data is determined, the method further comprises: acquiring the noise data and original audio data emitted by a target object; mixing the noise data and the original audio data to obtain the first audio data; and training a second model using the first set of data comprising the first audio data and the other sets of data included in the multiple sets of data, to obtain the first model.
Alternatively, in the present embodiment, the target object may include, but is not limited to, a target object capable of making voice information, such as attendants and customers in a restaurant, doctors and patients in a hospital, teachers and students in a school, and the like. The target object may also include, but is not limited to, a voice playing device, such as a speaker, a sound device, an earphone, etc., which are just examples, and the present invention is not limited to the target object and the environment where the target object is located.
Optionally, in this embodiment, the mixing of the noise data and the original audio data may be performed in a preset manner, for example, the mixing of the audio data is performed in a manner of different signal-to-noise ratios, reverberation, audio speeds, and the like, so as to obtain the first audio data for training.
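The patent does not give a formula for the signal-to-noise-ratio mixing it describes. A minimal sketch of mixing noise into clean audio at a target SNR, assuming equal-rate numpy sample arrays (the helper name `mix_at_snr` is illustrative, not from the patent), might look like:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Hypothetical helper for illustration only."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or truncate the noise recording to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power needed so that 10*log10(p_speech / p_noise') == snr_db.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
```

Sweeping `snr_db` over several values (e.g. 0, 5, 10, 20 dB) for each clean utterance is one common way to produce the "different signal-to-noise ratios" mentioned above.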
In an alternative embodiment, obtaining the noise data comprises:
Optionally, in this embodiment, the recording device may include, but is not limited to, a hi-fi recording pen, a recording phone, a recording card, a recording box, a recorder, a recording system, and the like.
In an alternative embodiment, the predetermined location comprises at least one of: a table of a restaurant, a wall of the restaurant, and a position a predetermined distance from a monitoring camera of the restaurant; and/or the predetermined time period comprises at least one of: an off-peak period of the restaurant's seating rate and a peak period of the restaurant's seating rate.
Optionally, in this embodiment, the length of the predetermined time period may be set according to actual conditions. For example, if the restaurant's peak period lasts 1 hour, the predetermined time period is set to 1 hour; if it lasts 2 hours, the predetermined time period is set to 2 hours. This is just an example; the length of the predetermined time period may be adjusted as needed, and the present disclosure imposes no limitation on it.
In an alternative embodiment, mixing the noise data and the original audio data to obtain the first audio data comprises: mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, a predetermined audio speed.
Optionally, in this embodiment, the predetermined signal-to-noise ratio, the predetermined reverberation, and the predetermined audio speed may be set according to actual needs, and the specific values of the predetermined mixing parameter are not limited in any way in the present invention.
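Of the predetermined mixing parameters, audio-speed perturbation is the simplest to sketch. The patent does not specify a method; one assumption-laden illustration is plain linear resampling (which changes pitch along with speed, unlike the pitch-preserving algorithms real toolkits may use):

```python
import numpy as np

def change_speed(audio, factor):
    """Resample `audio` so it plays `factor` times faster; factor > 1 shortens it.
    Linear interpolation for simplicity (illustrative, not the patent's method)."""
    audio = np.asarray(audio, dtype=float)
    n_out = int(round(len(audio) / factor))
    # Fractional sample positions in the original signal for each output sample.
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio)
```

Applying factors such as 0.9, 1.0, and 1.1 to each training utterance is a common way such speed-based augmentation is done in practice.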
In an alternative embodiment, training the second model to obtain the first model using the first set of data including the first audio data and the other sets of data included in the plurality of sets of data includes: dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion; and training the second model by utilizing each divided data set to obtain the first model.
Alternatively, in this embodiment, the predetermined ratio may include, but is not limited to, 7:2:1; in other words, the multiple sets of data may be divided into a training set, a test set, and a verification set at a ratio of 7:2:1. Training the second model with the divided data sets to obtain the first model may be performed using HMM (hidden Markov model) + GMM (Gaussian mixture model) or HMM-DNN (deep neural network) training, so as to complete speech recognition under the mixed noise data.
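The 7:2:1 division described above can be sketched in a few lines. The helper below is an illustrative sketch (names and the fixed seed are assumptions, not from the patent):

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle `samples` and split into train/test/verification subsets
    according to `ratios` (a sketch of the 7:2:1 division described above)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for repeatability
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verification = shuffled[n_train + n_test:]
    return train, test, verification
```

Shuffling before splitting matters here: without it, noise recorded at one location or time period could end up entirely in one subset.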
The above-mentioned specific division ratio of the multiple groups of data and the method for performing speech recognition are only an optional example, and the present invention is not limited to this.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone; in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied as a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
The following describes in detail, as an alternative embodiment, the collection of noise audio in a real restaurant environment:
Step 1: audio collection module. A waiter wears a recording device while serving customers, and the device collects the waiter's speech audio (corresponding to the aforementioned original audio data).
Step 2: prepare audio data labeled in a quiet environment, which can be purchased or obtained free of charge; the audio content is independent of the restaurant scene (also corresponding to the aforementioned original audio data).
Step 3: device selection module. A high-fidelity recording pen (corresponding to the aforementioned recording device) is distributed to the chain restaurants at fixed times and fixed locations.
Step 4: storefront selection module. The chain restaurants cover a wide area, and collecting from all of them would take considerable time. To collect a large amount of noise, stores with both high and low seating rates during the peak period are selected, and the recording devices chosen in step 3 are placed in those restaurants. (The restaurant selection strategy is not unique; for example, only restaurants with a high seating rate may be selected.)
Step 5: placement module. Based on the restaurants selected in step 4, the recording devices are placed on a table of the restaurant, on a wall of the restaurant, near the restaurant's monitoring camera, and so on. Noise at different positions is collected according to the different placements (corresponding to the aforementioned predetermined locations).
Step 6: noise data sampling module. An off-peak period and a peak period are selected each day, and 1 hour of noise data is recorded in each. (The time-period selection strategy is not unique; this corresponds to the aforementioned predetermined time period.)
Step 7: mix the noise data collected in the above steps into the labeled training data produced in steps 1 and 2 at different signal-to-noise ratios, reverberations, audio speeds, and the like. (The noise mixing method is not unique; this corresponds to the aforementioned predetermined mixing parameters.)
Step 8: divide the training data mixed with noise in step 7 into a training set, a test set, and a verification set at a ratio of 7:2:1 (corresponding to the aforementioned predetermined ratio; the division ratio is not unique). Train using HMM (hidden Markov model) + GMM (Gaussian mixture model) or HMM-DNN (deep neural network) to complete speech recognition under the mixed noise data (a general speech recognition method, corresponding to the aforementioned training of the second model).
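The patent names HMM + GMM training without further detail. As a hedged illustration of the GMM half only, a one-dimensional expectation-maximization fit in numpy might look like the textbook sketch below; real ASR toolkits fit multivariate GMMs over acoustic features per HMM state, which this does not attempt:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture to samples `x` with plain EM.
    Generic textbook algorithm, not the patent's implementation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Deterministic init: spread initial means over data quantiles.
    means = np.percentile(x, np.linspace(20, 80, k))
    vars_ = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample.
        diff = x[:, None] - means[None, :]
        log_p = -0.5 * (np.log(2 * np.pi * vars_) + diff ** 2 / vars_) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        vars_ = (resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / nk
        vars_ = np.maximum(vars_, 1e-6)  # variance floor
    return weights, means, vars_
```

In an HMM-GMM recognizer, one such mixture (over feature vectors, not raw samples) models the emission distribution of each HMM state; training on noise-mixed data shifts these distributions toward what is observed in the real environment.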
In this embodiment, a speech recognition apparatus is further provided, and the speech recognition apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention; as shown in fig. 3, the apparatus includes:
an obtaining module 32, configured to obtain target audio data;
a processing module 34, configured to recognize the target audio data by using a first model and determine a recognized voice corresponding to the target audio data, where the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
In an optional embodiment, the apparatus is further configured to: acquire the noise data and original audio data emitted by a target object before the target audio data is recognized using the first model and the recognized voice corresponding to the target audio data is determined; mix the noise data and the original audio data to obtain the first audio data; and train a second model using the first set of data comprising the first audio data and the other sets of data included in the multiple sets of data, to obtain the first model.
In an alternative embodiment, the obtaining module 32 is configured to obtain the noise data by:
the noise data is acquired for a predetermined period of time using one or more sound recording apparatuses installed at predetermined locations.
In an alternative embodiment, the predetermined location comprises at least one of: a table of a restaurant, a wall of the restaurant, and a position a predetermined distance from a monitoring camera of the restaurant; and/or the predetermined time period comprises at least one of: an off-peak period of the restaurant's seating rate and a peak period of the restaurant's seating rate.
In an alternative embodiment, the apparatus mixes the noise data and the original audio data to obtain the first audio data by: mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, and a predetermined audio speed.
In an optional embodiment, the apparatus is configured to train a second model using a first set of data including the first audio data and other sets of data included in the plurality of sets of data to obtain the first model by: dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion; and training the second model by utilizing each divided data set to obtain the first model.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring target audio data;
S2, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, in this embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring target audio data;
S2, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition method, comprising:
acquiring target audio data;
recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data comprises audio data and a voice corresponding to the audio data, the multiple sets of training data comprise a first set of data, and noise data is mixed into first audio data comprised in the first set of data.
2. The method of claim 1, wherein prior to identifying the target audio data using the first model and determining the identified speech corresponding to the target audio data, the method further comprises:
acquiring the noise data and original audio data uttered by a target object;
mixing the noise data and the original audio data to obtain the first audio data;
and training a second model by using the first set of data comprising the first audio data and the other sets of data comprised in the plurality of sets of data, to obtain the first model.
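Purely as an illustration, the three steps of claim 2 might be sketched as follows; the `mix_noise` helper, the placeholder arrays standing in for recorded speech and restaurant noise, and the two-set corpus are hypothetical, not part of the claimed method:

```python
import numpy as np

def mix_noise(original: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Overlay recorded noise onto clean speech (hypothetical helper)."""
    noise = np.resize(noise, original.shape)  # loop or trim the noise to the speech length
    return original + noise

# Stand-ins for acquired data: each set pairs audio data with its corresponding voice.
clean_sets = [(np.zeros(16000), "hello"), (np.zeros(16000), "world")]
noise = np.ones(8000) * 0.01  # stand-in for noise recorded in the environment

# First set of data: its first audio data has noise data mixed therein;
# the other sets are used unchanged, and together they train the second model.
first_audio, first_text = clean_sets[0]
training_data = [(mix_noise(first_audio, noise), first_text)] + clean_sets[1:]
# train(second_model, training_data) would then yield the first model.
```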
3. The method of claim 2, wherein obtaining the noise data comprises:
the noise data is acquired for a predetermined period of time using one or more sound recording apparatuses installed at predetermined locations.
4. The method of claim 3,
the predetermined location comprises at least one of: a table of the restaurant, a wall of the restaurant, and a position at a predetermined distance from a monitoring camera of the restaurant; and/or
the predetermined period of time comprises at least one of: an off-peak period of the restaurant's occupancy rate and a peak period of the restaurant's occupancy rate.
5. The method of claim 2, wherein mixing the noise data and the original audio data to obtain the first audio data comprises:
mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, a predetermined audio speed.
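For the predetermined signal-to-noise ratio of claim 5, one possible sketch scales the noise so the mixture reaches the requested SNR in decibels; the function name and placeholder signals are assumptions, and the predetermined reverberation and audio speed parameters are left out:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a predetermined signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)  # match the speech length
    p_speech = np.mean(speech ** 2)         # average signal power
    p_noise = np.mean(noise ** 2)           # average noise power
    # Choose a scale so that p_speech / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for original audio data
noise = rng.standard_normal(16000)   # placeholder for recorded noise data
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```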
6. The method of claim 2, wherein training a second model to obtain the first model using a first set of data comprising the first audio data and other sets of data included in the plurality of sets of data comprises:
dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion;
and training the second model by utilizing each divided data set to obtain the first model.
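The division in claim 6 might be sketched as follows; the 80/10/10 proportion and the fixed shuffle seed are assumed examples, not values from the claims:

```python
import random

def split_dataset(data, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the groups of data, then divide them into training, test and
    verification sets according to a preset proportion."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(data)
    random.Random(seed).shuffle(items)       # deterministic shuffle for the example
    n_train = int(len(items) * ratios[0])
    n_test = int(len(items) * ratios[1])
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    verify = items[n_train + n_test:]        # remainder forms the verification set
    return train, test, verify

train, test, verify = split_dataset(range(100))
```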
7. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring target audio data;
a processing module, configured to recognize the target audio data by using a first model and determine a recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning by using multiple sets of training data, and each set of data in the multiple sets of training data comprises: audio data and a voice corresponding to the audio data; the multiple sets of training data comprise a first set of data, and first audio data comprised in the first set of data has noise data mixed therein.
8. The apparatus of claim 7, wherein the apparatus is further configured to:
acquiring the noise data and original audio data uttered by a target object, before identifying the target audio data by using the first model and determining the recognized voice corresponding to the target audio data;
mixing the noise data and the original audio data to obtain the first audio data;
and training a second model by using the first set of data comprising the first audio data and the other sets of data comprised in the plurality of sets of data, to obtain the first model.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
CN201911399924.2A 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus Pending CN111081222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911399924.2A CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911399924.2A CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
CN111081222A 2020-04-28

Family

ID=70320017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911399924.2A Pending CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Country Status (1)

Country Link
CN (1) CN111081222A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710490A (en) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 Method and device for compensating noise for voice assessment
CN106297819A (en) * 2015-05-25 2017-01-04 国家计算机网络与信息安全管理中心 A kind of noise cancellation method being applied to Speaker Identification
CN108022591A (en) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 The processing method of speech recognition, device and electronic equipment in environment inside car
CN108242234A (en) * 2018-01-10 2018-07-03 腾讯科技(深圳)有限公司 Speech recognition modeling generation method and its equipment, storage medium, electronic equipment
CN108922517A (en) * 2018-07-03 2018-11-30 百度在线网络技术(北京)有限公司 The method, apparatus and storage medium of training blind source separating model
CN109616100A (en) * 2019-01-03 2019-04-12 百度在线网络技术(北京)有限公司 The generation method and its device of speech recognition modeling
CN110473528A (en) * 2019-08-22 2019-11-19 北京明略软件***有限公司 Audio recognition method and device, storage medium and electronic device
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110600014A (en) * 2019-09-19 2019-12-20 深圳酷派技术有限公司 Model training method and device, storage medium and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883159A (en) * 2020-08-05 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device
WO2022115267A1 (en) * 2020-11-24 2022-06-02 Google Llc Speech personalization and federated training using real world noise
US11741944B2 (en) 2020-11-24 2023-08-29 Google Llc Speech personalization and federated training using real world noise

Similar Documents

Publication Publication Date Title
CN103456301B (en) A kind of scene recognition method and device and mobile terminal based on ambient sound
CN113168836B (en) Computer system, voice recognition method and program product
CN110544469B (en) Training method and device of voice recognition model, storage medium and electronic device
CN110265052B (en) Signal-to-noise ratio determining method and device for radio equipment, storage medium and electronic device
CN109637525B (en) Method and apparatus for generating an on-board acoustic model
CN106022826A (en) Cheating user recognition method and system in webcast platform
CN110084616A (en) Intelligence pays a return visit method, apparatus, computer installation and storage medium
CN111798852A (en) Voice wake-up recognition performance test method, device and system and terminal equipment
CN107179995A (en) A kind of performance test methods of application program of computer network
CN109657038A (en) The method for digging, device and electronic equipment of a kind of question and answer to data
CN111081222A (en) Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus
CN103812683A (en) User behavior data processing method, device and system
CN110751960B (en) Method and device for determining noise data
CN111312286A (en) Age identification method, age identification device, age identification equipment and computer readable storage medium
CN106157972A (en) Use the method and apparatus that local binary pattern carries out acoustics situation identification
CN110164474A (en) Voice wakes up automated testing method and system
CN111311774A (en) Sign-in method and system based on voice recognition
CN112231748A (en) Desensitization processing method and apparatus, storage medium, and electronic apparatus
CN112687140A (en) Assessment automatic scoring method, device and system
CN109215659A (en) Processing method, the device and system of voice data
CN109165570A (en) Method and apparatus for generating information
CN106341694B (en) A kind of method and apparatus obtaining live streaming operation data
CN107403629A (en) Far field pickup method of evaluating performance and system, electronic equipment
CN111210810A (en) Model training method and device
CN110362470A (en) Test data collection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428