CN111081222A - Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus - Google Patents


Info

Publication number
CN111081222A
Authority
CN
China
Prior art keywords
data
audio data
model
training
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911399924.2A
Other languages
Chinese (zh)
Inventor
刘洋 (Liu Yang)
唐大闰 (Tang Darun)
Current Assignee
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd filed Critical Beijing Mininglamp Software System Co ltd
Priority to CN201911399924.2A priority Critical patent/CN111081222A/en
Publication of CN111081222A publication Critical patent/CN111081222A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention provides a speech recognition method, a speech recognition apparatus, a storage medium, and an electronic apparatus. By acquiring target audio data and recognizing it with a first model trained on audio data into which noise data has been mixed, the invention addresses the problem in the related art that speech recognition is difficult to perform effectively in a real environment, achieving the technical effects of making the speech model better suited to the real environment and improving the accuracy of speech recognition in that environment.

Description

Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus
Technical Field
The present invention relates to the field of communications, and in particular, to a voice recognition method, apparatus, storage medium, and electronic apparatus.
Background
The quality of the training data plays a crucial role in a speech recognition model. Recognition accuracy can be guaranteed in a quiet environment, but in everyday real environments the model is disturbed by ambient noise, which greatly reduces accuracy. Existing labeled training data cannot effectively simulate the real environment, so speech recognition in the real environment is difficult to perform effectively.
No effective solution has yet been proposed in the related art for the problem that speech recognition is difficult to perform effectively in a real environment.
Disclosure of Invention
Embodiments of the present invention provide a speech recognition method, apparatus, storage medium, and electronic apparatus, so as to at least solve the problem in the related art that it is difficult to perform speech recognition effectively in a real environment.
According to an embodiment of the present invention, there is provided a speech recognition method including: acquiring target audio data; and recognizing the target audio data by using a first model and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
According to another embodiment of the present invention, there is provided a speech recognition apparatus including: an acquisition module configured to acquire target audio data; and a processing module configured to recognize the target audio data by using a first model and determine the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, target audio data is recognized by a first model trained on audio data into which noise data has been mixed: the mixed noise simulates data of the real environment and is added to the audio data used for training, and the recognized voice corresponding to the target audio data is determined. This solves the problem in the related art that speech recognition is difficult to perform effectively in a real environment, and achieves the technical effects of making the speech model better suited to the real environment and improving recognition accuracy in that environment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic view of an application scenario of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a method of speech recognition according to an embodiment of the present invention;
fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the present application may be executed in a mobile terminal, a computer terminal or a similar computing device. Taking the example of the operation on the mobile terminal, fig. 1 is a hardware structure block diagram of the mobile terminal of a speech recognition method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as a computer program corresponding to the speech recognition method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer programs stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In this embodiment, a speech recognition method operating in a terminal or a server is provided. Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention; as shown in fig. 2, the flow includes the following steps:
Step S202, acquiring target audio data;
Step S204, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, in this embodiment, the audio data used for training may include, but is not limited to, pre-collected audio data, for example, voice data collected by a sound collection device; the collected audio data is labeled and used as training data from which the first model is obtained through machine learning. The training data may also include, but is not limited to, labeled audio data recorded in a quiet environment, which may be purchased or obtained free of charge and is likewise used as training data for the first model. The different audio data may be used as the same or different sets of training data from which the first model is obtained through machine learning training.
Alternatively, in the present embodiment, the first audio data is mixed with noise data. The noise data may include, but is not limited to, artificially synthesized noise data or noise data collected by a sound collection device from a real environment with heavy foot traffic and abundant noise, such as a restaurant, a transportation station, a school, or a hospital.
Through the above steps, the target audio data is recognized by a first model trained on audio data into which noise data has been mixed: the mixed noise simulates data of the real environment and is added to the audio data used for training, and the recognized voice corresponding to the target audio data is determined. This solves the problem in the related art that speech recognition is difficult to perform effectively in a real environment, and achieves the technical effects of making the speech model better suited to the real environment and improving recognition accuracy in that environment.
In an optional embodiment, before the target audio data is recognized using the first model and the recognized speech corresponding to the target audio data is determined, the method further comprises: acquiring the noise data and original audio data emitted by a target object; mixing the noise data and the original audio data to obtain the first audio data; and training a second model using the first set of data comprising the first audio data and the other sets of data included in the multiple sets of data, to obtain the first model.
Alternatively, in the present embodiment, the target object may include, but is not limited to, a target object capable of making voice information, such as attendants and customers in a restaurant, doctors and patients in a hospital, teachers and students in a school, and the like. The target object may also include, but is not limited to, a voice playing device, such as a speaker, a sound device, an earphone, etc., which are just examples, and the present invention is not limited to the target object and the environment where the target object is located.
Optionally, in this embodiment, the mixing of the noise data and the original audio data may be performed in a preset manner, for example, the mixing of the audio data is performed in a manner of different signal-to-noise ratios, reverberation, audio speeds, and the like, so as to obtain the first audio data for training.
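The patent does not give a formula for the signal-to-noise-ratio mixing it describes. A minimal sketch of mixing noise into clean audio at a target SNR, assuming equal-rate numpy sample arrays (the helper name `mix_at_snr` is illustrative, not from the patent), might look like:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Hypothetical helper for illustration only."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Tile or truncate the noise recording to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Noise power needed so that 10*log10(p_speech / p_noise') == snr_db.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
```

Sweeping `snr_db` over several values (e.g. 0, 5, 10, 20 dB) for each clean utterance is one common way to produce the "different signal-to-noise ratios" mentioned above.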
In an alternative embodiment, obtaining the noise data comprises:
Optionally, in this embodiment, the recording device may include, but is not limited to, a hi-fi recording pen, a recording phone, a recording card, a recording box, a recorder, a recording system, and the like.
In an alternative embodiment, the predetermined location comprises at least one of: a table of a restaurant, a wall of the restaurant, and a position a predetermined distance from a monitoring camera of the restaurant; and/or the predetermined time period comprises at least one of: an off-peak period of the restaurant's seating rate and a peak period of the restaurant's seating rate.
Optionally, in this embodiment, the length of the predetermined time period may be set according to actual conditions. For example, if the restaurant's peak period lasts 1 hour, the predetermined time period is set to 1 hour; if it lasts 2 hours, the predetermined time period is set to 2 hours. This is just an example; the length of the predetermined time period may be adjusted as needed, and the present disclosure imposes no limitation on it.
In an alternative embodiment, mixing the noise data and the original audio data to obtain the first audio data comprises: mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, a predetermined audio speed.
Optionally, in this embodiment, the predetermined signal-to-noise ratio, the predetermined reverberation, and the predetermined audio speed may be set according to actual needs, and the specific values of the predetermined mixing parameter are not limited in any way in the present invention.
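Of the predetermined mixing parameters, audio-speed perturbation is the simplest to sketch. The patent does not specify a method; one assumption-laden illustration is plain linear resampling (which changes pitch along with speed, unlike the pitch-preserving algorithms real toolkits may use):

```python
import numpy as np

def change_speed(audio, factor):
    """Resample `audio` so it plays `factor` times faster; factor > 1 shortens it.
    Linear interpolation for simplicity (illustrative, not the patent's method)."""
    audio = np.asarray(audio, dtype=float)
    n_out = int(round(len(audio) / factor))
    # Fractional sample positions in the original signal for each output sample.
    positions = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(positions, np.arange(len(audio)), audio)
```

Applying factors such as 0.9, 1.0, and 1.1 to each training utterance is a common way such speed-based augmentation is done in practice.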
In an alternative embodiment, training the second model to obtain the first model using the first set of data including the first audio data and the other sets of data included in the plurality of sets of data includes: dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion; and training the second model by utilizing each divided data set to obtain the first model.
Alternatively, in this embodiment, the predetermined ratio may include, but is not limited to, 7:2:1; in other words, the multiple sets of data may be divided into a training set, a test set, and a verification set at a ratio of 7:2:1. Training the second model with the divided data sets to obtain the first model may be performed using HMM (hidden Markov model) + GMM (Gaussian mixture model) or HMM-DNN (deep neural network) training, so as to complete speech recognition under the mixed noise data.
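The 7:2:1 division described above can be sketched in a few lines. The helper below is an illustrative sketch (names and the fixed seed are assumptions, not from the patent):

```python
import random

def split_dataset(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle `samples` and split into train/test/verification subsets
    according to `ratios` (a sketch of the 7:2:1 division described above)."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for repeatability
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_test = round(n * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    verification = shuffled[n_train + n_test:]
    return train, test, verification
```

Shuffling before splitting matters here: without it, noise recorded at one location or time period could end up entirely in one subset.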
The above-mentioned specific division ratio of the multiple groups of data and the method for performing speech recognition are only an optional example, and the present invention is not limited to this.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone; in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present invention may be embodied as a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and including instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the methods of the embodiments of the present invention.
The following describes in detail, as an alternative embodiment, the collection of noise audio in a real restaurant environment:
Step 1: audio collection module. A waiter wears a recording device while serving customers, and the device collects the waiter's speech audio (corresponding to the aforementioned original audio data).
Step 2: prepare audio data labeled in a quiet environment, which can be purchased or obtained free of charge; the audio content is independent of the restaurant scene (also corresponding to the aforementioned original audio data).
Step 3: device selection module. A high-fidelity recording pen (corresponding to the aforementioned recording device) is distributed to the chain restaurants at fixed times and fixed locations.
Step 4: storefront selection module. The chain restaurants cover a wide area, and collecting from all of them would take considerable time. To collect a large amount of noise, stores with both high and low seating rates during the peak period are selected, and the recording devices chosen in step 3 are placed in those restaurants. (The restaurant selection strategy is not unique; for example, only restaurants with a high seating rate may be selected.)
Step 5: placement module. Based on the restaurants selected in step 4, the recording devices are placed on a table of the restaurant, on a wall of the restaurant, near the restaurant's monitoring camera, and so on. Noise at different positions is collected according to the different placements (corresponding to the aforementioned predetermined locations).
Step 6: noise data sampling module. An off-peak period and a peak period are selected each day, and 1 hour of noise data is recorded in each. (The time-period selection strategy is not unique; this corresponds to the aforementioned predetermined time period.)
Step 7: mix the noise data collected in the above steps into the labeled training data produced in steps 1 and 2 at different signal-to-noise ratios, reverberations, audio speeds, and the like. (The noise mixing method is not unique; this corresponds to the aforementioned predetermined mixing parameters.)
Step 8: divide the training data mixed with noise in step 7 into a training set, a test set, and a verification set at a ratio of 7:2:1 (corresponding to the aforementioned predetermined ratio; the division ratio is not unique). Train using HMM (hidden Markov model) + GMM (Gaussian mixture model) or HMM-DNN (deep neural network) to complete speech recognition under the mixed noise data (a general speech recognition method, corresponding to the aforementioned training of the second model).
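The patent names HMM + GMM training without further detail. As a hedged illustration of the GMM half only, a one-dimensional expectation-maximization fit in numpy might look like the textbook sketch below; real ASR toolkits fit multivariate GMMs over acoustic features per HMM state, which this does not attempt:

```python
import numpy as np

def fit_gmm_1d(x, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture to samples `x` with plain EM.
    Generic textbook algorithm, not the patent's implementation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Deterministic init: spread initial means over data quantiles.
    means = np.percentile(x, np.linspace(20, 80, k))
    vars_ = np.full(k, np.var(x))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each sample.
        diff = x[:, None] - means[None, :]
        log_p = -0.5 * (np.log(2 * np.pi * vars_) + diff ** 2 / vars_) + np.log(weights)
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp * x[:, None]).sum(axis=0) / nk
        vars_ = (resp * (x[:, None] - means[None, :]) ** 2).sum(axis=0) / nk
        vars_ = np.maximum(vars_, 1e-6)  # variance floor
    return weights, means, vars_
```

In an HMM-GMM recognizer, one such mixture (over feature vectors, not raw samples) models the emission distribution of each HMM state; training on noise-mixed data shifts these distributions toward what is observed in the real environment.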
In this embodiment, a speech recognition apparatus is further provided, and the speech recognition apparatus is used to implement the foregoing embodiments and preferred embodiments, which have already been described and are not described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention; as shown in fig. 3, the apparatus includes:
an obtaining module 32, configured to obtain target audio data;
a processing module 34, configured to recognize the target audio data by using a first model and determine a recognized voice corresponding to the target audio data, where the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
In an optional embodiment, the apparatus is further configured to: acquire the noise data and original audio data emitted by a target object before the target audio data is recognized using the first model and the recognized voice corresponding to the target audio data is determined; mix the noise data and the original audio data to obtain the first audio data; and train a second model using the first set of data comprising the first audio data and the other sets of data included in the multiple sets of data, to obtain the first model.
In an alternative embodiment, the obtaining module 32 is configured to obtain the noise data by:
the noise data is acquired for a predetermined period of time using one or more sound recording apparatuses installed at predetermined locations.
In an alternative embodiment, the predetermined location comprises at least one of: a table of a restaurant, a wall of the restaurant, and a position a predetermined distance from a monitoring camera of the restaurant; and/or the predetermined time period comprises at least one of: an off-peak period of the restaurant's seating rate and a peak period of the restaurant's seating rate.
In an alternative embodiment, the apparatus mixes the noise data and the original audio data to obtain the first audio data by: mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, and a predetermined audio speed.
In an optional embodiment, the apparatus is configured to train a second model using a first set of data including the first audio data and other sets of data included in the plurality of sets of data to obtain the first model by: dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion; and training the second model by utilizing each divided data set to obtain the first model.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
Alternatively, in the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the steps of:
S1, acquiring target audio data;
S2, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, in this embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring target audio data;
S2, recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data includes audio data and a voice corresponding to the audio data, the multiple sets of training data include a first set of data, and noise data is mixed into the first audio data included in the first set of data.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device; in some cases, the steps shown or described may be performed in an order different from that described herein. They may also be fabricated separately as individual integrated circuit modules, or multiple of them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition method, comprising:
acquiring target audio data;
recognizing the target audio data by using a first model, and determining the recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning using multiple sets of training data, each set of the training data comprises audio data and a voice corresponding to the audio data, the multiple sets of training data comprise a first set of data, and noise data is mixed into first audio data comprised in the first set of data.
2. The method of claim 1, wherein prior to identifying the target audio data using the first model and determining the identified speech corresponding to the target audio data, the method further comprises:
acquiring the noise data and original audio data uttered by a target object;
mixing the noise data and the original audio data to obtain the first audio data;
and training a second model by using the first set of data comprising the first audio data and the other sets of data comprised in the plurality of sets of data, to obtain the first model.
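Purely as an illustration, the three steps of claim 2 might be sketched as follows; the `mix_noise` helper, the placeholder arrays standing in for recorded speech and restaurant noise, and the two-set corpus are hypothetical, not part of the claimed method:

```python
import numpy as np

def mix_noise(original: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Overlay recorded noise onto clean speech (hypothetical helper)."""
    noise = np.resize(noise, original.shape)  # loop or trim the noise to the speech length
    return original + noise

# Stand-ins for acquired data: each set pairs audio data with its corresponding voice.
clean_sets = [(np.zeros(16000), "hello"), (np.zeros(16000), "world")]
noise = np.ones(8000) * 0.01  # stand-in for noise recorded in the environment

# First set of data: its first audio data has noise data mixed therein;
# the other sets are used unchanged, and together they train the second model.
first_audio, first_text = clean_sets[0]
training_data = [(mix_noise(first_audio, noise), first_text)] + clean_sets[1:]
# train(second_model, training_data) would then yield the first model.
```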
3. The method of claim 2, wherein obtaining the noise data comprises:
the noise data is acquired for a predetermined period of time using one or more sound recording apparatuses installed at predetermined locations.
4. The method of claim 3,
the predetermined location comprises at least one of: a table of the restaurant, a wall of the restaurant, and a position at a predetermined distance from a monitoring camera of the restaurant; and/or
the predetermined period of time comprises at least one of: an off-peak period of the restaurant's occupancy rate and a peak period of the restaurant's occupancy rate.
5. The method of claim 2, wherein mixing the noise data and the original audio data to obtain the first audio data comprises:
mixing the noise data into the original audio data according to a predetermined mixing parameter to obtain the first audio data, wherein the predetermined mixing parameter comprises at least one of: a predetermined signal-to-noise ratio, a predetermined reverberation, a predetermined audio speed.
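For the predetermined signal-to-noise ratio of claim 5, one possible sketch scales the noise so the mixture reaches the requested SNR in decibels; the function name and placeholder signals are assumptions, and the predetermined reverberation and audio speed parameters are left out:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a predetermined signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)  # match the speech length
    p_speech = np.mean(speech ** 2)         # average signal power
    p_noise = np.mean(noise ** 2)           # average noise power
    # Choose a scale so that p_speech / (scale**2 * p_noise) == 10**(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # placeholder for original audio data
noise = rng.standard_normal(16000)   # placeholder for recorded noise data
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```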
6. The method of claim 2, wherein training a second model to obtain the first model using a first set of data comprising the first audio data and other sets of data included in the plurality of sets of data comprises:
dividing the multiple groups of data into a training set, a test set and a verification set according to a preset proportion;
and training the second model by utilizing each divided data set to obtain the first model.
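The division in claim 6 might be sketched as follows; the 80/10/10 proportion and the fixed shuffle seed are assumed examples, not values from the claims:

```python
import random

def split_dataset(data, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle the groups of data, then divide them into training, test and
    verification sets according to a preset proportion."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    items = list(data)
    random.Random(seed).shuffle(items)       # deterministic shuffle for the example
    n_train = int(len(items) * ratios[0])
    n_test = int(len(items) * ratios[1])
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    verify = items[n_train + n_test:]        # remainder forms the verification set
    return train, test, verify

train, test, verify = split_dataset(range(100))
```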
7. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring target audio data;
a processing module, configured to recognize the target audio data by using a first model and determine a recognized voice corresponding to the target audio data, wherein the first model is trained through machine learning by using multiple sets of training data, and each set of data in the multiple sets of training data comprises: audio data and a voice corresponding to the audio data; the multiple sets of training data comprise a first set of data, and first audio data comprised in the first set of data has noise data mixed therein.
8. The apparatus of claim 7, wherein the apparatus is further configured to:
acquiring the noise data and original audio data uttered by a target object, before identifying the target audio data by using the first model and determining the recognized voice corresponding to the target audio data;
mixing the noise data and the original audio data to obtain the first audio data;
and training a second model by using the first set of data comprising the first audio data and the other sets of data comprised in the plurality of sets of data, to obtain the first model.
9. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 6 when executed.
10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.
CN201911399924.2A 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus Pending CN111081222A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911399924.2A CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911399924.2A CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
CN111081222A 2020-04-28

Family

ID=70320017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911399924.2A Pending CN111081222A (en) 2019-12-30 2019-12-30 Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus

Country Status (1)

Country Link
CN (1) CN111081222A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101710490A (en) * 2009-11-20 2010-05-19 安徽科大讯飞信息科技股份有限公司 Method and device for compensating noise for voice assessment
CN106297819A (en) * 2015-05-25 2017-01-04 国家计算机网络与信息安全管理中心 A kind of noise cancellation method being applied to Speaker Identification
CN108022591A (en) * 2017-12-30 2018-05-11 北京百度网讯科技有限公司 The processing method of speech recognition, device and electronic equipment in environment inside car
CN108242234A (en) * 2018-01-10 2018-07-03 腾讯科技(深圳)有限公司 Speech recognition modeling generation method and its equipment, storage medium, electronic equipment
CN108922517A (en) * 2018-07-03 2018-11-30 百度在线网络技术(北京)有限公司 The method, apparatus and storage medium of training blind source separating model
CN109616100A (en) * 2019-01-03 2019-04-12 百度在线网络技术(北京)有限公司 The generation method and its device of speech recognition modeling
CN110473528A (en) * 2019-08-22 2019-11-19 北京明略软件***有限公司 Audio recognition method and device, storage medium and electronic device
CN110544469A (en) * 2019-09-04 2019-12-06 秒针信息技术有限公司 Training method and device of voice recognition model, storage medium and electronic device
CN110600014A (en) * 2019-09-19 2019-12-20 深圳酷派技术有限公司 Model training method and device, storage medium and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111883159A (en) * 2020-08-05 2020-11-03 龙马智芯(珠海横琴)科技有限公司 Voice processing method and device
WO2022115267A1 (en) * 2020-11-24 2022-06-02 Google Llc Speech personalization and federated training using real world noise
US11741944B2 (en) 2020-11-24 2023-08-29 Google Llc Speech personalization and federated training using real world noise

Similar Documents

Publication Publication Date Title
CN103456301B (en) A kind of scene recognition method and device and mobile terminal based on ambient sound
CN113168836B (en) Computer system, voice recognition method and program product
CN110544469B (en) Training method and device of voice recognition model, storage medium and electronic device
CN110265052B (en) Signal-to-noise ratio determining method and device for radio equipment, storage medium and electronic device
CN109637525B (en) Method and apparatus for generating an on-board acoustic model
CN106022826A (en) Cheating user recognition method and system in webcast platform
CN110084616A (en) Intelligence pays a return visit method, apparatus, computer installation and storage medium
CN111798852A (en) Voice wake-up recognition performance test method, device and system and terminal equipment
CN107179995A (en) A kind of performance test methods of application program of computer network
CN109657038A (en) The method for digging, device and electronic equipment of a kind of question and answer to data
CN111081222A (en) Speech recognition method, speech recognition apparatus, storage medium, and electronic apparatus
CN103812683A (en) User behavior data processing method, device and system
CN110751960B (en) Method and device for determining noise data
CN111312286A (en) Age identification method, age identification device, age identification equipment and computer readable storage medium
CN106157972A (en) Use the method and apparatus that local binary pattern carries out acoustics situation identification
CN110164474A (en) Voice wakes up automated testing method and system
CN111311774A (en) Sign-in method and system based on voice recognition
CN112231748A (en) Desensitization processing method and apparatus, storage medium, and electronic apparatus
CN112687140A (en) Assessment automatic scoring method, device and system
CN109215659A (en) Processing method, the device and system of voice data
CN109165570A (en) Method and apparatus for generating information
CN106341694B (en) A kind of method and apparatus obtaining live streaming operation data
CN107403629A (en) Far field pickup method of evaluating performance and system, electronic equipment
CN111210810A (en) Model training method and device
CN110362470A (en) Test data collection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200428