WO2020168727A1 - Voice recognition method, device, storage medium, and air conditioner

Voice recognition method, device, storage medium, and air conditioner

Info

Publication number
WO2020168727A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice, voice data, far field
Application number
PCT/CN2019/110107
Other languages
English (en)
French (fr)
Inventor
李明杰
宋德超
贾巨涛
吴伟
谢俊杰
Original Assignee
Gree Electric Appliances, Inc. of Zhuhai (珠海格力电器股份有限公司)
Application filed by Gree Electric Appliances, Inc. of Zhuhai (珠海格力电器股份有限公司)
Priority to EP19915991.4A (patent EP3923273B1)
Priority to ES19915991T (patent ES2953525T3)
Publication of WO2020168727A1
Priority to US17/407,443 (patent US11830479B2)

Classifications

    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L15/063 Creation of reference templates; training of speech recognition systems
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L2015/088 Word spotting
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; beamforming
    • G01S3/80 Direction-finders using ultrasonic, sonic or infrasonic waves
    • G01S13/86 Combinations of radar systems with non-radar systems, e.g. sonar, direction finder
    • G01S13/88 Radar or analogous systems specially adapted for specific applications
    • G06N3/02 Neural networks
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/08 Learning methods
    • F24F11/56 Control or safety arrangements characterised by user interfaces or communication; remote control

Definitions

  • This application belongs to the technical field of voice control, and specifically relates to a voice recognition method, device, storage medium, and air conditioner, and more particularly to a far-field voice recognition method, device, storage medium, and air conditioner based on microwave radar.
  • Voice recognition technology is currently a relatively mature human-computer interaction method. From early near-field voice recognition on handheld devices, such as Siri and various voice assistants, the application of voice recognition has been extended to smart hardware, home appliances, robots, and other fields. However, new human-computer interaction methods place more stringent requirements on hardware, software, and algorithms; far-field speech recognition technology in particular faces huge challenges.
  • Voice-controlled air conditioners rely on far-field voice recognition technology.
  • Voice interaction between humans and machines here mainly refers to smart hardware, robots, and the like.
  • In near-field recognition, the quality of the voice signal is relatively high because the speaker is close to the microphone.
  • A touch screen can assist the interaction, so the interaction flow can be relatively simple.
  • VAD (Voice Activity Detection) is used to detect the start and end of speech in the audio stream.
  • The training data of current speech recognition algorithms mainly consists of speech collected on mobile phones, and therefore only applies to near-field recognition.
  • Complex far-field speech data contains a large amount of reverberation and noise.
  • Related technologies mainly use deep learning methods or microphone array methods to remove reverberation and noise.
  • Because the position and direction of the sound source cannot be sensed at the same time, only general methods can be used, such as a front-end microphone array combined with a back-end neural network algorithm to process the voice data; these suffer from a low far-field voice recognition rate, long response times, and a poor noise reduction effect.
  • the use of deep learning methods or microphone array methods to remove reverberation and noise may include:
  • Microphone array method: mainly improves the robustness of sound-wave direction estimation in reverberant scenes. After the beam direction is detected by integrating multiple microphones, beamforming technology is used to suppress the surrounding non-stationary noise.
  • However, the number of microphones and the distance between them are limited, so the range of directions that can be distinguished is small.
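  • As an aside, the beamforming suppression described above can be sketched with a minimal delay-and-sum example; the array geometry, delays, and signal values below are hypothetical, not taken from the patent:

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Align each microphone channel by its integer sample delay, then average.

    signals: (n_mics, n_samples) array of time-domain signals.
    delays_samples: per-microphone integer delays compensating the
    path-length difference toward the look direction.
    """
    n_mics, n_samples = signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, -d)  # advance the channel by d samples
    return out / n_mics

# Hypothetical 4-mic array: the same waveform arrives with known delays,
# plus independent noise on each channel.
rng = np.random.default_rng(0)
t = np.arange(1024)
source = np.sin(2 * np.pi * t / 64)
delays = [0, 3, 6, 9]
signals = np.stack([np.roll(source, d) + 0.5 * rng.standard_normal(1024)
                    for d in delays])

beamformed = delay_and_sum(signals, delays)
# Averaging the coherent signal over incoherent noise lowers noise power
# by roughly a factor of n_mics.
noise_power_single = np.mean((signals[0] - source) ** 2)
noise_power_beam = np.mean((beamformed - source) ** 2)
```

  • The design point is that the wanted signal adds coherently across channels while the noise does not, which is why more microphones (and wider spacing) sharpen the distinguishable direction range noted above.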
  • Deep learning method: filters and cleans the reverberant and noisy speech data by means of signal processing, and uses a DNN or RNN in place of beamforming technology to achieve speech enhancement. However, the processing effect is poor; especially in noisy environments, far-field speech recognition performance is not good.
  • The purpose of this application is to provide a voice recognition method, device, storage medium, and air conditioner that address the above-mentioned defects, so as to solve the problem that removing reverberation and noise from far-field voice data with deep learning methods or microphone array methods yields a poor far-field recognition effect, thereby improving the effect of far-field speech recognition.
  • The present application provides a voice recognition method, including: acquiring first voice data; adjusting a collection state of second voice data according to the first voice data, and acquiring the second voice data based on the adjusted collection state; and using a preset far-field voice recognition model to perform far-field voice recognition on the acquired second voice data to obtain semantic information corresponding to the acquired second voice data.
  • The first voice data includes: a voice wake-up word, which is voice data used to wake up a voice device; and/or the second voice data includes: a voice command, which is voice data used to control the voice device; and/or the operation of acquiring the first voice data, the operation of adjusting the collection state of the second voice data according to the first voice data, and the operation of acquiring the second voice data based on the adjusted collection state are performed on the local side of the voice device; and/or, for the operation of performing far-field voice recognition on the acquired second voice data using the preset far-field voice recognition model, the local side of the voice device receives feedback information processed by the cloud.
  • Acquiring the first voice data includes: acquiring the first voice data collected by a voice collection device; and/or acquiring the second voice data includes: acquiring the second voice data collected by the voice collection device after the collection state is adjusted; wherein the voice collection device includes: a microphone array in which more than one microphone is provided for collecting voice data in more than one direction.
  • Adjusting the collection state of the second voice data according to the first voice data includes: determining the location information of the sound source that sends the first voice data; and enhancing the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data, and/or suppressing the collection strength of the second voice data at locations other than the location information by the voice collection device that collected the first voice data.
  • Determining the location information of the sound source that sends the first voice data includes: using the voice collection device to determine the direction of the sound source that sends the first voice data; and using a position locating device to locate the sound source based on that direction to obtain the position information of the sound source; wherein the position locating device includes: a microwave radar module; and the position information includes: distance and direction. And/or, enhancing the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data includes: when the voice collection device includes a microphone array, turning on the microphones at the location information in the microphone array, and/or increasing the number of microphones turned on at the location information.
  • Using the preset far-field voice recognition model to perform far-field voice recognition on the acquired second voice data includes: preprocessing the collected second voice data to obtain voice information; and then using the preset far-field speech recognition model to perform far-field speech recognition processing on the preprocessed voice information; wherein the far-field speech recognition model includes: a far-field acoustic model obtained by deep learning training based on an LSTM algorithm.
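  • The preprocessing step is not specified in detail in the application; as a hedged illustration only, a minimal framing plus log-energy feature extractor might look like the sketch below (frame sizes assume 16 kHz audio; real systems typically use filterbank or MFCC features):

```python
import numpy as np

def frame_log_energy(wave, frame_len=400, hop=160):
    """Minimal preprocessing sketch: split the waveform into overlapping
    frames (25 ms window, 10 ms hop at 16 kHz) and compute per-frame
    log-energy features."""
    n = 1 + (len(wave) - frame_len) // hop
    feats = np.empty(n)
    for i in range(n):
        frame = wave[i * hop : i * hop + frame_len]
        feats[i] = np.log(np.sum(frame ** 2) + 1e-10)  # avoid log(0)
    return feats

rng = np.random.default_rng(3)
wave = rng.standard_normal(16000)   # one second of hypothetical audio
feats = frame_log_energy(wave)      # one feature per 10 ms frame
```

  • Frame-level features like these are what a recurrent acoustic model consumes one time step at a time.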
  • The method further includes: collecting voice data and its sound source data; and, after preprocessing the voice data and its sound source data, training an LSTM model to obtain an LSTM-based far-field speech recognition model.
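  • The LSTM acoustic model itself is not detailed in the application; the numpy sketch below only illustrates the recurrent building block assumed above, a single LSTM cell step (all dimensions and weights are made up):

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step: input, forget, and output gates plus a
    candidate cell state, computed from input x and previous state."""
    z = W @ x + U @ h_prev + b              # (4*hidden,) pre-activations
    hid = h_prev.shape[0]
    i = 1 / (1 + np.exp(-z[:hid]))          # input gate
    f = 1 / (1 + np.exp(-z[hid:2*hid]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*hid:3*hid]))   # output gate
    g = np.tanh(z[3*hid:])                  # candidate cell state
    c = f * c_prev + i * g                  # long-term memory update
    h = o * np.tanh(c)                      # emitted hidden state
    return h, c

rng = np.random.default_rng(1)
feat, hid = 40, 16   # e.g. 40 acoustic features per frame (hypothetical)
W = rng.standard_normal((4 * hid, feat)) * 0.1
U = rng.standard_normal((4 * hid, hid)) * 0.1
b = np.zeros(4 * hid)

# Run a short sequence of acoustic frames through the cell.
h = np.zeros(hid)
c = np.zeros(hid)
for frame in rng.standard_normal((10, feat)):
    h, c = lstm_step(frame, h, c, W, U, b)
```

  • The forget-gate path is what lets the model carry context across long audio, which is why the application favors LSTM for long audio data.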
  • Another aspect provides a voice recognition device, which includes: an acquisition unit configured to acquire first voice data, the acquisition unit being further configured to adjust the collection state of second voice data according to the first voice data and to acquire the second voice data based on the adjusted collection state; and a recognition unit configured to perform far-field voice recognition on the acquired second voice data using a preset far-field voice recognition model, so as to obtain semantic information corresponding to the acquired second voice data.
  • The first voice data includes: a voice wake-up word, which is voice data used to wake up a voice device; and/or the second voice data includes: a voice command, which is voice data used to control the voice device; and/or the operation of acquiring the first voice data, the operation of adjusting the collection state of the second voice data according to the first voice data, and the operation of acquiring the second voice data based on the adjusted collection state are performed on the local side of the voice device; and/or, for the operation of performing far-field voice recognition on the acquired second voice data using the preset far-field voice recognition model, the local side of the voice device receives feedback information processed by the cloud.
  • The acquisition unit acquiring the first voice data includes: acquiring the first voice data collected by the voice collection device; and/or the acquisition unit acquiring the second voice data includes: acquiring the second voice data collected by the voice collection device after the collection state is adjusted; wherein the voice collection device includes: a microphone array in which more than one microphone is provided for collecting voice data in more than one direction.
  • The acquisition unit adjusting the collection state of the second voice data according to the first voice data includes: determining the location information of the sound source sending the first voice data; and enhancing the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data, and/or suppressing the collection strength of the second voice data at locations other than the location information by the voice collection device that collected the first voice data.
  • The acquisition unit determining the location information of the sound source that sends the first voice data includes: using the voice collection device to determine the direction of the sound source that sends the first voice data; and using a position locating device to locate the sound source based on that direction to obtain the position information of the sound source; wherein the position locating device includes: a microwave radar module; and the position information includes: distance and direction. And/or, the acquisition unit enhancing the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data includes: when the voice collection device includes a microphone array, turning on the microphones at the location information in the microphone array, and/or increasing the number of microphones turned on at the location information. And/or, the acquisition unit suppressing the collection strength of the second voice data at locations other than the location information includes: turning off the microphones at positions on the microphone array other than the location information, and/or reducing the number of microphones turned on at positions other than the location information.
  • The recognition unit using the preset far-field voice recognition model to perform far-field voice recognition on the acquired second voice data includes: preprocessing the collected second voice data to obtain voice information; and then using the preset far-field speech recognition model to perform far-field speech recognition processing on the preprocessed voice information; wherein the far-field speech recognition model includes: a far-field acoustic model obtained by deep learning training based on an LSTM algorithm.
  • The acquisition unit is also used to collect voice data and its sound source data; and the recognition unit is also used to preprocess the voice data and its sound source data and then train an LSTM model, obtaining an LSTM-based far-field speech recognition model.
  • an air conditioner including: the voice recognition device described above.
  • Another aspect of the present application provides a storage medium, including: a plurality of instructions stored in the storage medium; the plurality of instructions are configured to be loaded by a processor to execute the aforementioned speech recognition method.
  • An air conditioner, which includes: a processor configured to execute multiple instructions; and a memory configured to store the multiple instructions; wherein the multiple instructions are stored by the memory, and are loaded by the processor to execute the above-mentioned voice recognition method.
  • the solution of the present application automatically recognizes various surrounding environments through microwave radar technology, and uses deep learning algorithms to improve the accuracy of far-field speech recognition, and has a good user experience.
  • The solution of this application uses microwave radar technology to locate the sound source position, adjusts the collection state of the microphone array according to the sound source position, and further uses the far-field speech recognition model trained with the LSTM deep learning algorithm to recognize far-field speech data, which can ensure a high recognition rate and meet the needs of use in complex environments.
  • The solution of this application, based on microwave radar technology and combined with the LSTM deep learning algorithm model, uses the sound source and voice data to train a far-field voice recognition model and accurately and efficiently converts voice data into text data, which can improve the far-field speech recognition effect.
  • The solution of the present application combines front-end information processing technology and back-end speech recognition technology, namely: it obtains the position parameters of the sound source through microwave radar technology, and combines the audio data with the position data (such as the position parameters of the sound source) to train the far-field acoustic model through the LSTM algorithm, which is suited to long audio data and audio context; this can shorten the response time and improve the noise reduction effect.
  • the solution of the present application uses a microphone array to roughly identify the sound source direction of the wake-up word voice, uses microwave radar technology to accurately calculate the distance and direction of the sound source in real time, and then uses edge computing technology to control the microphone array in real time.
  • combining sound source data and voice data, training and using a far-field acoustic model based on LSTM can improve far-field recognition efficiency and noise reduction effects, and shorten response time.
  • The solution of this application uses microwave radar technology to locate the sound source position, adjusts the collection state of the microphone array according to the sound source position, and further uses the far-field speech recognition model trained with the LSTM deep learning algorithm to perform far-field recognition of the speech data, so as to solve the problem that removing reverberation and noise from far-field voice data by deep learning methods or microphone array methods yields poor far-field voice recognition results; this overcomes the defects of a low far-field voice recognition rate, long response time, and poor noise reduction effect, and achieves the beneficial effects of high far-field recognition efficiency, short response time, and good noise reduction.
  • FIG. 1 is a schematic flowchart of an embodiment of the speech recognition method of this application;
  • FIG. 2 is a schematic flowchart of an embodiment of adjusting the collection state of second voice data according to the first voice data in the method of this application;
  • FIG. 3 is a schematic flowchart of an embodiment of determining the location information of the sound source for sending the first voice data in the method of this application;
  • FIG. 4 is a schematic flowchart of an embodiment of performing far-field voice recognition on acquired second voice data using a preset far-field voice recognition model in the method of this application;
  • FIG. 5 is a schematic flowchart of an embodiment in which a preset far-field speech recognition model is trained in the method of this application;
  • FIG. 6 is a schematic structural diagram of an embodiment of the speech recognition device of this application.
  • FIG. 7 is a schematic structural diagram of a far-field speech recognition system based on microwave radar according to an embodiment of the air conditioner of this application;
  • FIG. 8 is a schematic flowchart of a far-field speech recognition algorithm based on microwave radar according to an embodiment of the air conditioner of this application.
  • A speech recognition method is provided, as shown in FIG. 1, which is a schematic flowchart of an embodiment of the method of the present application.
  • the voice recognition method may include: step S110 to step S130.
  • In step S110, first voice data is acquired.
  • The first voice data may include: a voice wake-up word; of course, the first voice data may also include a voice instruction.
  • the voice wake-up word is voice data that can be used to wake up a voice device.
  • acquiring the first voice data in step S110 may include: acquiring the first voice data collected by the voice collecting device.
  • the first voice data is acquired by the voice collection device collecting the first voice data, so that the acquisition of the first voice data is convenient and accurate.
  • In step S120, the collection state of the second voice data is adjusted according to the first voice data, and the second voice data is acquired based on the adjusted collection state.
  • For the microphone array on the device-side processing platform: first, the microphone array is used to locate the sound source of the wake-up word (for example, the microphone array determines the location of the wake-up word's voice source from the direction of the sound waves); then the microwave radar module is used to accurately collect distance and direction data (that is, the distance and direction of the sound source); next, the microphones at the corresponding positions on the microphone array module are turned on or off according to these data; finally, the far-field audio data is collected.
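  • The four-stage flow above can be sketched as follows; every class and method name here is a hypothetical stand-in for illustration, not an API from the patent:

```python
class StubMicArray:
    """Toy stand-in for the device microphone array."""
    def __init__(self):
        self.enabled_direction = None

    def estimate_wake_word_direction(self):
        return 15.0  # coarse wake-word angle in degrees (made up)

    def enable_toward(self, direction):
        self.enabled_direction = direction  # steer: enable mics this way

    def record(self):
        return "far-field audio from %.1f deg" % self.enabled_direction


class StubRadar:
    """Toy stand-in for the microwave radar module."""
    def locate(self, coarse_dir):
        # Refine the coarse direction and add a distance (values made up).
        return 3.2, coarse_dir - 2.0


def far_field_capture(mic_array, radar):
    coarse = mic_array.estimate_wake_word_direction()  # 1: coarse direction
    distance, direction = radar.locate(coarse)         # 2: radar refines it
    mic_array.enable_toward(direction)                 # 3: steer the mics
    return mic_array.record()                          # 4: collect far-field audio


audio = far_field_capture(StubMicArray(), StubRadar())
```

  • The point of the split is that steps 1 to 3 run entirely on the device side, so the microphone state can react in real time before any audio is sent for recognition.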
  • the second voice data may include: voice instructions, and of course, the second voice data may also include the next voice wake-up word.
  • the voice command is voice data that can be used to control a voice device.
  • The operation of acquiring the first voice data in step S110, the operation of adjusting the collection state of the second voice data according to the first voice data in step S120, and the operation of acquiring the second voice data based on the adjusted collection state are executed on the local side of the voice device.
  • In this way, the accuracy and reliability of the acquisition can be improved, and the processing efficiency can be improved.
  • Referring to FIG. 2, a schematic flowchart of an embodiment of adjusting the collection state of the second voice data according to the first voice data in the method of the present application, the specific process of adjusting the collection state of the second voice data according to the first voice data in step S120 may include step S210 and step S220.
  • Step S210 Determine the location information of the sound source that sends the first voice data.
  • As shown in FIG. 3, the specific process of determining the location information of the sound source may include step S310 and step S320.
  • Step S310 Determine the direction of the sound source sending the first voice data by using the voice collection device.
  • Using the microphone array to roughly identify the sound source direction of the wake-up word voice may include: the voice recognition system first needs to wake up the device (such as an air conditioner) through the voice wake-up word.
  • the general direction of the voice source of the wake-up word can be obtained first through the microphone array technology.
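  • One common way to obtain such a rough direction from a microphone array is the inter-microphone time difference of arrival; the sketch below illustrates that general technique under stated assumptions (the sampling rate, spacing, and signals are hypothetical, and the patent does not specify the estimation method):

```python
import numpy as np

def estimate_delay(sig_a, sig_b):
    """Estimate the sample delay of sig_b relative to sig_a by
    cross-correlation; the peak lag gives the time difference of arrival."""
    n = len(sig_a)
    corr = np.correlate(sig_b, sig_a, mode="full")
    return int(np.argmax(corr) - (n - 1))

def delay_to_angle(lag, fs, mic_spacing_m, c=343.0):
    """Far-field direction of arrival: sin(theta) = c * tau / d."""
    tau = lag / fs
    s = np.clip(c * tau / mic_spacing_m, -1.0, 1.0)
    return float(np.degrees(np.arcsin(s)))

# Hypothetical wake-word burst arriving 4 samples later at the second mic
# of a two-microphone pair spaced 10 cm apart, sampled at 16 kHz.
fs, d = 16000, 0.1
rng = np.random.default_rng(2)
burst = rng.standard_normal(512)
mic1 = burst
mic2 = np.concatenate([np.zeros(4), burst])[:512]

lag = estimate_delay(mic1, mic2)
angle = delay_to_angle(lag, fs, d)   # roughly 59 degrees off broadside
```

  • With only two microphones the angle is ambiguous front-to-back, which is one reason the method refines the estimate with the radar module rather than relying on the array alone.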
  • Step S320 Use the position positioning device to locate the sound source based on the direction to obtain position information of the sound source.
  • the position positioning device may include a microwave radar module.
  • the position positioning device may also include other positioning modules, so that on the basis of microwave radar positioning technology, the problem of far-field speech recognition in a complex environment can be solved.
  • the location information may include: distance and direction.
  • Using microwave radar technology to accurately calculate the distance and direction of the sound source in real time may include: the microwave radar sends a microwave signal through a transmitting device; the signal is reflected when it encounters an object; and by collecting the reflected microwave signal with a receiving device, the location, size, shape, and other data of objects in the environment can be obtained.
  • this technology can be used to obtain location data of the sound source (the person making the sound).
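  • The ranging principle here is standard radar time-of-flight; a minimal illustration (the target distance and timing values are hypothetical):

```python
C = 3.0e8  # speed of light in m/s

def echo_distance_m(round_trip_s):
    """Target distance = c * round-trip time / 2 (the signal travels
    out to the reflector and back)."""
    return C * round_trip_s / 2

# A person standing 3 m away reflects the microwave after a ~20 ns round trip.
rt = 2 * 3.0 / C
dist = echo_distance_m(rt)
```

  • Direction then comes from the antenna geometry (e.g. which receive channel sees the echo and with what phase), giving the distance-plus-direction pair the method feeds to the microphone array controller.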
  • The direction of the sound source of the first voice data is determined by the voice collection device, and the position locating device is used to locate the sound source based on that direction to determine the position information of the sound source, so that the determination of the location information of the sound source of the first voice data is accurate and reliable.
  • In step S220, based on the location information, the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data is enhanced, and/or the collection strength of the second voice data at locations other than the location information by the voice collection device that collected the first voice data is suppressed, so as to adjust the collection state of the second voice data by the voice collection device.
  • the collection status of the voice collection device may include: the collection strength of the voice collection device.
  • A combination of the cloud (i.e., the cloud processing platform) and the end (i.e., the device processing end, or device-side processing platform) may be used.
  • On the device processing side, the microphone array is first used to roughly identify the sound source direction of the wake-up words; then microwave radar technology is used to accurately calculate the distance and direction of the sound source in real time; and then edge computing technology is used to control the state of the microphone array in real time.
  • adjusting the collection intensity of the second voice data by the voice collection device based on the location information of the sound source of the first voice data is beneficial to improve the convenience and reliability of the collection of the second voice data.
  • The operation of enhancing the collection strength of the second voice data at the location information by the voice collection device that acquired the first voice data, and/or suppressing the collection strength of the second voice data at locations other than the location information, may include at least one of the following adjustment scenarios.
• The first adjustment scenario: enhancing the collection strength of the second voice data at the position information by the voice collection device that acquired the first voice data may include: in the case that the voice collection device includes a microphone array, turning on the microphones at the position information in the microphone array, and/or increasing the number of microphones turned on at the position information in the microphone array.
• The second adjustment scenario: suppressing the collection strength of the second voice data at positions other than the position information by the voice collection device that collected the first voice data may include: turning off the microphones at positions in the microphone array other than the position information, and/or reducing the number of microphones turned on at positions in the microphone array other than the position information.
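The two adjustment scenarios above can be sketched as a simple on/off state controller over microphone directions; the four-microphone circular layout, the 45-degree beam half-width, and the `adjust_array` helper are illustrative assumptions, not the patent's concrete design:

```python
def adjust_array(mic_angles_deg, source_deg, beam_half_width_deg=45.0):
    """Return per-microphone on/off states: enable microphones whose pointing
    direction lies within the beam around the located source, disable
    (suppress) the rest."""
    states = {}
    for angle in mic_angles_deg:
        # Angular distance wrapped to [0, 180] degrees.
        diff = abs((angle - source_deg + 180) % 360 - 180)
        states[angle] = diff <= beam_half_width_deg
    return states

# Four microphones facing 0/90/180/270 degrees; source located "directly in
# front" (0 degrees): only the front microphone stays on.
states = adjust_array([0, 90, 180, 270], source_deg=0)
```

In practice the same controller could also scale the number of active microphones rather than toggle them individually, matching the "increase/reduce the number of microphones turned on" wording above.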
  • the microphone array of the present application has multiple microphone devices, and on the basis of obtaining the approximate location of the sound source through the wake-up word, the state of the microphone is controlled by the front-end device.
• For example, the microphone array has four microphones facing different directions, and the sound source is located directly in front. At this time, the reception effect (the ability to receive audio signals) of the microphone in that direction can be enhanced, and the reception effect of the microphones in other directions can be suppressed, thereby removing noise from other directions.
• In this way, the accuracy and reliability with which the voice collection device collects the second voice data can be improved, which in turn is beneficial to improving the accuracy and reliability of voice recognition and voice control.
  • acquiring the second voice data in step S120 may include: acquiring the second voice data collected by the voice collecting device after adjusting the collection state.
  • the second voice data is acquired by the voice acquisition device collecting the second voice data, so that the acquisition of the second voice data is convenient and accurate.
  • the voice collection device may include: a microphone array.
• In the microphone array, more than one microphone is provided, which can be used to collect voice data in more than one direction.
  • the method of obtaining is flexible, and the obtained result is reliable.
• Step S130: perform far-field voice recognition on the acquired second voice data by using a preset far-field voice recognition model to obtain semantic information corresponding to the acquired second voice data, so as to control the voice device to execute corresponding operations according to the semantic information.
• The semantic information may include: semantic text data.
• The text data may be the text data obtained by converting the voice data using the trained acoustic model.
• In this way, the accuracy and reliability of acquiring the second voice data can be guaranteed; and using the preset far-field voice recognition model to perform far-field voice recognition on the second voice data can improve the efficiency and effect of that far-field voice recognition.
• Optionally, the preset far-field voice recognition model is used to perform the far-field voice recognition operation on the acquired second voice data in the cloud, and the voice device receives, on its local side, the feedback information processed by the cloud.
  • the operation of performing far-field voice recognition on the acquired second voice data using the preset far-field voice recognition model is executed through the cloud, and then the operation result is fed back to the local side of the voice device.
• On one hand, the efficiency and reliability of data processing and storage can be improved; on the other hand, the pressure of data processing and storage on the local side of the voice device can be reduced, thereby improving the convenience and reliability of voice control by the voice device.
• The specific process of performing far-field voice recognition on the acquired second voice data using the preset far-field voice recognition model may include: step S410 and step S420.
  • Step S410 preprocessing the collected second voice data to obtain voice information.
  • Step S420 using the preset far-field speech recognition model to perform far-field speech recognition processing on the preprocessed speech information.
• The preprocessing may include operations such as missing-value handling, normalization, and noise reduction.
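The three preprocessing steps named above can be sketched as follows; the concrete choices (mean-fill for missing values, z-score normalization, a 3-point moving average as noise reduction) are assumptions for this sketch, not the patent's prescribed methods:

```python
import numpy as np

def preprocess(frames):
    """Illustrative preprocessing: missing-value handling, normalization,
    and simple noise reduction, in that order."""
    x = np.asarray(frames, dtype=float)
    # Missing values: replace NaNs with the mean of the valid samples.
    x = np.where(np.isnan(x), np.nanmean(x), x)
    # Normalization: zero mean, unit variance.
    x = (x - x.mean()) / (x.std() + 1e-8)
    # Noise reduction: 3-point moving average as a crude low-pass filter.
    kernel = np.ones(3) / 3.0
    return np.convolve(x, kernel, mode="same")

clean = preprocess([0.1, np.nan, 0.3, 0.2, np.nan, 0.4])
```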
  • the far-field speech recognition model may include: a far-field acoustic model obtained by deep learning training based on an LSTM algorithm.
• For example: the microphone array receives voice data and determines the approximate location of the sound source of the wake-up word; the microwave radar obtains the position parameters (direction and distance data) of the sound source, that is, obtains the sound source data; adjusting the microphone array state enhances or suppresses the microphones in the corresponding directions according to the sound source position data; the LSTM-based far-field acoustic model is an acoustic model trained on the sound source data and voice data, which converts the voice data into corresponding text data.
• Data preprocessing: the same preprocessing method as used when training the LSTM acoustic model in step 1;
• LSTM-based far-field acoustic model: use the trained LSTM far-field acoustic model to perform speech recognition;
• Voice text data: obtain the corresponding text data according to the speech recognition result of the model.
• On the basis of microwave radar technology, combined with the LSTM deep learning algorithm model, the far-field speech recognition model is trained using sound source and voice data, and the voice data is accurately and efficiently converted into text data, providing a far-field voice system that meets user needs and has a high recognition rate.
• In this way, by preprocessing the collected second voice data, the accuracy and reliability of the second voice data itself can be improved; then, using the preset far-field voice recognition model to perform far-field voice recognition on the preprocessed voice information can ensure the accuracy and reliability of recognizing the second voice data.
  • it may further include: training to obtain a preset far-field speech recognition model.
  • Step S510 Collect voice data and its sound source data.
  • the voice data may include: voice wake-up words and/or voice commands.
  • the sound source data may include the position parameters (direction and distance data) of the sound source; the voice data may be voice data received by the microphone after adjusting the state of the microphone array.
  • Step S520 after preprocessing the voice data and its sound source data, use the LSTM model for training to obtain a far-field voice recognition model based on the LSTM.
• Optionally, the voice device receives, on its local side, the feedback information processed by the cloud. For example: on the cloud processing end, sound source data and voice data are combined to train and use an LSTM-based far-field acoustic model.
• In the solution of the present application, front-end information processing technology and back-end speech recognition technology are combined: the position parameters of the sound source are obtained by means of microwave radar technology, the audio data and the position data (such as the position parameters of the sound source) are combined, and the far-field acoustic model is trained through an LSTM algorithm suitable for long audio data and audio-data context. Various surrounding environments are automatically recognized through microwave radar technology, and deep learning algorithms are used to improve the accuracy of far-field speech recognition.
• Training the LSTM acoustic model may specifically include: collecting the aforementioned historical data (sound source and voice history record data); data preprocessing: handling missing values, standardization, noise reduction, and other preprocessing; loading the data into the model through the input layer of the LSTM model; the intermediate processing layer of the LSTM model; and the text output layer: outputting the text data converted from the voice data, so as to obtain an LSTM-based far-field acoustic model.
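To make the LSTM layer structure described above concrete (input layer, intermediate processing layer, output layer), here is a minimal NumPy sketch of a single LSTM cell update. The weights are random stand-ins for illustration; the patent's model would instead be fitted on the collected sound-source and voice history data:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM cell update; W, U, b pack the input/forget/cell/output gates."""
    z = W @ x + U @ h_prev + b
    i, f, g, o = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # new cell state
    h = sigmoid(o) * np.tanh(c)                          # new hidden state
    return h, c

n_in, n_hidden = 8, 4
W = rng.standard_normal((4 * n_hidden, n_in))
U = rng.standard_normal((4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for frame in rng.standard_normal((5, n_in)):   # five acoustic feature frames
    h, c = lstm_step(frame, h, c, W, U, b)
```

The recurrent cell state is what lets the model exploit long audio and audio-data context, which the text cites as the reason for choosing LSTM over models limited to short audio.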
• In this way, an LSTM-based far-field voice recognition model is obtained, which facilitates using the far-field voice recognition model to perform far-field speech recognition on the second voice data, with high recognition efficiency and good recognition effect.
  • a voice recognition device corresponding to the voice recognition method is also provided. See FIG. 6 for a schematic structural diagram of an embodiment of the device of the present application.
  • the voice recognition device may include: an acquiring unit 102 and a recognition unit 104.
  • the acquiring unit 102 may be used to acquire the first voice data. For specific functions and processing of the acquiring unit 102, refer to step S110.
  • the first voice data may include: voice wake-up words, of course, the first voice data may also include voice instructions.
  • the voice wake-up word is voice data that can be used to wake up a voice device.
  • the acquiring unit 102 acquiring the first voice data may include: the acquiring unit 102 may also be specifically configured to acquire the first voice data collected by the voice collecting device.
  • the first voice data is acquired by the voice collection device collecting the first voice data, so that the acquisition of the first voice data is convenient and accurate.
  • the acquiring unit 102 may also be configured to adjust the collection state of the second voice data according to the first voice data, and acquire the second voice data based on the adjusted collection state.
  • the specific function and processing of the acquiring unit 102 also refer to step S120.
• On the device-side processing platform, first use the microphone array to locate the sound source of the wake-up word (for example, use the microphone array to determine the location of the wake-up word's voice source through the direction of the sound waves), then use the microwave radar module to accurately locate the sound source and collect distance and direction data (that is, the distance and direction of the sound source); then turn the microphones at the corresponding positions on the microphone array module on and off according to those data; finally, collect the far-field audio data.
  • the second voice data may include: voice instructions, and of course, the second voice data may also include the next voice wake-up word.
  • the voice command is voice data that can be used to control a voice device.
• The operation of the acquiring unit 102 acquiring the first voice data, the operation of the acquiring unit 102 adjusting the collection state of the second voice data according to the first voice data, and the operation of acquiring the second voice data based on the adjusted collection state are performed on the local side of the voice device.
• In this way, the accuracy and reliability of the acquisition can be improved, and the processing efficiency can be improved.
  • the obtaining unit 102 adjusting the collection state of the second voice data according to the first voice data may include:
  • the acquiring unit 102 may also be specifically configured to determine the location information of the sound source that sends the first voice data. For the specific function and processing of the acquiring unit 102, refer to step S210.
  • the acquiring unit 102 determining the location information of the sound source sending the first voice data may include:
  • the acquiring unit 102 may also be specifically configured to use a voice collection device to determine the direction of the sound source that sends the first voice data. For the specific function and processing of the acquiring unit 102, refer to step S310.
• Using the microphone array to roughly identify the sound-source direction of the wake-up word voice may include: the voice recognition system needs to wake up the device (such as a certain air conditioner) through the voice wake-up word.
  • the general direction of the voice source of the wake-up word can be obtained first through the microphone array technology.
  • the acquiring unit 102 may also be specifically configured to use a position positioning device to locate the sound source based on the direction to obtain the position information of the sound source. See also step S320 for specific functions and processing of the acquiring unit 102.
  • the position positioning device may include a microwave radar module.
  • the position positioning device may also include other positioning modules, so that on the basis of microwave radar positioning technology, the problem of far-field speech recognition in a complex environment can be solved.
  • the location information may include: distance and direction.
• Using microwave radar technology to accurately calculate the distance and direction of the sound source in real time may include: the microwave radar sends a microwave signal through a transmitting device; the signal is reflected when it encounters an object, and by collecting the reflected microwave signal through a receiving device, the location, size, shape, and other data of objects in the environment can be obtained.
  • this technology can be used to obtain location data of the sound source (the person making the sound).
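The ranging arithmetic implied above is simple to state: the microwave pulse travels to the reflector and back, so range is half the round-trip time multiplied by the speed of light. The numeric values below are made up for the example:

```python
SPEED_OF_LIGHT = 3.0e8  # metres per second (approximate)

def radar_range(round_trip_s):
    """Range of a reflector from the measured echo round-trip time."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0

# An echo arriving 20 ns after transmission puts the reflector 3 m away.
distance_m = radar_range(20e-9)
```

Direction would come from the antenna geometry (e.g. the beam angle at which the echo is strongest), which this sketch does not model.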
• In this way, the direction of the sound source of the first voice data is determined by the voice collection device, and the position locating device then locates the sound source based on that direction to determine its position information, so that the position information of the sound source of the first voice data is determined accurately and reliably.
• The acquiring unit 102 may also be specifically configured to, based on the position information, enhance the collection strength of the second voice data at the position information by the voice collection device that acquired the first voice data, and/or suppress the collection strength of the second voice data at positions other than the position information, so as to adjust the collection state of the voice collection device for the second voice data.
  • the collection status of the voice collection device may include: the collection strength of the voice collection device. See also step S220 for specific functions and processing of the acquiring unit 102.
• For example, a combination of the cloud (i.e., the cloud processing platform) and the end (i.e., the device processing end, or device-side processing platform) may be used.
• On the device processing side, first use the microphone array to roughly identify the sound-source direction of the wake-up word, then use microwave radar technology to accurately calculate the distance and direction of the sound source in real time, and then use edge computing technology to control the state of the microphone array in real time.
  • adjusting the collection intensity of the second voice data by the voice collection device based on the location information of the sound source of the first voice data is beneficial to improve the convenience and reliability of the collection of the second voice data.
• The operation of the acquiring unit 102 enhancing the collection strength of the second voice data at the position information by the voice collection device that acquired the first voice data, and/or suppressing the collection strength of the second voice data at positions other than the position information, may include at least one of the following adjustment scenarios.
• The first adjustment scenario: the acquiring unit 102 enhancing the collection strength of the second voice data at the position information may include: the acquiring unit 102 may be specifically configured to, in the case that the voice collection device includes a microphone array, turn on the microphones at the position information in the microphone array, and/or increase the number of microphones turned on at the position information in the microphone array.
• The second adjustment scenario: the acquiring unit 102 suppressing the collection strength of the second voice data at positions other than the position information may include: the acquiring unit 102 may be specifically configured to turn off the microphones at positions in the microphone array other than the position information, and/or reduce the number of microphones turned on at positions in the microphone array other than the position information.
  • the microphone array of the present application has multiple microphone devices, and on the basis of obtaining the approximate location of the sound source through the wake-up word, the state of the microphone is controlled by the front-end device.
• For example, the microphone array has four microphones facing different directions, and the sound source is located directly in front. At this time, the reception effect (the ability to receive audio signals) of the microphone in that direction can be enhanced, and the reception effect of the microphones in other directions can be suppressed, thereby removing noise from other directions.
• In this way, the accuracy and reliability with which the voice collection device collects the second voice data can be improved, which in turn is beneficial to improving the accuracy and reliability of voice recognition and voice control.
  • the acquiring unit 102 acquiring the second voice data may include: the acquiring unit 102, which may be specifically configured to acquire the second voice data collected by the voice collecting device after adjusting the collection state.
  • the second voice data is acquired by the voice acquisition device collecting the second voice data, so that the acquisition of the second voice data is convenient and accurate.
  • the voice collection device may include: a microphone array.
• In the microphone array, more than one microphone is provided, which can be used to collect voice data in more than one direction.
  • the method of obtaining is flexible, and the obtained result is reliable.
• The recognition unit 104 may be used to perform far-field voice recognition on the acquired second voice data by using the preset far-field voice recognition model to obtain semantic information corresponding to the acquired second voice data, so as to control the voice device to execute corresponding operations according to the semantic information.
  • the semantic information may include: semantic text data.
• The text data may be the text data obtained by converting the voice data using the trained acoustic model.
• In this way, the accuracy and reliability of acquiring the second voice data can be guaranteed; and using the preset far-field voice recognition model to perform far-field voice recognition on the second voice data can improve the efficiency and effect of that far-field voice recognition.
• Optionally, the recognition unit 104 uses the preset far-field voice recognition model to perform the far-field voice recognition operation on the acquired second voice data in the cloud, and the voice device receives, on its local side, the feedback information processed by the cloud.
  • the operation of performing far-field voice recognition on the acquired second voice data using the preset far-field voice recognition model is executed through the cloud, and then the operation result is fed back to the local side of the voice device.
• On one hand, the efficiency and reliability of data processing and storage can be improved; on the other hand, the pressure of data processing and storage on the local side of the voice device can be reduced, thereby improving the convenience and reliability of voice control by the voice device.
  • the recognition unit 104 uses a preset far-field voice recognition model to perform far-field voice recognition on the acquired second voice data, which may include:
  • the recognition unit 104 can also be specifically used to preprocess the collected second voice data to obtain voice information. See also step S410 for the specific functions and processing of the identification unit 104.
• The recognition unit 104 may also be specifically configured to then use the preset far-field speech recognition model to perform far-field speech recognition processing on the preprocessed speech information.
• The preprocessing may include operations such as missing-value handling, normalization, and noise reduction. For the specific function and processing of the recognition unit 104, refer to step S420.
  • the far-field speech recognition model may include: a far-field acoustic model obtained by deep learning training based on an LSTM algorithm.
• For example: the microphone array receives voice data and determines the approximate location of the sound source of the wake-up word; the microwave radar obtains the position parameters (direction and distance data) of the sound source, that is, obtains the sound source data; adjusting the microphone array state enhances or suppresses the microphones in the corresponding directions according to the sound source position data; the LSTM-based far-field acoustic model is an acoustic model trained on the sound source data and voice data, which converts the voice data into corresponding text data.
• Data preprocessing: the same preprocessing method as used when training the LSTM acoustic model in step 1;
• LSTM-based far-field acoustic model: use the trained LSTM far-field acoustic model to perform speech recognition;
• Voice text data: obtain the corresponding text data according to the speech recognition result of the model.
• On the basis of microwave radar technology, combined with the LSTM deep learning algorithm model, the far-field speech recognition model is trained using sound source and voice data, and the voice data is accurately and efficiently converted into text data, providing a far-field voice system that meets user needs and has a high recognition rate.
• In this way, by preprocessing the collected second voice data, the accuracy and reliability of the second voice data itself can be improved; then, using the preset far-field voice recognition model to perform far-field voice recognition on the preprocessed voice information can ensure the accuracy and reliability of recognizing the second voice data.
  • it may further include: training to obtain a preset far-field speech recognition model, which may be specifically as follows:
  • the acquiring unit 102 may also be used to collect voice data and sound source data.
  • the voice data may include: voice wake-up words and/or voice commands.
  • the sound source data may include the position parameters (direction and distance data) of the sound source; the voice data may be voice data received by the microphone after adjusting the state of the microphone array.
  • the recognition unit 104 may also be used to preprocess the voice data and its sound source data, and then use the LSTM model for training to obtain a far-field voice recognition model based on the LSTM.
• The operation of collecting the voice data and its sound source data, the operation of preprocessing the voice data and its sound source data, and the operation of training using the LSTM model are performed in the cloud, and the voice device receives, on its local side, the feedback information processed by the cloud.
• For the specific function and processing of the recognition unit 104, refer to step S520. For example: on the cloud processing end, sound source data and voice data are combined to train and use an LSTM-based far-field acoustic model.
• In the solution of the present application, front-end information processing technology and back-end speech recognition technology are combined: the position parameters of the sound source are obtained by means of microwave radar technology, the audio data and the position data (such as the position parameters of the sound source) are combined, and the far-field acoustic model is trained through an LSTM algorithm suitable for long audio data and audio-data context. Various surrounding environments are automatically recognized through microwave radar technology, and deep learning algorithms are used to improve the accuracy of far-field speech recognition.
• Training the LSTM acoustic model may specifically include: collecting the aforementioned historical data (sound source and voice history record data); data preprocessing: handling missing values, standardization, noise reduction, and other preprocessing; loading the data into the model through the input layer of the LSTM model; the intermediate processing layer of the LSTM model; and the text output layer: outputting the text data converted from the voice data, so as to obtain an LSTM-based far-field acoustic model.
• In this way, an LSTM-based far-field voice recognition model is obtained, which facilitates using the far-field voice recognition model to perform far-field speech recognition on the second voice data, with high recognition efficiency and good recognition effect.
• By adopting the technical solution of the present application, the sound source position is located using microwave radar technology, the collection state of the microphone array is adjusted according to the sound source position, and the far-field speech recognition model trained based on the LSTM deep learning algorithm is further used to perform far-field recognition of the voice data, which can ensure a high recognition rate and meet the needs of use in complex environments.
  • an air conditioner corresponding to a voice recognition device is also provided.
  • the air conditioner may include: the voice recognition device described above.
• The front-end microphone array technology improves the voice recognition effect by increasing the number of microphones. However, due to product price and size limitations, the number of microphones and the distance between microphones are limited, and every microphone has the same function and effect, so noise is received from multiple directions and the accuracy of speech recognition is reduced. Therefore, this technology has a lower cost-performance ratio and can only distinguish a smaller range of directions.
• The existing acoustic model is mainly used to process near-field short audio data; it can only process voice audio data and cannot perceive or obtain the position parameters (distance and direction) of the sound source, so it can only adapt to voice recognition in a specific environment.
  • the existing acoustic model belongs to the back-end speech recognition processing technology, and is not closely integrated with the front-end signal processing equipment or algorithms.
  • the solution of the present application is based on the microwave radar positioning technology to solve the problem of far-field speech recognition in a complex environment.
• LSTM: Long Short-Term Memory.
• The solution of the present application combines front-end information processing technology and back-end speech recognition technology: the position parameters of the sound source are obtained by means of microwave radar technology, the audio data and the position data (such as the position parameters of the sound source) are combined, and the far-field acoustic model is trained through an LSTM algorithm suitable for long audio data and audio-data context.
• Here, long audio refers to audio of long duration, relative to short audio.
  • Most of the current technologies are suitable for short audio processing.
  • the solution of this application can realize long audio processing, so that more information can be extracted.
• For example, a combination of the cloud (i.e., the cloud processing platform) and the end (i.e., the device processing end, or device-side processing platform) may be used.
• On the device processing side, first use the microphone array to roughly identify the sound-source direction of the wake-up word, then use microwave radar technology to accurately calculate the distance and direction of the sound source in real time, and then use edge computing technology to control the state of the microphone array in real time.
• On the cloud processing end: combine sound source data and voice data to train and use an LSTM-based far-field acoustic model.
• Using the microphone array to roughly recognize the sound-source direction of the wake-up word voice may include: the voice recognition system needs to wake up the device (such as a certain air conditioner) through the voice wake-up word.
  • the general direction of the voice source of the wake-up word can be obtained first through the microphone array technology.
• The microwave radar sends a microwave signal through a transmitting device; the signal is reflected when it encounters an object, and the reflected microwave signal is collected by a receiving device, from which the location, size, shape, and other data of objects in the environment can be obtained. In the solution of the present application, this technology can be used to obtain location data of the sound source (the person making the sound).
  • edge computing technology is used to control the state of the microphone array in real time, including: there are multiple microphone devices in the microphone array of the present application, and the state of the microphone is controlled through the front-end device on the basis of obtaining the approximate position of the sound source through the wake-up word.
• For example, the microphone array has four microphones facing different directions, and the sound source is located directly in front.
• At this time, the reception effect of the microphone in that direction can be enhanced, and the reception effect of the microphones in other directions can be suppressed, thereby removing noise from other directions.
• Enhancing the microphone reception effect in this direction (the ability to receive audio signals) and suppressing the microphone reception effect in other directions mainly includes: turning the microphones in different directions in the microphone array on and off, and also filtering the audio received by the microphones.
• For example, by controlling the switch of, and filtering the signal from, the microphone in a certain direction, only a small amount of audio is received from that direction.
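The enhance/suppress-by-filtering idea can be sketched as per-channel gain weighting, where suppressed channels are attenuated rather than fully muted; the gain values and the `weight_channels` helper are assumptions for illustration only:

```python
import numpy as np

def weight_channels(channels, gains):
    """Apply a per-channel gain to a (n_channels, n_samples) signal block."""
    return np.asarray(gains)[:, None] * np.asarray(channels, dtype=float)

block = np.ones((4, 8))          # four microphones, eight samples each
gains = [1.0, 0.1, 0.1, 0.1]     # front mic enhanced, the rest suppressed
mixed = weight_channels(block, gains).sum(axis=0)   # combined output signal
```

Setting a gain to 0.0 corresponds to fully turning a microphone off, while a small nonzero gain matches the text's "a small amount of audio can be received in that direction".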
  • far-field speech recognition is a technical difficulty.
  • the microwave radar technology is used to automatically recognize various surrounding environments, and deep learning algorithms are used to improve the accuracy of far-field speech recognition.
  • the solution of this application mainly involves microwave radar positioning, deep learning, big data processing, edge computing, cloud computing and related technologies, and is divided into two functional modules: a device-side processing platform and a cloud processing platform.
  • the microphone array receives voice data and determines the approximate location of the wake-up word's sound source; the microwave radar obtains the position parameters of the sound source (direction and distance data), i.e. the sound source data; adjusting the microphone array state enhances or suppresses microphones in the corresponding directions according to the sound source position data; the LSTM-based far-field acoustic model, trained on sound source data and voice data, converts voice data into the corresponding text data.
  • the sound source data can include the position parameters of the sound source (direction and distance data); the voice data can be the voice data received by the microphones after the state of the microphone array has been adjusted; the text data can be the text obtained by converting the voice data with the trained acoustic model.
  • the implementation principle of the solution of the present application may include:
  • on the device-side processing platform, the microphone array is first used to locate the approximate position of the wake-up word's sound source (for example, by judging the direction of the incoming sound waves); the microwave radar module is then used to locate the sound source precisely and collect distance and direction data (i.e. the distance and direction of the sound source); next, the microphones at the corresponding positions of the microphone array module are turned on or off according to that data; finally, the far-field audio data is collected.
  • on the cloud processing platform, the LSTM acoustic model is first trained on manually collected and labeled sound source and audio databases to obtain a far-field speech recognition model; then real-time far-field speech recognition is performed with this model on voice data collected in real time; finally, high-accuracy speech-to-text data is obtained even in a complex environment.
  • the sound source position data is labeled mainly so that it can serve as annotation during training.
  • in the solution of the present application, far-field speech recognition can be performed accurately and efficiently in complex scenarios on the basis of microwave radar technology.
  • the specific process of microwave-radar-based far-field speech recognition in the solution of the present application may include:
  • Step 1. Train the LSTM acoustic model, which may specifically include:
  • Step 11. Collect the above-mentioned historical data (historical records of sound sources and voice).
  • Step 12. Data preprocessing: handle missing values, standardize, reduce noise, and apply other preprocessing to the data.
  • handling missing values means filling data items that may be missing, for example with the overall mean or another method.
  • standardization means homogenizing different kinds of data through normalization or a common measure, for example turning audio data and position data into the same type of data.
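Steps 11–12 can be sketched with a minimal preprocessing example (the mean-fill and z-score choices follow the description above; the function names are assumptions):

```python
# Sketch of the preprocessing in Step 12: fill missing values with the
# overall mean, then standardize so that different kinds of data (e.g.
# audio features and position data) live on a comparable scale.

def fill_missing(values):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Z-score normalization: zero mean, unit variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0  # guard against zero variance
    return [(v - mean) / std for v in values]

raw = [2.0, None, 4.0]  # a data column with a missing item
clean = standardize(fill_missing(raw))
# fill_missing(raw) → [2.0, 3.0, 4.0]
```

Noise reduction on the audio itself is a separate signal-processing step that the patent leaves unspecified, so it is not sketched here.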
  • Step 13. Load the data into the model through the input layer of the LSTM model.
  • Step 14. The intermediate processing layer of the LSTM model.
  • the intermediate processing layer is a processing stage of the neural network and a fixed operation in the LSTM algorithm; it uses input, forget, and output gates to update the cell states in the network and the weights of the connections between cells.
  • Step 15. Text output layer: output the text data converted from the voice data, yielding the LSTM-based far-field acoustic model.
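The input/forget/output update named in Step 14 can be sketched with a single scalar LSTM cell (a toy illustration only: all weight values below are made up, and a real acoustic model uses vector-valued layers plus a text output layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar inputs.

    The forget gate decides what to keep of the old cell state, the input
    gate decides how much of the new candidate to write, and the output
    gate decides what part of the cell state becomes the new hidden state.
    w maps each gate to (input weight, recurrent weight, bias).
    """
    f = sigmoid(w["f"][0] * x + w["f"][1] * h_prev + w["f"][2])    # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h_prev + w["i"][2])    # input gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h_prev + w["g"][2])  # candidate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h_prev + w["o"][2])    # output gate
    c = f * c_prev + i * g   # updated cell state
    h = o * math.tanh(c)     # updated hidden state
    return h, c

weights = {k: (0.5, 0.1, 0.0) for k in ("f", "i", "g", "o")}
h, c = 0.0, 0.0
for x in (0.2, -0.1, 0.4):  # a short input sequence
    h, c = lstm_step(x, h, c, weights)
```

In training, the gate weights themselves are what gets updated by backpropagation; here they are fixed so the state update is easy to follow.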
  • Step 2. Real-time voice: monitor the voice directed at the air conditioner in real time.
  • Step 3. Collect voice data and sound source data.
  • Step 4. Data preprocessing: this can be the same as the preprocessing used when training the LSTM acoustic model in Step 1.
  • Step 5. LSTM-based far-field acoustic model: perform speech recognition with the LSTM far-field acoustic model trained in Step 1.
  • Step 6. Voice text data: obtain the corresponding text data from the model's speech recognition result.
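Steps 2–6 amount to a simple run-time loop; a hedged sketch follows, in which the `recognize` stub stands in for the trained LSTM model (the patent does not specify it at code level) and all names are assumptions:

```python
# Run-time loop of Steps 3-6: collect voice + sound-source data,
# preprocess it the same way as during training, then run the far-field
# acoustic model to obtain text.

def preprocess(sample):
    """Placeholder for Step 4: same preprocessing as at training time."""
    return {**sample, "normalized": True}

def recognize(sample):
    """Placeholder for Step 5: the trained LSTM far-field acoustic model
    would map (voice, sound-source) features to text here."""
    return "turn on the air conditioner" if sample["voice"] else ""

def recognition_pipeline(samples):
    """Steps 3-6 for a batch of real-time samples."""
    return [recognize(preprocess(s)) for s in samples]

texts = recognition_pipeline([{"voice": [0.1, 0.2], "source": (2.5, 30)}])
# → ["turn on the air conditioner"]
```

In the patent's architecture this loop runs on the cloud processing platform, with the device side supplying the voice and sound-source data.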
  • the solution of this application trains a far-field speech recognition model on sound source and voice data and converts voice data into text data accurately and efficiently, providing users with a far-field speech system with a high recognition rate.
  • after conversion, the text data can be extracted and recognized in order to control the corresponding device; this is a necessary step for a speech recognition system.
  • with the technical solution of this application, a far-field speech recognition model is trained on sound source and voice data on the basis of microwave radar technology combined with an LSTM deep learning model, and voice data is converted into text data accurately and efficiently, which can improve the far-field speech recognition effect.
  • a storage medium corresponding to the voice recognition method is also provided.
  • the storage medium stores a plurality of instructions, which are to be loaded by a processor to execute the voice recognition method described above.
  • an air conditioner corresponding to the voice recognition method may include: a processor for executing a plurality of instructions; and a memory for storing the plurality of instructions, wherein the instructions are stored by the memory and are loaded and executed by the processor to perform the voice recognition method described above.
  • in the technical solution of this application, the microphone array is used to roughly identify the direction of the sound source from the wake-up word voice, microwave radar technology is used to accurately calculate the distance and direction of the sound source in real time, and edge computing technology is then used to control the state of the microphone array in real time; combining sound source data and voice data to train and use an LSTM-based far-field acoustic model can improve far-field recognition efficiency and noise reduction, and shorten response time.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Mechanical Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Electromagnetism (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Air Conditioning Control Device (AREA)
  • Fittings On The Vehicle Exterior For Carrying Loads, And Devices For Holding Or Mounting Articles (AREA)

Abstract

A voice recognition method and apparatus, a storage medium, and an air conditioner. The method comprises: acquiring first voice data (S110); adjusting the acquisition state of second voice data according to the first voice data, and acquiring the second voice data on the basis of the adjusted acquisition state (S120); and performing far-field voice recognition on the acquired second voice data by means of a preset far-field voice recognition model, so as to obtain semantic information corresponding to the acquired second voice data (S130). The method can solve the problem of poor far-field voice recognition performance that arises when deep learning or microphone array methods are used to remove reverberation and noise from far-field voice data, thereby improving the far-field voice recognition effect.

Description

一种语音识别方法、装置、存储介质及空调
本申请要求于2019年2月21日提交至中国国家知识产权局、申请号为201910130206.9、发明名称为“一种调整微波雷达设备的输出功率的方法及装置”的专利申请的优先权,其全部内容通过引用结合在本公开中。
技术领域
本申请属于语音控制技术领域,具体涉及一种语音识别方法、装置、存储介质及空调,尤其涉及一种基于微波雷达的远场语音识别方法、装置、存储介质及空调。
背景技术
语音识别技术是目前应用较为成熟的人机交互方式,从最初的手持设备这种近场的语音识别,如Sirfi语音识别以及各种语音助手,到现在,语音识别的应用已经完成向智能硬件、家电设备、机器人等领域上的延伸。但新的人机交互方式对硬件、软件、算法等方面的要求更加苛刻,特别是远场语音识别技术面临巨大的挑战。
随着智能家居***的不断发展,智能家居如语音空调属于远场语音识别技术。首先,人机之间的语音交互(这里主要指智能硬件、机器人等),区别于传统的有屏手持设备,在传统的语音交互中,因为是近场,语音信号质量相对较高,而且有触摸屏辅助,所以交互链路可以相对简单。通过点击屏幕触发,再通过点击屏幕或者能量VAD(Voice Activity Detection,语音活动检测)检测,来结束语音信号采集,即可完成一次交互,整个过程通过语音识别、语义理解、语音合成即可完成。
而对于人机之间的交互,由于涉及到远场,环境比较复杂,而且无屏交互,如果要像人与人之间的交流一样自然、持续、双向、可打断,整个交互过程需要解决的问题更多,为完成类似人类的语音交互,是一个需要软硬件一体、云+端相互配合的过程。
目前的语音识别算法的训练数据主要是利用手机上收集的语音进行训练,只适用近场识别。对于复杂的远场语音数据,存在大量的混响和噪音。相关技 术主要是利用深度学习方法或麦克风阵列方法去除混响和噪音,在实际应用过程中无法同时感知声源的位置和方向数据,从而只能使用通用方法(例如:前端的麦克风阵列方法和后端的神经网络算法)去处理语音数据,存在远场语音识别率低、响应时间长、降噪效果差等问题。
其中,利用深度学习方法或麦克风阵列方法去除混响和噪音,可以包括:
(1)麦克风阵列方法:主要是在混响的场景下提高音波方向估计的鲁棒性。通过集成多个麦克风来检测波束的方向后,利用波束形成技术抑制周围的非平稳噪声。但由于产品价格和尺寸的限制,麦克风的个数及每个麦克风的间距有限,能够分辨的方向范围较小。
(2)深度学习方法:通过信号处理的手段对混响和噪声的语音数据进行过滤和单一化处理,利用DNN或RNN等算法替代波束形成技术,实现语音增强。但处理效果较差,尤其在噪声很大的环境里远场语音识别效果不好。
上述内容仅用于辅助理解本申请的技术方案,并不代表承认上述内容是相关技术。
发明内容
本申请的目的在于,针对上述缺陷,提供一种语音识别方法、装置、存储介质及空调,以解决利用深度学习方法或麦克风阵列方法去除远场语音数据中的混响和噪音,存在远场语音识别效果差的问题,达到提升远场语音识别效果的效果。
本申请提供一种语音识别方法,包括:获取第一语音数据;根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据;利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息。
可选地,其中,该第一语音数据,包括:语音唤醒词;所述语音唤醒词,为用于唤醒语音设备的语音数据;和/或,该第二语音数据,包括:语音指令;所述语音指令,为用于控制语音设备的语音数据;和/或,获取第一语音数据的操作、根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行;和/或,利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息。
可选地,其中,获取第一语音数据,包括:获取由语音采集设备采集得到的第一语音数据;和/或,获取第二语音数据,包括:获取由调整采集状态后的语音采集设备采集得到的第二语音数据;其中,所述语音采集设备,包括:麦克风阵列;在所述麦克风阵列中,设置有用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
可选地,根据所述第一语音数据调整第二语音数据的采集状态,包括:确定发送所述第一语音数据的声源的位置信息;增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度。
可选地,其中,确定发送所述第一语音数据的声源的位置信息,包括:利用语音采集设备确定发送所述第一语音数据的声源的方向;利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息;其中,所述位置定位设备,包括:微波雷达模块;所述位置信息,包括:距离和方向;和/或,增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,包括:在所述语音采集设备包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风,和/或增加所述麦克风阵列中该位置信息上的麦克风的开启数量;和/或,抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,包括:关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风,和/或减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
可选地,利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,包括:对采集到的第二语音数据进行预处理,得到语音信息;再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理;其中,所述远场语音识别模型,包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
可选地,还包括:收集语音数据及其声源数据;对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。
与上述方法相匹配,本申请另一方面提供一种语音识别装置,包括:获取单元,用于获取第一语音数据;所述获取单元,还用于根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据;识别单元,用于利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息。
可选地,其中,该第一语音数据,包括:语音唤醒词;所述语音唤醒词,为用于唤醒语音设备的语音数据;和/或,该第二语音数据,包括:语音指令;所述语音指令,为用于控制语音设备的语音数据;和/或,获取第一语音数据的操作、根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行;和/或,利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息。
可选地,其中,所述获取单元获取第一语音数据,包括:获取由语音采集设备采集得到的第一语音数据;和/或,所述获取单元获取第二语音数据,包括:获取由调整采集状态后的语音采集设备采集得到的第二语音数据;其中,所述语音采集设备,包括:麦克风阵列;在所述麦克风阵列中,设置有用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
可选地,所述获取单元根据所述第一语音数据调整第二语音数据的采集状态,包括:确定发送所述第一语音数据的声源的位置信息;增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度。
可选地,其中,所述获取单元确定发送所述第一语音数据的声源的位置信息,包括:利用语音采集设备确定发送所述第一语音数据的声源的方向;利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息;其中,所述位置定位设备,包括:微波雷达模块;所述位置信息,包括:距离和方向;和/或,所述获取单元增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,包括:在所述语音采集设备包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风,和/或增加所述麦克风阵列中该位置信息上的麦克风的开启数量;和/或,所述获取单元抑制采集第 一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,包括:关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风,和/或减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
可选地,所述识别单元利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,包括:对采集到的第二语音数据进行预处理,得到语音信息;再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理;其中,所述远场语音识别模型,包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
可选地,还包括:所述获取单元,还用于收集语音数据及其声源数据;所述识别单元,还用于对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。
与上述装置相匹配,本申请再一方面提供一种空调,包括:以上所述的语音识别装置。
与上述方法相匹配,本申请再一方面提供一种存储介质,包括:所述存储介质中存储有多条指令;所述多条指令,用于由处理器加载并执行以上所述的语音识别方法。
与上述方法相匹配,本申请再一方面提供一种空调,包括:处理器,用于执行多条指令;存储器,用于存储多条指令;其中,所述多条指令,用于由所述存储器存储,并由所述处理器加载并执行以上所述的语音识别方法。
本申请的方案,通过微波雷达技术对的各种周边环境进行自动识别,利用深度学习算法可以提升远场语音识别准确率,用户体验好。
进一步,本申请的方案,通过利用微波雷达技术定位声源位置,根据声源位置调整麦克风阵列的采集状态,并进一步利用基于LSTM深度学习算法训练得到的远场语音识别模型对语音数据进行远场识别,可以保证高识别率,从而满足复杂环境下的使用需求。
进一步,本申请的方案,通过在微波雷达技术的基础,结合LSTM深度学习算法模型,利用声源和语音数据训练出远场语音识别模型,将语音数据准确高效地转化成文本数据,可以提升远场语音识别效果。
进一步,本申请的方案,通过将前端信息处理技术和后端语音识别技术相 结合,即:通过结合微波雷达技术获取声源的位置参数,将音频数据和位置数据(如声源的位置参数)相结合,通过适用于长音频数据和音频数据上下文的LSTM算法训练出远场声学模型,可以缩短响应时间短和提升降噪效果。
进一步,本申请的方案,通过利用麦克风阵列对唤醒词语音进行粗略地识别声源方向的基础上,利用微波雷达技术实时精确计算声源的距离和方向,再用边缘计算技术实时调控麦克风阵列的状态,结合声源数据和语音数据,训练并使用基于LSTM的远场声学模型,可以提升远场识别效率和降噪效果,缩短响应时间。
由此,本申请的方案,通过利用微波雷达技术定位声源位置,根据声源位置调整麦克风阵列的采集状态,并进一步利用基于LSTM深度学习算法训练得到的远场语音识别模型对语音数据进行远场识别,解决利用深度学习方法或麦克风阵列方法去除远场语音数据中的混响和噪音,存在远场语音识别效果差的问题,从而,克服相关技术中远场语音识别率低、响应时间长、降噪效果差的缺陷,实现远场识别效率高、响应时间短和降噪效果好的有益效果。
本申请的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。
下面通过附图和实施例,对本申请的技术方案做进一步的详细描述。
附图说明
图1为本申请的语音识别方法的一实施例的流程示意图;
图2为本申请的方法中根据所述第一语音数据调整第二语音数据的采集状态的一实施例的流程示意图;
图3为本申请的方法中确定发送所述第一语音数据的声源的位置信息的一实施例的流程示意图;
图4为本申请的方法中利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的一实施例的流程示意图;
图5为本申请的方法中训练得到预设的远场语音识别模型的一实施例的流程示意图;
图6为本申请的语音识别装置的一实施例的结构示意图;
图7为本申请的空调的一实施例的基于微波雷达的远场语音识别***的结构示意图;
图8为本申请的空调的一实施例的基于微波雷达的远场语音识别算法的流程示意图。
结合附图,本申请实施例中附图标记如下:
102-获取单元;104-识别单元。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合本申请具体实施例及相应的附图对本申请技术方案进行清楚、完整地描述。显然,所描述的实施例仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
根据本申请的实施例,提供了一种语音识别方法,如图1所示本申请的方法的一实施例的流程示意图。该语音识别方法可以包括:步骤S110至步骤S130。
在步骤S110处,获取第一语音数据。
其中,该第一语音数据,可以包括:语音唤醒词,当然该第一语音数据也可以包括语音指令。所述语音唤醒词,为可以用于唤醒语音设备的语音数据。
由此,通过获取多种形式的第一语音数据,可以方便在不同场合下基于第一语音数据调整第二语音数据的采集状态,提升用户使用的便捷性和通用性。
可选地,步骤S110中获取第一语音数据,可以包括:获取由语音采集设备采集得到的第一语音数据。
由此,通过语音采集设备采集第一语音数据的方式获取第一语音数据,使得对第一语音数据的获取便捷且精准。
在步骤S120处,根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据。
例如:在设备端处理平台上,首先利用麦克风阵列定位唤醒词声源大致方位(例如:通过麦克风阵列通过声波的方向判断唤醒词语音声源位置),再用微波雷达模块对声源进行精确定位,采集距离和方向(即声源的距离和方向)数据;然后根据该数据打开和关闭麦克风阵列模块上相对应位置上的麦克风;最后采集远场的音频数据。
其中,该第二语音数据,可以包括:语音指令,当然该第二语音数据也可以包括下一语音唤醒词。所述语音指令,为可以用于控制语音设备的语音数据。
由此,通过获取多种形式的第二语音数据,可以方便用户的多种语音控制需求,灵活且便捷。
具体地,步骤S110中获取第一语音数据的操作、步骤S120中根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行。
由此,通过在语音设备的本地侧执行获取第一语音数据和第二语音数据、并基于第一语音数据调整第二语音数据的采集状态的操作,可以提升获取的精准性和可靠性,并提升处理效率。
可选地,可以结合图2所示本申请的方法中根据所述第一语音数据调整第二语音数据的采集状态的一实施例流程示意图,进一步说明步骤S120中根据所述第一语音数据调整第二语音数据的采集状态的具体过程,可以包括:步骤S210和步骤S220。
步骤S210,确定发送所述第一语音数据的声源的位置信息。
更可选地,可以结合图3所示本申请的方法中确定发送所述第一语音数据的声源的位置信息的一实施例流程示意图,进一步说明步骤S210中确定发送所述第一语音数据的声源的位置信息的具体过程,可以包括:步骤S310和步骤S320。
步骤S310,利用语音采集设备确定发送所述第一语音数据的声源的方向。
例如:利用麦克风阵列对唤醒词语音进行粗略地识别声源方向,可以包括:语音识别***是需要先通过语音唤醒词(如:某某空调)来唤醒设备。本申请的方案中可以首先通过麦克风阵列技术获取唤醒词语音声源的大致方向。
步骤S320,利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息。
其中,所述位置定位设备,可以包括:微波雷达模块,当然该位置定位设备也可以包括其它定位模块,从而,可以在微波雷达定位技术的基础上,解决复杂环境下的远场语音识别问题。所述位置信息,可以包括:距离和方向。
例如:利用微波雷达技术实时精确计算声源的距离和方向,可以包括:微波雷达通过发送装置发出微波信号,信号在遇到物体后会产生反射,通过接收装置收反射回来的微波信号,就可以得到环境里的物***置、大小、形状等数据。本申请的方案中可以利用该技术获得声源(发出声音的人)的位置数据。
由此,通过语音采集设备确定第一语音数据的声源的方向,进一步基于该方向利用位置定位设备对该声源进行定位从而确定该声源的位置信息,使得对第一语音数据的声源的位置信息的确定精准而可靠。
步骤S220,基于该位置信息,增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度以调整所述语音采集设备对第二语音数据的采集状态。其中,语音采集设备的采集状态,可以包括:语音采集设备的采集强度。
例如:在远场环境下,采用云(即云端处理平台)和端(即设备处理端或设备端处理平台)相结合的处理方式。在设备处理端,首先利用麦克风阵列对唤醒词语音进行粗略地识别声源方向的基础上,然后利用微波雷达技术实时精确计算声源的距离和方向,再用边缘计算技术实时调控麦克风阵列的状态。
由此,通过基于第一语音数据的声源的位置信息对语音采集设备对第二语音数据的采集强度进行调整,有利于提升对第二语音数据的采集的便捷性和可靠性。
更可选地,步骤S220中增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度、和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度的操作,可以包括以下至少一种调整情形。
第一种调整情形:增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,可以包括:在所述语音采集设备可以包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风,和/或增加所述麦克风阵列中该位置信息上的麦克风的开启数量。
第二种调整情形:抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,可以包括:关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风,和/或减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
例如:用边缘计算技术实时调控麦克风阵列的状态,包括:本申请的麦克风阵列里有多个麦克风设备,在通过唤醒词获得声源大致方位的基础上,通过前端设备控制麦克风的状态。例如:麦克风阵列有不同方向上的4个麦克效果, 获得了声源的位置在正前方,这时可以增强该方向上的麦克风接收效果(接收音频信号的能力),抑制其他方向上的麦克风接收效果,从而去除其他方向上的噪声。
由此,通过基于第一语音数据的位置信息对语音采集设备在不同位置上的采集强度进行增强或降低,可以提升语音采集设备对第二语音数据采集的精准性和可靠性,进而有利于提升语音识别和语音控制的精准性和可靠性。
可选地,步骤S120中获取第二语音数据,可以包括:获取由调整采集状态后的语音采集设备采集得到的第二语音数据。
由此,通过语音采集设备采集第二语音数据的方式获取第二语音数据,使得对第二语音数据的获取便捷且精准。
其中,所述语音采集设备,可以包括:麦克风阵列。在所述麦克风阵列中,设置有可以用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
由此,通过使用麦克风阵列获取语音数据,获取的方式灵活,且获取的结果可靠。
在步骤S130处,利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息,以控制语音设备按该语义信息执行该第二语音数据。其中,该语义信息,可以包括:语义文本数据。例如:文本数据,可以是通过训练的声学模型将语音数据转化成得到的文本数据。
例如:在云端处理平台上,首先利用人工采集和标注的声源和音频数据库训练LSTM声学模型,得到远场语音识别模型;然后,通过实时采集语音数据,在上述模型上进行实时远场语音识别;最后得到复杂环境下、高准确率的语音文本数据。在复杂场景下,可以基于微波雷达技术,准确高效地进行远场语音识别。
由此,通过基于第一语音数据调整第二语音数据的采集状态后再获取第二语音数据,可以保证对第二语音数据获取的精准性和可靠性;并利用预设的远场语音识别模型对第二语音数据进行远场语音识别,可以提升对第二语音数据进行远场语音识别的效率和效果。
具体地,步骤S130中利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息。
由此,通过云端执行利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作后再将操作结果反馈至语音设备的本地侧,一方面可以提升数据处理的效率和存储可靠性,另一方面可以减轻语音设备的本地侧的数据处理和存储压力,进而提升语音设备进行语音控制的便捷性和可靠性。
可选地,可以结合图4所示本申请的方法中利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的一实施例流程示意图,进一步说明步骤S130中利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的具体过程,可以包括:步骤S410和步骤S420。
步骤S410,对采集到的第二语音数据进行预处理,得到语音信息。
步骤S420,再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理。该预处理,可以包括:缺失值、标准化、降噪等预处理。
其中,所述远场语音识别模型,可以包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
例如:在图7所示的***中,麦克风阵列:接收语音数据并判断唤醒词声源大致方位;微波雷达:获得声源的位置参数(方向和距离数据),即获得声源数据;调整麦克风阵列状态:根据声源位置数据增强或抑制相应方向上的麦克风;基于LSTM的远场声学模型:通过声源数据和语音数据训练的声学模型,将语音数据转化成对应的文本数据。
例如:参见图8所示的例子,训练LSTM声学模型后,采集实时语音即对空调的语音进行实时监测,采集语音数据和声源数据;数据预处理:可以与步骤1中训练LSTM声学模型的数据预处理方式相同;基于LSTM的远场声学模型:利用训练LSTM声学模型训练出的LSTM远场声学模型进行语音识别;语音文本数据:根据模型的语音识别结果,得到对应的文本数据。在微波雷达技术的基础,结合LSTM深度学习算法模型,利用声源和语音数据训练出远场语音识别模型,将语音数据准确高效地转化成文本数据,提供满足用户需求、高识别率化的远场语音***。
由此,通过对采集到的第二语音数据进行预处理,可以提升第二语音数据本身的精准性和可靠性;进而利用预设的远场语音识别模型对预处理后得到的语音信息进行远场语音识别,可以保证对第二语音数据识别的精准性和可靠性。
在一个可选实施方式中,还可以包括:训练得到预设的远场语音识别模型 的过程。
下面结合图5所示本申请的方法中训练得到预设的远场语音识别模型的一实施例流程示意图,进一步说明训练得到预设的远场语音识别模型的具体过程,可以包括:步骤S510和步骤S520。
步骤S510,收集语音数据及其声源数据。该语音数据,可以包括:语音唤醒词和/或语音指令。例如:声源数据,可以包括声源的位置参数(方向和距离数据);语音数据,可以是通过调整麦克风阵列状态后的麦克风接收到的语音数据。
步骤S520,对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。其中,收集语音数据及其声源数据的操作、对所述语音数据及其声源数据进行预处理的操作、以及利用LSTM模型进行训练的操作,由语音设备在本地侧接收云端处理后的反馈信息。例如:在云处理端,结合声源数据和语音数据,训练并使用基于LSTM的远场声学模型。
例如:将前端信息处理技术和后端语音识别技术相结合,即:通过结合微波雷达技术获取声源的位置参数,将音频数据和位置数据(如声源的位置参数)相结合,通过适用于长音频数据和音频数据上下文的LSTM算法训练出远场声学模型。通过微波雷达技术对的各种周边环境进行自动识别,利用深度学习算法提升远场语音识别准确率。
例如:参见图8所示的例子,训练LSTM声学模型,具体可以包括:收集上述历史数据(声源和语音的历史记录数据);数据预处理:对数据进行处理缺失值、标准化、降噪等预处理;通过LSTM模型的输入层将数据载入模型中;LSTM模型的中间处理层;文本输出层:将语音数据转化的文本数据输出,得到基于LSTM的远场声学模型。
由此,通过预先收集语音数据及其声源数据并进行预处理后利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型,可以方便利用该远场语音识别模型对第二语音数据进行远场语音识别,且识别效率高、识别效果好。
经大量的试验验证,采用本实施例的技术方案,通过微波雷达技术对的各种周边环境进行自动识别,利用深度学习算法可以提升远场语音识别准确率,用户体验好。
根据本申请的实施例,还提供了对应于语音识别方法的一种语音识别装置。参见图6所示本申请的装置的一实施例的结构示意图。该语音识别装置可以包括:获取单元102和识别单元104。
在一个可选例子中,获取单元102,可以用于获取第一语音数据。该获取单元102的具体功能及处理参见步骤S110。
其中,该第一语音数据,可以包括:语音唤醒词,当然该第一语音数据也可以包括语音指令。所述语音唤醒词,为可以用于唤醒语音设备的语音数据。
由此,通过获取多种形式的第一语音数据,可以方便在不同场合下基于第一语音数据调整第二语音数据的采集状态,提升用户使用的便捷性和通用性。
可选地,所述获取单元102获取第一语音数据,可以包括:所述获取单元102,具体还可以用于获取由语音采集设备采集得到的第一语音数据。
由此,通过语音采集设备采集第一语音数据的方式获取第一语音数据,使得对第一语音数据的获取便捷且精准。
在一个可选例子中,所述获取单元102,还可以用于根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据。该获取单元102的具体功能及处理还参见步骤S120。
例如:在设备端处理平台上,首先利用麦克风阵列定位唤醒词声源大致方位(例如:通过麦克风阵列通过声波的方向判断唤醒词语音声源位置),再用微波雷达模块对声源进行精确定位,采集距离和方向(即声源的距离和方向)数据;然后根据该数据打开和关闭麦克风阵列模块上相对应位置上的麦克风;最后采集远场的音频数据。
其中,该第二语音数据,可以包括:语音指令,当然该第二语音数据也可以包括下一语音唤醒词。所述语音指令,为可以用于控制语音设备的语音数据。
由此,通过获取多种形式的第二语音数据,可以方便用户的多种语音控制需求,灵活且便捷。
具体地,所述获取单元102获取第一语音数据的操作、所述获取单元102根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行。
由此,通过在语音设备的本地侧执行获取第一语音数据和第二语音数据、 并基于第一语音数据调整第二语音数据的采集状态的操作,可以提升获取的精准性和可靠性,并提升处理效率。
可选地,所述获取单元102根据所述第一语音数据调整第二语音数据的采集状态,可以包括:
所述获取单元102,具体还可以用于确定发送所述第一语音数据的声源的位置信息。该获取单元102的具体功能及处理还参见步骤S210。
更可选地,所述获取单元102确定发送所述第一语音数据的声源的位置信息,可以包括:
所述获取单元102,具体还可以用于利用语音采集设备确定发送所述第一语音数据的声源的方向。该获取单元102的具体功能及处理还参见步骤S310。
例如:利用麦克风阵列对唤醒词语音进行粗略地识别声源方向,可以包括:语音识别***是需要先通过语音唤醒词(如:某某空调)来唤醒设备。本申请的方案中可以首先通过麦克风阵列技术获取唤醒词语音声源的大致方向。
所述获取单元102,具体还可以用于利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息。该获取单元102的具体功能及处理还参见步骤S320。
其中,所述位置定位设备,可以包括:微波雷达模块,当然该位置定位设备也可以包括其它定位模块,从而,可以在微波雷达定位技术的基础上,解决复杂环境下的远场语音识别问题。所述位置信息,可以包括:距离和方向。
例如:利用微波雷达技术实时精确计算声源的距离和方向,可以包括:微波雷达通过发送装置发出微波信号,信号在遇到物体后会产生反射,通过接收装置收反射回来的微波信号,就可以得到环境里的物***置、大小、形状等数据。本申请的方案中可以利用该技术获得声源(发出声音的人)的位置数据。
由此,通过语音采集设备确定第一语音数据的声源的方向,进一步基于该方向利用位置定位设备对该声源进行定位从而确定该声源的位置信息,使得对第一语音数据的声源的位置信息的确定精准而可靠。
所述获取单元102,具体还可以用于基于该位置信息,增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度以调整所述语音采集设备对第二语音数据的采集状态。其中, 语音采集设备的采集状态,可以包括:语音采集设备的采集强度。该获取单元102的具体功能及处理还参见步骤S220。
例如:在远场环境下,采用云(即云端处理平台)和端(即设备处理端或设备端处理平台)相结合的处理方式。在设备处理端,首先利用麦克风阵列对唤醒词语音进行粗略地识别声源方向的基础上,然后利用微波雷达技术实时精确计算声源的距离和方向,再用边缘计算技术实时调控麦克风阵列的状态。
由此,通过基于第一语音数据的声源的位置信息对语音采集设备对第二语音数据的采集强度进行调整,有利于提升对第二语音数据的采集的便捷性和可靠性。
更可选地,所述获取单元102增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度、和/或抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度的操作,可以包括以下至少一种调整情形。
第一种调整情形:所述获取单元102增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,可以包括:所述获取单元102,具体还可以用于在所述语音采集设备可以包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风,和/或增加所述麦克风阵列中该位置信息上的麦克风的开启数量。
第二种调整情形:所述获取单元102抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,可以包括:所述获取单元102,具体还可以用于关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风,和/或减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
例如:用边缘计算技术实时调控麦克风阵列的状态,包括:本申请的麦克风阵列里有多个麦克风设备,在通过唤醒词获得声源大致方位的基础上,通过前端设备控制麦克风的状态。例如:麦克风阵列有不同方向上的4个麦克效果,获得了声源的位置在正前方,这时可以增强该方向上的麦克风接收效果(接收音频信号的能力),抑制其他方向上的麦克风接收效果,从而去除其他方向上的噪声。
由此,通过基于第一语音数据的位置信息对语音采集设备在不同位置上的 采集强度进行增强或降低,可以提升语音采集设备对第二语音数据采集的精准性和可靠性,进而有利于提升语音识别和语音控制的精准性和可靠性。
可选地,所述获取单元102获取第二语音数据,可以包括:所述获取单元102,具体还可以用于获取由调整采集状态后的语音采集设备采集得到的第二语音数据。
由此,通过语音采集设备采集第二语音数据的方式获取第二语音数据,使得对第二语音数据的获取便捷且精准。
其中,所述语音采集设备,可以包括:麦克风阵列。在所述麦克风阵列中,设置有可以用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
由此,通过使用麦克风阵列获取语音数据,获取的方式灵活,且获取的结果可靠。
在一个可选例子中,识别单元104,可以用于利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息,以控制语音设备按该语义信息执行该第二语音数据。该识别单元104的具体功能及处理参见步骤S130。其中,该语义信息,可以包括:语义文本数据。例如:文本数据,可以是通过训练的声学模型将语音数据转化成得到的文本数据。
例如:在云端处理平台上,首先利用人工采集和标注的声源和音频数据库训练LSTM声学模型,得到远场语音识别模型;然后,通过实时采集语音数据,在上述模型上进行实时远场语音识别;最后得到复杂环境下、高准确率的语音文本数据。在复杂场景下,可以基于微波雷达技术,准确高效地进行远场语音识别。
由此,通过基于第一语音数据调整第二语音数据的采集状态后再获取第二语音数据,可以保证对第二语音数据获取的精准性和可靠性;并利用预设的远场语音识别模型对第二语音数据进行远场语音识别,可以提升对第二语音数据进行远场语音识别的效率和效果。
具体地,所述识别单元104利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息。
由此,通过云端执行利用预设的远场语音识别模型对获取的第二语音数据 进行远场语音识别的操作后再将操作结果反馈至语音设备的本地侧,一方面可以提升数据处理的效率和存储可靠性,另一方面可以减轻语音设备的本地侧的数据处理和存储压力,进而提升语音设备进行语音控制的便捷性和可靠性。
可选地,所述识别单元104利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,可以包括:
所述识别单元104,具体还可以用于对采集到的第二语音数据进行预处理,得到语音信息。该识别单元104的具体功能及处理还参见步骤S410。
所述识别单元104,具体还可以用于再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理。该预处理,可以包括:缺失值、标准化、降噪等预处理。该识别单元104的具体功能及处理还参见步骤S420。
其中,所述远场语音识别模型,可以包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
例如:在图7所示的***中,麦克风阵列:接收语音数据并判断唤醒词声源大致方位;微波雷达:获得声源的位置参数(方向和距离数据),即获得声源数据;调整麦克风阵列状态:根据声源位置数据增强或抑制相应方向上的麦克风;基于LSTM的远场声学模型:通过声源数据和语音数据训练的声学模型,将语音数据转化成对应的文本数据。
例如:参见图8所示的例子,训练LSTM声学模型后,采集实时语音即对空调的语音进行实时监测,采集语音数据和声源数据;数据预处理:可以与步骤1中训练LSTM声学模型的数据预处理方式相同;基于LSTM的远场声学模型:利用训练LSTM声学模型训练出的LSTM远场声学模型进行语音识别;语音文本数据:根据模型的语音识别结果,得到对应的文本数据。在微波雷达技术的基础,结合LSTM深度学习算法模型,利用声源和语音数据训练出远场语音识别模型,将语音数据准确高效地转化成文本数据,提供满足用户需求、高识别率化的远场语音***。
由此,通过对采集到的第二语音数据进行预处理,可以提升第二语音数据本身的精准性和可靠性;进而利用预设的远场语音识别模型对预处理后得到的语音信息进行远场语音识别,可以保证对第二语音数据识别的精准性和可靠性
在一个可选实施方式中,还可以包括:训练得到预设的远场语音识别模型的过程,具体可以如下:
所述获取单元102,还可以用于收集语音数据及其声源数据。该语音数据,可以包括:语音唤醒词和/或语音指令。该获取单元102的具体功能及处理还参见步骤S510。例如:声源数据,可以包括声源的位置参数(方向和距离数据);语音数据,可以是通过调整麦克风阵列状态后的麦克风接收到的语音数据。
所述识别单元104,还可以用于对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。其中,收集语音数据及其声源数据的操作、对所述语音数据及其声源数据进行预处理的操作、以及利用LSTM模型进行训练的操作,由语音设备在本地侧接收云端处理后的反馈信息。该识别单元104的具体功能及处理还参见步骤S520。例如:在云处理端,结合声源数据和语音数据,训练并使用基于LSTM的远场声学模型。
例如:将前端信息处理技术和后端语音识别技术相结合,即:通过结合微波雷达技术获取声源的位置参数,将音频数据和位置数据(如声源的位置参数)相结合,通过适用于长音频数据和音频数据上下文的LSTM算法训练出远场声学模型。通过微波雷达技术对的各种周边环境进行自动识别,利用深度学习算法提升远场语音识别准确率。
例如:参见图8所示的例子,训练LSTM声学模型,具体可以包括:收集上述历史数据(声源和语音的历史记录数据);数据预处理:对数据进行处理缺失值、标准化、降噪等预处理;通过LSTM模型的输入层将数据载入模型中;LSTM模型的中间处理层;文本输出层:将语音数据转化的文本数据输出,得到基于LSTM的远场声学模型。
由此,通过预先收集语音数据及其声源数据并进行预处理后利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型,可以方便利用该远场语音识别模型对第二语音数据进行远场语音识别,且识别效率高、识别效果好。
由于本实施例的装置所实现的处理及功能基本相应于前述图1至图5所示的方法的实施例、原理和实例,故本实施例的描述中未详尽之处,可以参见前述实施例中的相关说明,在此不做赘述。
经大量的试验验证,采用本申请的技术方案,通过利用微波雷达技术定位声源位置,根据声源位置调整麦克风阵列的采集状态,并进一步利用基于LSTM深度学习算法训练得到的远场语音识别模型对语音数据进行远场识别,可以保 证高识别率,从而满足复杂环境下的使用需求。
根据本申请的实施例,还提供了对应于语音识别装置的一种空调。该空调可以包括:以上所述的语音识别装置。
考虑到传统的远场语音识别技术主要利用麦克风阵列和声源定位,可以较好地实现远场距离拾音,解决噪声、混响、回声带来的影响,但对于复杂环境下的人声检测和断句问题,处理效果较差。例如:一般声学模型只能针对音频数据进行降噪和识别处理,在复杂环境下,模型的准确度不够高。
例如:前端的麦克风阵列技术通过增加麦克风数量来提升语音识别效果,但由于产品价格和尺寸的限制,麦克风的个数及每个麦克风的间距都是有限的,并且每个麦克风的功能效果相同,这个会接收到多个方向上的噪音,降低语音识别准确率,故该技术的性价比较低,能够分辨的方向范围较小。
例如:现有的声学模型主要是用来处理一些近场短音频数据,并只能对语音音频数据进行处理,无法感知和获取声源的位置参数(距离和方向),故只能适应特定环境下的语音识别。而且现有的声学模型属于后端语音识别处理技术,没有和前端的信号处理设备或算法紧密结合。
在一个可选实施方式中,本申请的方案,在微波雷达定位技术的基础上,解决复杂环境下的远场语音识别问题。
其中,民用微波雷达及其传感器是一个新兴的高科技产业,在测速、车流量检测、物位计等方面已有广发应用。LSTM(Long Short-Term Memory,长短期记忆网络)是一种时间递归神经网络***,可以用来处理和预测时间序列中间隔和延迟相对较长的重要事件。
具体地,本申请的方案,将前端信息处理技术和后端语音识别技术相结合,即:通过结合微波雷达技术获取声源的位置参数,将音频数据和位置数据(如声源的位置参数)相结合,通过适用于长音频数据和音频数据上下文的LSTM算法训练出远场声学模型。
其中,长音频是指时间长的音频,是相对于短音频的,现在的大部分技术适用于短音频处理,本申请的方案可以实现长音频的处理,从而可以提取出更多的信息。
在一个可选例子中,在远场环境下,采用云(即云端处理平台)和端(即 设备处理端或设备端处理平台)相结合的处理方式。在设备处理端,首先利用麦克风阵列对唤醒词语音进行粗略地识别声源方向的基础上,然后利用微波雷达技术实时精确计算声源的距离和方向,再用边缘计算技术实时调控麦克风阵列的状态。在云处理端,结合声源数据和语音数据,训练并使用基于LSTM的远场声学模型。
可选地,利用麦克风阵列对唤醒词语音进行粗略地识别声源方向,可以包括:语音识别***是需要先通过语音唤醒词(如:某某空调)来唤醒设备。本申请的方案中可以首先通过麦克风阵列技术获取唤醒词语音声源的大致方向。
可选地,利用微波雷达技术实时精确计算声源的距离和方向,可以包括:微波雷达通过发送装置发出微波信号,信号在遇到物体后会产生反射,通过接收装置收反射回来的微波信号,就可以得到环境里的物***置、大小、形状等数据。本申请的方案中可以利用该技术获得声源(发出声音的人)的位置数据。
可选地,用边缘计算技术实时调控麦克风阵列的状态,包括:本申请的麦克风阵列里有多个麦克风设备,在通过唤醒词获得声源大致方位的基础上,通过前端设备控制麦克风的状态。例如:麦克风阵列有不同方向上的4个麦克效果,获得了声源的位置在正前方,这时可以增强该方向上的麦克风接收效果(接收音频信号的能力),抑制其他方向上的麦克风接收效果,从而去除其他方向上的噪声。
例如:增强该方向上的麦克风接收效果(接收音频信号的能力),抑制其他方向上的麦克风接收效果,主要可以包括:打开和关闭麦克风阵列中不同方向上的麦克风,也有通过过滤麦克风接收的音频。例如:通过控制开关和过滤某个方向上的麦克风,从而使该方向的上接收少量的音频。
可见,本申请的方案中,远场语音识别是一项技术难点,通过微波雷达技术对的各种周边环境进行自动识别,利用深度学习算法提升远场语音识别准确率。
在一个可选具体实施方式中,可以结合图7和图8所示的例子,对本申请的方案的具体实现过程进行示例性说明。
在一个可选具体例子中,本申请的方案中,主要包含微波雷达定位、深度学***台和云端处理平台。
具体地,在图7所示的***中,麦克风阵列:接收语音数据并判断唤醒词声源大致方位;微波雷达:获得声源的位置参数(方向和距离数据),即获得声源数据;调整麦克风阵列状态:根据声源位置数据增强或抑制相应方向上的麦克风;基于LSTM的远场声学模型:通过声源数据和语音数据训练的声学模型,将语音数据转化成对应的文本数据。其中,声源数据,可以包括声源的位置参数(方向和距离数据);语音数据,可以是通过调整麦克风阵列状态后的麦克风接收到的语音数据;文本数据,可以是通过训练的声学模型将语音数据转化成得到的文本数据。
参见图7所示的例子,本申请的方案的实现原理,可以包括:
一方面,在设备端处理平台上,首先利用麦克风阵列定位唤醒词声源大致方位(例如:通过麦克风阵列通过声波的方向判断唤醒词语音声源位置),再用微波雷达模块对声源进行精确定位,采集距离和方向(即声源的距离和方向)数据;然后根据该数据打开和关闭麦克风阵列模块上相对应位置上的麦克风;最后采集远场的音频数据。
另一方面,在云端处理平台上,首先利用人工采集和标注的声源和音频数据库训练LSTM声学模型,得到远场语音识别模型;然后,通过实时采集语音数据,在上述模型上进行实时远场语音识别;最后得到复杂环境下、高准确率的语音文本数据。
其中,主要是标注声源位置数据,是为了在训练中做标记。
在一个可选具体例子中,本申请的方案中,在复杂场景下,可以基于微波雷达技术,准确高效地进行远场语音识别。其中,参见图8所示的例子,本申请的方案中基于微波雷达的远场语音识别的具体过程,可以包括:
步骤1、训练LSTM声学模型,具体可以包括:
步骤11、收集上述历史数据(声源和语音的历史记录数据)。
步骤12、数据预处理:对数据进行处理缺失值、标准化、降噪等预处理。
例如:处理缺失值是对可能缺失的数据项,用总体均值或其他方法进行填充。标准化是通过数据归一化或同量度化让不同数据的同类化,如让音频数据和位置数据可以变成同一类数据。
步骤13、通过LSTM模型的输入层将数据载入模型中。
步骤14、LSTM模型的中间处理层。
其中,中间处理层是神经网络的一个处理过程,这是LSTM算法里固定的操作。例如:中间处理层通过输入、遗忘、输出的方法来更新网络中的细胞状态和细胞间连接的权值。
步骤15、文本输出层:将语音数据转化的文本数据输出,得到基于LSTM的远场声学模型。
步骤2、实时语音:对空调的语音进行实时监测。
步骤3、采集语音数据和声源数据。
步骤4、数据预处理:可以与步骤1中训练LSTM声学模型的数据预处理方式相同。
步骤5、基于LSTM的远场声学模型:利用步骤1中训练LSTM声学模型训练出的LSTM远场声学模型进行语音识别。
步骤6、语音文本数据:根据模型的语音识别结果,得到对应的文本数据。
可见,对于复杂环境下的语音设备使用过程中,需要准确、高效、实时的远场识别技术,解决噪声、混响、回声带来的影响,提高用户体验效果,迫切需要一种智能化、高效化、准确性高、可靠性强的远场识别***。而目前市场上的远场识别主要是以单一化麦克风阵列和声学模型的形式,进行简单的识别,复杂场景下的识别准确度不高,暂时没有一种针对远场语音的高准确度、可靠的识别方法。而本申请的方案,在微波雷达技术的基础,结合LSTM深度学习算法模型,利用声源和语音数据训练出远场语音识别模型,将语音数据准确高效地转化成文本数据,提供满足用户需求、高识别率化的远场语音***。
例如:语音转化成文本数据后,对文本数据进行提取和识别,才能控制相应的设备。这是语音识别***的必备步骤。
由于本实施例的空调所实现的处理及功能基本相应于前述图6所示的装置的实施例、原理和实例,故本实施例的描述中未详尽之处,可以参见前述实施例中的相关说明,在此不做赘述。
经大量的试验验证,采用本申请的技术方案,通过在微波雷达技术的基础,结合LSTM深度学习算法模型,利用声源和语音数据训练出远场语音识别模型,将语音数据准确高效地转化成文本数据,可以提升远场语音识别效果。
根据本申请的实施例,还提供了对应于语音识别方法的一种存储介质。该存储介质,可以包括:所述存储介质中存储有多条指令;所述多条指令,用于 由处理器加载并执行以上所述的语音识别方法。
由于本实施例的存储介质所实现的处理及功能基本相应于前述图1至图5所示的方法的实施例、原理和实例,故本实施例的描述中未详尽之处,可以参见前述实施例中的相关说明,在此不做赘述。
经大量的试验验证,采用本申请的技术方案,通过将前端信息处理技术和后端语音识别技术相结合,即:通过结合微波雷达技术获取声源的位置参数,将音频数据和位置数据相结合,通过适用于长音频数据和音频数据上下文的LSTM算法训练出远场声学模型,可以缩短响应时间短和提升降噪效果
根据本申请的实施例,还提供了对应于语音识别方法的一种空调。该空调,可以包括:处理器,用于执行多条指令;存储器,用于存储多条指令;其中,所述多条指令,用于由所述存储器存储,并由所述处理器加载并执行以上所述的语音识别方法。
由于本实施例的空调所实现的处理及功能基本相应于前述图1至图5所示的方法的实施例、原理和实例,故本实施例的描述中未详尽之处,可以参见前述实施例中的相关说明,在此不做赘述。
经大量的试验验证,采用本申请的技术方案,通过利用麦克风阵列对唤醒词语音进行粗略地识别声源方向的基础上,利用微波雷达技术实时精确计算声源的距离和方向,再用边缘计算技术实时调控麦克风阵列的状态,结合声源数据和语音数据,训练并使用基于LSTM的远场声学模型,可以提升远场识别效率和降噪效果,缩短响应时间。
综上,本领域技术人员容易理解的是,在不冲突的前提下,上述各有利方式可以自由地组合、叠加。
以上所述仅为本申请的实施例而已,并不用于限制本申请,对于本领域的技术人员来说,本申请可以有各种更改和变化。凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的权利要求范围之内。

Claims (23)

  1. 一种语音识别方法,包括:
    获取第一语音数据;
    根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据;
    利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息。
  2. 根据权利要求1所述的方法,其中,
    该第一语音数据,包括:语音唤醒词;所述语音唤醒词,为用于唤醒语音设备的语音数据;
    该第二语音数据,包括:语音指令;所述语音指令,为用于控制语音设备的语音数据。
  3. 根据权利要求1或2所述的方法,其中,
    获取第一语音数据的操作、根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行;
    利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息执行。
  4. 根据权利要求1或2所述的方法,其中,
    获取第一语音数据,包括:
    获取由语音采集设备采集得到的第一语音数据;
    获取第二语音数据,包括:
    获取由调整采集状态后的语音采集设备采集得到的第二语音数据;
    其中,所述语音采集设备,包括:麦克风阵列;在所述麦克风阵列中,设置有用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
  5. 根据权利要求1-3之一所述的方法,其中,根据所述第一语音数据调整第二语音数据的采集状态,包括:在确定发送所述第一语音数据的声源的位置信息之后,执行以下至少之一;
    增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据 的采集强度;
    抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度。
  6. 根据权利要求5所述的方法,其中,
    确定发送所述第一语音数据的声源的位置信息,包括:
    利用语音采集设备确定发送所述第一语音数据的声源的方向;
    利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息;
    其中,所述位置定位设备,包括:微波雷达模块;所述位置信息,包括:距离和方向。
  7. 根据权利要求6所述的方法,其中,
    增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,包括以下至少之一:
    在所述语音采集设备包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风;
    在所述语音采集设备包括麦克风阵列的情况下,增加所述麦克风阵列中该位置信息上的麦克风的开启数量。
  8. 根据权利要求6所述的方法,其中,
    抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,包括以下至少之一:
    关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风;
    减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
  9. 根据权利要求1-5之一所述的方法,其中,利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,包括:
    对采集到的第二语音数据进行预处理,得到语音信息;
    再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理;
    其中,所述远场语音识别模型,包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
  10. 根据权利要求1-9之一所述的方法,其中,还包括:
    收集语音数据及其声源数据;
    对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。
  11. 一种语音识别装置,包括:
    获取单元,设置为获取第一语音数据;
    所述获取单元,还设置为根据所述第一语音数据调整第二语音数据的采集状态,并基于调整后的采集状态获取第二语音数据;
    识别单元,设置为利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,以得到与获取的第二语音数据对应的语义信息。
  12. 根据权利要求11所述的装置,其中,
    该第一语音数据,包括:语音唤醒词;所述语音唤醒词,为用于唤醒语音设备的语音数据;
    该第二语音数据,包括:语音指令;所述语音指令,为用于控制语音设备的语音数据;
  13. 根据权利要求11或12所述的装置,其中,
    获取第一语音数据的操作、根据所述第一语音数据调整第二语音数据的采集状态的操作、以及基于调整后的采集状态获取第二语音数据的操作,在语音设备的本地侧执行;
    利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别的操作,由语音设备在本地侧接收云端处理后的反馈信息执行。
  14. 根据权利要求11或12所述的装置,其中,
    所述获取单元获取第一语音数据,包括:
    获取由语音采集设备采集得到的第一语音数据;
    所述获取单元获取第二语音数据,包括:
    获取由调整采集状态后的语音采集设备采集得到的第二语音数据;
    其中,所述语音采集设备,包括:麦克风阵列;在所述麦克风阵列中,设置有用于对一个以上方向上的语音数据进行采集的一个以上麦克风。
  15. 根据权利要求11-14之一所述的装置,其中,所述获取单元根据所述第一语音数据调整第二语音数据的采集状态,包括:在确定发送所述第一语音数据的声源的位置信息之后,执行以下至少之一:
    增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度;
    抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度。
  16. 根据权利要求15所述的装置,其中,
    所述获取单元确定发送所述第一语音数据的声源的位置信息,包括:
    利用语音采集设备确定发送所述第一语音数据的声源的方向;
    利用位置定位设备基于该方向对所述声源进行定位,得到所述声源的位置信息;
    其中,所述位置定位设备,包括:微波雷达模块;所述位置信息,包括:距离和方向。
  17. 根据权利要求16所述的装置,其中,
    所述获取单元增强获取第一语音数据的语音采集设备对该位置信息上的第二语音数据的采集强度,包括以下至少之一:
    在所述语音采集设备包括麦克风阵列的情况下,开启所述麦克风阵列中该位置信息上的麦克风;
    在所述语音采集设备包括麦克风阵列的情况下,增加所述麦克风阵列中该位置信息上的麦克风的开启数量。
  18. 根据权利要求16所述的装置,其中,
    所述获取单元抑制采集第一语音数据的语音采集设备对除该位置信息以外的其它位置上的第二语音数据的采集强度,包括以下至少之一:
    关闭所述麦克风阵列上除该位置信息以外的其它位置上的麦克风;
    减少所述麦克风阵列上除该位置信息以外的其它位置上的开启数量。
  19. 根据权利要求11-18之一所述的装置,其中,所述识别单元利用预设的远场语音识别模型对获取的第二语音数据进行远场语音识别,包括:
    对采集到的第二语音数据进行预处理,得到语音信息;
    再利用预设的远场语音识别模型,对预处理后的语音信息进行远场语音识别处理;
    其中,所述远场语音识别模型,包括:基于LSTM算法进行深度学习训练得到的远场声学模型。
  20. 根据权利要求11-19之一所述的装置,其中,还包括:
    所述获取单元,还用于收集语音数据及其声源数据;
    所述识别单元,还用于对所述语音数据及其声源数据进行预处理后,利用LSTM模型进行训练,得到基于LSTM的远场语音识别模型。
  21. 一种空调,包括:如权利要求11-20任一所述的语音识别装置。
  22. 一种存储介质,所述存储介质中存储有多条指令;所述多条指令,用于由处理器加载并执行如权利要求1-10任一所述的语音识别方法。
  23. An air conditioner, comprising:
    a processor, configured to execute multiple instructions; and
    a memory, configured to store the multiple instructions;
    wherein the multiple instructions are configured to be stored by the memory, and to be loaded by the processor to execute the voice recognition method according to any one of claims 1-10.
PCT/CN2019/110107 2019-02-21 2019-10-09 Voice recognition method and apparatus, storage medium, and air conditioner WO2020168727A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP19915991.4A EP3923273B1 (en) 2019-02-21 2019-10-09 Voice recognition method and device, storage medium, and air conditioner
ES19915991T ES2953525T3 (es) 2019-02-21 2019-10-09 Método y dispositivo de reconocimiento de voz, medio de almacenamiento y acondicionador de aire
US17/407,443 US11830479B2 (en) 2019-02-21 2021-08-20 Voice recognition method and apparatus, and air conditioner

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910130206.9A CN109767769B (zh) Voice recognition method and apparatus, storage medium, and air conditioner
CN201910130206.9 2019-02-21

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/407,443 Continuation US11830479B2 (en) 2019-02-21 2021-08-20 Voice recognition method and apparatus, and air conditioner

Publications (1)

Publication Number Publication Date
WO2020168727A1 true WO2020168727A1 (zh) 2020-08-27

Family

ID=66457008

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/110107 WO2020168727A1 (zh) Voice recognition method and apparatus, storage medium, and air conditioner

Country Status (6)

Country Link
US (1) US11830479B2 (zh)
EP (1) EP3923273B1 (zh)
CN (1) CN109767769B (zh)
ES (1) ES2953525T3 (zh)
PT (1) PT3923273T (zh)
WO (1) WO2020168727A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793596A (zh) * 2021-09-15 2021-12-14 Shenzhen Jinbeiqi Electronics Co., Ltd. Headphone far-field interaction system based on speech enhancement technology

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220291328A1 (en) * 2015-07-17 2022-09-15 Muhammed Zahid Ozturk Method, apparatus, and system for speech enhancement and separation based on audio and radio signals
CN109767769B (zh) * 2019-02-21 2020-12-22 Gree Electric Appliances, Inc. of Zhuhai Voice recognition method and apparatus, storage medium, and air conditioner
CN110223686A (zh) * 2019-05-31 2019-09-10 Lenovo (Beijing) Co., Ltd. Voice recognition method, voice recognition apparatus, and electronic device
CN110415694A (zh) * 2019-07-15 2019-11-05 Shenzhen Yihui Software Co., Ltd. Method for multiple smart speakers to work collaboratively
CN110992974B (zh) * 2019-11-25 2021-08-24 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method, apparatus and device, and computer-readable storage medium
CN110931019B (zh) * 2019-12-06 2022-06-21 Guangzhou Guoyin Intelligent Technology Co., Ltd. Public-security voice data collection method, apparatus, device, and computer storage medium
CN110807909A (zh) * 2019-12-09 2020-02-18 Shenzhen Cloud Life Technology Co., Ltd. Method for combined radar and voice processing control
WO2021131532A1 (ja) * 2019-12-27 2021-07-01 Iris Ohyama Inc. Blower
CN111688580B (zh) * 2020-05-29 2023-03-14 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Method and apparatus for sound pickup by a smart rearview mirror
CN111755006B (zh) * 2020-07-28 2023-05-30 Banma Network Technology Co., Ltd. Directional sound-collecting apparatus and in-vehicle voice triggering method
CN112700771A (zh) * 2020-12-02 2021-04-23 Gree Electric Appliances, Inc. of Zhuhai Air conditioner, stereo voice-control recognition method, computer device, storage medium, and terminal
CN112562671A (zh) * 2020-12-10 2021-03-26 Shanghai Leiang Cloud Intelligent Technology Co., Ltd. Voice control method and apparatus for a service robot

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6122613A (en) * 1997-01-30 2000-09-19 Dragon Systems, Inc. Speech recognition using multiple recognizers (selectively) applied to the same input sample
CN103095911A (zh) * 2012-12-18 2013-05-08 AISpeech Co., Ltd. Method and system for finding a mobile phone through voice wake-up
CN107464564A (zh) * 2017-08-21 2017-12-12 Tencent Technology (Shenzhen) Co., Ltd. Voice interaction method, apparatus and device
CN107862060A (zh) * 2017-11-15 2018-03-30 Jilin University Semantic recognition apparatus and method for tracking a target person
CN109767769A (zh) * 2019-02-21 2019-05-17 Gree Electric Appliances, Inc. of Zhuhai Voice recognition method and apparatus, storage medium, and air conditioner

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040003070A1 (en) * 2002-06-26 2004-01-01 Clarus Systems, Inc. Centrally controlled end-to-end service quality monitoring system and method in a distributed environment
US8892443B2 (en) * 2009-12-15 2014-11-18 At&T Intellectual Property I, L.P. System and method for combining geographic metadata in automatic speech recognition language and acoustic models
EP2916567B1 (en) * 2012-11-02 2020-02-19 Sony Corporation Signal processing device and signal processing method
US9747917B2 (en) * 2013-06-14 2017-08-29 GM Global Technology Operations LLC Position directed acoustic array and beamforming methods
CN105825855A (zh) * 2016-04-13 2016-08-03 Lenovo (Beijing) Co., Ltd. Information processing method and main terminal device
US20170366897A1 (en) * 2016-06-15 2017-12-21 Robert Azarewicz Microphone board for far field automatic speech recognition
JP6496942B2 (ja) * 2016-07-26 2019-04-10 Sony Corporation Information processing device
US10431211B2 (en) * 2016-07-29 2019-10-01 Qualcomm Incorporated Directional processing of far-field audio
US10467509B2 (en) * 2017-02-14 2019-11-05 Microsoft Technology Licensing, Llc Computationally-efficient human-identifying smart assistant computer
KR20190084789A (ko) * 2018-01-09 2019-07-17 LG Electronics Inc. Electronic device and control method therefor
CN111742091B (zh) * 2018-02-23 2023-07-18 Samsung Electronics Co., Ltd. Washing machine and control method therefor
CN108538305A (zh) * 2018-04-20 2018-09-14 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method, apparatus and device, and computer-readable storage medium
CN109119071A (zh) * 2018-09-26 2019-01-01 Gree Electric Appliances, Inc. of Zhuhai Training method and apparatus for a speech recognition model
CN109215656A (zh) * 2018-11-14 2019-01-15 Gree Electric Appliances, Inc. of Zhuhai Voice remote control apparatus and method, storage medium, and electronic apparatus
CN109360579A (zh) * 2018-12-05 2019-02-19 Tuke Electric Power Technology (Tianjin) Co., Ltd. Voice control apparatus and system for a charging pile


Also Published As

Publication number Publication date
PT3923273T (pt) 2023-07-07
EP3923273A4 (en) 2022-07-13
US11830479B2 (en) 2023-11-28
EP3923273B1 (en) 2023-06-21
EP3923273A1 (en) 2021-12-15
ES2953525T3 (es) 2023-11-14
US20210383795A1 (en) 2021-12-09
CN109767769B (zh) 2020-12-22
CN109767769A (zh) 2019-05-17

Similar Documents

Publication Publication Date Title
WO2020168727A1 (zh) Voice recognition method and apparatus, storage medium, and air conditioner
CN106910500B (zh) Method and device for voice control of a device with a microphone array
CN107464564B (zh) Voice interaction method, apparatus and device
WO2018032930A1 (zh) Voice interaction control method and apparatus for a smart device
CN105009204B (zh) Speech recognition power management
WO2020083110A1 (zh) Speech recognition method and apparatus, and method and apparatus for training a speech recognition model
CN106440192B (zh) Household appliance control method, apparatus and system, and smart air conditioner
US11295760B2 (en) Method, apparatus, system and storage medium for implementing a far-field speech function
US9875081B2 (en) Device selection for providing a response
CN106782563B (zh) Smart home voice interaction system
WO2020048431A1 (zh) Voice processing method, electronic device, and display device
CN202110564U (zh) Smart home voice control system combined with a video channel
CN108681440A (zh) Volume control method and system for a smart device
CN109286875A (zh) Method, apparatus, electronic device and storage medium for directional sound pickup
CN109920419B (zh) Voice control method and apparatus, electronic device, and computer-readable medium
CN109166575A (zh) Interaction method and apparatus for a smart device, smart device, and storage medium
CN108966077A (zh) Speaker volume control method and system
CN107464565A (zh) Far-field voice wake-up method and device
CN106782519A (zh) Robot
CN108297108A (zh) Spherical following robot and following control method therefor
CN107045308A (zh) Intelligent interactive service robot
CN110767228B (zh) Sound acquisition method, apparatus, device and system
CN208367199U (zh) Separated microphone array
CN107452381B (zh) Multimedia voice recognition apparatus and method
CN109994129A (zh) Voice processing system, method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 19915991
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
ENP Entry into the national phase
    Ref document number: 2019915991
    Country of ref document: EP
    Effective date: 20210908