CN109545196B - Speech recognition method, device and computer-readable storage medium


Info

Publication number: CN109545196B
Authority: CN (China)
Prior art keywords: voice, user, sound, model, background sound
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN201811644306.5A
Other languages: Chinese (zh)
Other versions: CN109545196A (en)
Inventor: 袁晖 (Yuan Hui)
Current and original assignee: Shenzhen Ikmak Tech Co ltd (the listed assignees may be inaccurate)
Application filed by Shenzhen Ikmak Tech Co ltd
Priority and filing date: 2018-12-29 (priority to CN201811644306.5A)
Publication of CN109545196A: 2019-03-29
Application granted; publication of CN109545196B: 2022-11-29

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208Noise filtering
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • G10L2015/0635Training updating or merging of old and new templates; Mean values; Weighting
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Manipulator (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech recognition method comprising the following steps: monitoring voice information sent by a user; denoising the voice information and recognizing the user's voice instruction according to a pre-stored voice model; collecting background sound from the user's surrounding environment; recognizing the background sound according to a pre-stored background sound model and determining the user's location from the recognition result; and combining the voice instruction with the location information to form and output a final recognition result. The invention also discloses a speech recognition device and a computer-readable storage medium. The invention can improve the speech recognition accuracy of intelligent terminal equipment.

Description

Speech recognition method, speech recognition device and computer-readable storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, and computer-readable storage medium.
Background
With the development of science and technology and the progress of computing, speech recognition is now applied in many fields of daily life and industry, and the prior art offers various speech recognition methods and devices for human-computer interaction. However, existing speech recognition technology can generally only recognize the pronunciation of typical speakers; when a user's pronunciation is inaccurate or a language impairment exists, recognition becomes difficult or unreliable. Take the elderly as an example: with increasing age, certain language disorders, such as aphasia, occur at a high rate in the elderly population. Aphasia patients may have impaired language expression when speaking, reading, or writing, although their intelligence is not affected. Existing speech recognition technology has difficulty recognizing the speech of people with aphasia, or its accuracy drops sharply, which makes related applications impractical; for example, a companion robot that cannot recognize its user's speech can hardly fulfill its purpose.
In view of the above, it is necessary to provide a speech recognition technology that improves recognition accuracy and expands the application range of speech recognition.
Disclosure of Invention
The main object of the present invention is to provide a speech recognition method that improves the accuracy of speech recognition and expands the application range of speech recognition technology.
In order to achieve the above object, the present invention provides a speech recognition method, including:
monitoring voice information sent by a user;
denoising the voice information and identifying a voice instruction of a user according to a pre-stored voice model;
collecting background sounds of the surrounding environment of a user;
identifying the background sound according to a pre-stored background sound model, and determining the position of a user according to an identification result;
and combining the voice command and the position information to form a final recognition result and outputting the final recognition result.
Preferably, the denoising the voice information and recognizing the voice instruction of the user according to a pre-stored voice model includes:
acquiring characteristic parameters of plosives, fricatives and nasals in the user's voice information and comparing the characteristic parameters with corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below a preset range, performing enhancement processing on that plosive, fricative or nasal.
Preferably, the method further comprises:
and linearly analyzing the voice change of the user according to the collected voice information at a plurality of preset moments, and forming and storing a new voice model according to the analysis result.
Preferably, the recognizing the background sound according to a pre-stored background sound model, and determining the location of the user according to the recognition result includes:
and comparing the collected sound emitted by the preset sound source and the background sound in the environment with the background sound model respectively, and determining the position of the user according to the comparison result.
Preferably, the method may further include: displaying the recognition result in graphic and text form for the user to select or confirm and outputting it to an external device after the user's selection or confirmation, and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
The present invention also provides a voice recognition apparatus, comprising:
the voice acquisition module is used for monitoring voice information sent by a user;
the first processing module is used for carrying out denoising processing on the voice information and identifying a voice instruction of a user according to a pre-stored voice model;
the background sound monitoring module is used for acquiring background sounds of the surrounding environment of the user;
the second processing module is used for identifying the background sound according to a pre-stored background sound model and determining the position of the user according to an identification result;
and the output module is used for combining the voice command and the position information to form a final recognition result and outputting the final recognition result.
Preferably, the voice acquisition module is configured to:
acquiring characteristic parameters of plosives, fricatives and nasals in the user's voice information and comparing the characteristic parameters with corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below a preset range, performing enhancement processing on that plosive, fricative or nasal.
Preferably, the above apparatus further comprises:
and the updating module is used for linearly analyzing the voice change of the user according to the collected voice information at a plurality of preset moments, and forming and storing a new voice model according to the analysis result.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method.
According to the invention, the user's voice instruction is extracted and combined with recognition of the background sound in the environment; when the user's pronunciation is incomplete or unclear, the user's real intention is inferred with the help of the environment recognition result, thereby improving speech recognition accuracy.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating the steps of comparing the voice information of the user with the voice model to obtain the voice command of the user in the voice recognition method according to the embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a first processing module and a second processing module of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, the present invention provides a speech recognition method, including:
Step S10: monitor voice information sent by a user. In the embodiment of the invention, a voice monitoring device can be arranged on an intelligent device such as a mobile phone, tablet or robot to collect the voice information sent by the user.
Step S20: denoise the voice information and recognize the user's voice instruction according to a pre-stored voice model. When the voice information is collected, denoising is performed by a voice chip to obtain the voice instruction issued by the user.
Step S30: collect background sound from the user's surrounding environment. After the voice instruction is obtained, a second voice monitoring device in the intelligent device, such as a mobile phone, tablet or robot, is woken up to detect and receive the background sound in the environment.
Step S40: recognize the background sound according to a pre-stored background sound model, and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip to judge from the volume whether the user is outdoors or indoors, and further, from the volume or type of sound, whether the user is in a bedroom, a living room or a kitchen.
Step S50: combine the voice instruction and the location information to form and output a final recognition result. When both the voice instruction and the location information are clear, the recognition result is given and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's real intention is inferred with the help of the environment recognition result, thereby improving speech recognition accuracy.
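To make the flow concrete, the following is a minimal, runnable sketch of how steps S10 to S50 might be composed. The dictionary lookups, the pinyin-like key "kai kong tiao", and all helper names are illustrative assumptions only; the patent itself matches against trained acoustic models rather than strings.

```python
# Toy stand-ins for the pre-stored models; real systems would use
# trained acoustic models, not string lookup.
VOICE_MODEL = {"kai kong tiao": "turn on the air conditioner"}
BACKGROUND_MODEL = {"low hum and bedding rustle": "bedroom",
                    "range hood drone": "kitchen"}

def denoise(audio: str) -> str:
    # Placeholder for the voice chip's denoising step (S20).
    return audio.strip()

def recognize(user_audio: str, ambient_audio: str) -> dict:
    command = VOICE_MODEL.get(denoise(user_audio), "unknown")   # S10/S20
    location = BACKGROUND_MODEL.get(ambient_audio, "unknown")   # S30/S40
    return {"command": command, "location": location}           # S50: fuse

print(recognize(" kai kong tiao ", "low hum and bedding rustle"))
# {'command': 'turn on the air conditioner', 'location': 'bedroom'}
```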
The following application scenario further illustrates the speech recognition solution of the present invention:
Scene one: an elderly user turns on the bedroom air conditioner by saying "air conditioner" or "turn on the air conditioner" to the robot. The specific process is as follows:
Step A: the user sends a voice command to the accompanying robot;
Step B: the first sound receiving device of the accompanying robot receives the user's voice signal;
Step C: the microprocessor of the accompanying robot analyzes the signal and obtains a first recognition result: turn on the air conditioner; at the same time, the second sound receiving device is woken up to receive background sound signals from the surrounding environment;
Step D: the microprocessor of the accompanying robot analyzes the background sound and obtains a second recognition result: bedroom;
Step E: the microprocessor of the accompanying robot performs comprehensive analysis and obtains the final recognition result: turn on the air conditioner in the bedroom;
Step F: the network device of the accompanying robot sends an operation command to the bedroom air conditioner according to the position information preset in the storage device, so that the air conditioner starts to operate.
In the embodiment of the present invention, before performing the above steps, the method may further include: training and modeling the user's voice information and the background sound to form and store a voice model and a background sound model. In the embodiment of the invention, the voices of people with pronunciation difficulties or impairments are collected for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, indoor and outdoor background sounds are collected and modeled to identify the environment where the user is located; for example, background sounds of several bedroom environments can be collected at different time periods, trained, modeled and stored, and in actual application the background sound model is retrieved for comparison to determine the user's environment.
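As a hedged illustration of this training step, the sketch below models each room's background sound with a Gaussian mixture over MFCC frames and classifies a new recording by likelihood. The model family, the 16 kHz sample rate, and the file names are assumptions; the patent does not specify how the background sound model is built.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def train_room_model(wav_paths, n_components=8):
    """Fit one GMM per room over MFCC frames from recordings taken at
    different time periods (per the description above)."""
    feats = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=16000)
        feats.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)
    return GaussianMixture(n_components=n_components).fit(np.vstack(feats))

def classify_room(models: dict, wav_path: str) -> str:
    y, sr = librosa.load(wav_path, sr=16000)
    X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    # Pick the room whose model assigns the highest average log-likelihood.
    return max(models, key=lambda room: models[room].score(X))

# Hypothetical usage (file names are placeholders):
# models = {"bedroom": train_room_model(["bedroom_am.wav", "bedroom_pm.wav"]),
#           "kitchen": train_room_model(["kitchen_am.wav", "kitchen_pm.wav"])}
# print(classify_room(models, "query.wav"))
```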
It can be understood that the foregoing step of denoising the voice information to recognize the voice instruction of the user includes:
denoising the received voice information to obtain the voice information of the user;
comparing the voice information of the user with a voice model to obtain a voice instruction of the user;
the step of identifying the background sound according to the pre-stored background sound model and determining the position of the user according to the identification result comprises the following steps:
and denoising the acquired background sound, and determining the position of the user according to the denoised background sound to obtain position information.
Considering that the background sound models of some environments may be very similar or identical, sound sources that identify each environment can be placed in different environments in advance and collected in real time by the voice collection module. The voice chip compares the collected sound emitted by the preset sound source and the background sound of the environment with the background sound models respectively, and determines the user's location from the comparison result. For example, a wind chime placed in the living room or the kitchen can mark that environment; when the user issues a voice command there, the voice chip can recognize the location from the background sound produced by the environment's sound source.
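A small sketch of this preset-sound-source idea follows. It assumes each room's marker sound (e.g. its wind chime) is stored as a waveform template and that the comparison is peak normalized cross-correlation; the patent only states that the collected sounds are "compared" with the background sound model.

```python
import numpy as np

def peak_ncc(x: np.ndarray, t: np.ndarray) -> float:
    """Peak normalized cross-correlation of template t against signal x."""
    x = x - x.mean()
    t = t - t.mean()
    corr = np.correlate(x, t, mode="valid")
    # Sliding energy of x over windows the size of t, for normalization.
    energy = np.sqrt(np.convolve(x * x, np.ones(len(t)), mode="valid"))
    return float((corr / (energy * np.linalg.norm(t) + 1e-9)).max())

def locate_user(captured: np.ndarray, templates: dict) -> str:
    return max(templates, key=lambda room: peak_ncc(captured, templates[room]))

chime = np.sin(2 * np.pi * 880 * np.arange(800) / 16000)        # toy template
room_templates = {"living room": chime, "kitchen": np.random.randn(800)}
ambient = np.concatenate([0.1 * np.random.randn(4000), chime])  # chime audible
print(locate_user(ambient, room_templates))                     # living room
```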
Specifically, the solution of the present invention can be further understood through the following application scenario:
Scene two: an elderly user turns on the light in his or her current environment by saying "turn on the light". The specific process is as follows:
Step A1: the user sends a voice command to the accompanying robot: turn on the light;
Step B1: the first sound receiving device of the accompanying robot receives the user's voice signal;
Step C1: the microprocessor of the accompanying robot retrieves the voice model and analyzes the signal to obtain a first recognition result: turn on the light; at the same time, the second sound receiving device is woken up to receive background sound signals from the surrounding environment;
Step D1: since the user is located between two environments (such as a living room and a kitchen), the microprocessor of the accompanying robot acquires the sounds emitted by the sound sources of the living room and the kitchen, and from the differences between these sounds obtains a second recognition result: living room;
Step E1: the microprocessor of the accompanying robot performs comprehensive analysis and obtains the final recognition result: turn on the light in the living room;
Step F1: the network device of the accompanying robot sends a command to the living-room lamp switch according to the position information preset in the storage device, so that the switch executes the turn-on command.
The extraction and selection of acoustic features is an important link in speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, whose aim is to enable the pattern classifier to discriminate better.
Due to the time-varying nature of speech signals, feature extraction must be performed on short segments of the speech signal, i.e., short-time analysis. Each segment is considered a stationary analysis interval, commonly referred to as a frame; the frame-to-frame offset typically takes 1/2 or 1/3 of the frame length. The signal is usually pre-emphasized to boost the high frequencies and windowed to avoid edge effects of the short speech segment.
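The sketch below implements this short-time analysis as described: pre-emphasis, overlapping frames with a hop of half the frame length, and a Hamming window. The 25 ms frame length at 16 kHz is a common choice, not a requirement of the text.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=200, alpha=0.97):
    """Split a signal into windowed, overlapping frames (hop = frame_len/2)."""
    x = np.append(x[0], x[1:] - alpha * x[:-1])    # pre-emphasis boosts high frequencies
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)          # window softens frame edges

frames = frame_signal(np.random.randn(16000))      # 1 s at 16 kHz
print(frames.shape)                                # (79, 400): 79 frames of 25 ms
```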
Some commonly used acoustic features (a single combined code sketch follows this list):
(1) Linear prediction coefficients (LPC): linear predictive analysis starts from the human phonation mechanism; through study of a short-tube cascade model of the vocal tract, the system's transfer function is considered to match the form of an all-pole digital filter, so that the signal at time n can be estimated from a linear combination of the signals at the preceding few times.
(2) Cepstral coefficients: using homomorphic processing, the Discrete Fourier Transform (DFT) of the speech signal is taken, its logarithm is computed, and the inverse transform (iDFT) is then applied to obtain the cepstral coefficients.
(3) Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP): unlike LPC, which is derived from the study of the human vocal mechanism, MFCC and PLP are acoustic features motivated by research results on the human auditory system.
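The combined sketch referenced above computes one toy example of each feature family: LPC via the autocorrelation (Yule-Walker) equations, the real cepstrum via DFT, log, and inverse DFT, and MFCC via librosa. Orders, frame sizes, and the random test signals are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
import librosa

x = np.hamming(400) * np.random.randn(400)   # one windowed frame (stand-in for speech)

# (1) LPC: predict x[n] from the previous p samples (Yule-Walker equations).
p = 12
r = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation r[0..N-1]
a = solve_toeplitz(r[:p], r[1:p + 1])              # p prediction coefficients

# (2) Real cepstrum: DFT -> log magnitude -> inverse DFT.
cep = np.fft.ifft(np.log(np.abs(np.fft.fft(x)) + 1e-12)).real

# (3) MFCC: mel filter bank + DCT, here via librosa on a longer signal.
y = np.random.randn(16000)
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13)
print(a.shape, cep.shape, mfcc.shape)              # (12,) (400,) (13, 32)
```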
Chinese acoustic features: taking Mandarin pronunciation as an example, the pronunciation of a character can be cut into two parts, the initial and the final. During pronunciation, the transition from initial to final is gradual rather than instantaneous; for this reason, Right-Context-Dependent Initial and Final modeling (RCDIF) is used as the analysis method, so that the correct syllable can be identified more accurately.
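As a toy illustration of the initial/final split only (RCDIF modeling itself, which conditions the initial on the following final, is not shown), Mandarin initials form a small closed set, so a pinyin syllable can be cut by longest-prefix matching:

```python
# The 21 standard Mandarin initials, longest first so "zh" wins over "z".
INITIALS = sorted(["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
                   "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s"],
                  key=len, reverse=True)

def split_syllable(pinyin: str) -> tuple:
    """Cut a pinyin syllable into (initial, final) by longest-prefix match."""
    for ini in INITIALS:
        if pinyin.startswith(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin              # zero-initial syllable, e.g. "an"

print(split_syllable("zhong"))     # ('zh', 'ong')
print(split_syllable("an"))        # ('', 'an')
```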
Considering that elderly people or people with pronunciation difficulties find it hard to produce accurate pronunciation, consonants are divided into the following four categories according to their characteristics and modeled:
Plosive: when speaking, the lips close and the airflow is then released, producing a burst-like sound. The amplitude of the sound first decreases to a minimum (representing the closed lips) and then increases sharply.
Fricative: when sounding, the tongue is held close to the hard palate to form a narrow channel; as the airflow passes through this channel it becomes turbulent, and the resulting friction produces the sound. Because the airflow is output steadily during a fricative, the amplitude varies less than in a plosive.
Affricate: this type of sound has both plosive and fricative characteristics. The main articulation resembles a fricative: the tongue is held against the hard palate and friction is produced as the airflow passes. The channel is tighter, however, so the airflow bursts out instantaneously, producing a plosive-like onset.
Nasal: when speaking, the soft palate is lowered; the airflow from the trachea is then blocked from entering the oral cavity and diverted into the nasal cavity, so that the nasal cavity and the oral cavity resonate.
Referring to fig. 2, in an embodiment of the present invention, when the user speaks, the characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired and compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is below the preset range, enhancement processing is performed on that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate. For example, after the characteristic parameters of the plosives, fricatives and nasals in the user's voice information are acquired, they are compared with the corresponding preset models; when the amplitude of a plosive, fricative or nasal is within the preset range, analysis continues with the next characteristic parameter, until all parameters have been compared and adjusted.
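The sketch below mirrors this compare-and-enhance loop. The preset amplitude floors and the simple gain-based enhancement are assumptions; the patent specifies only that amplitudes are compared against a preset range and that sounds falling below it are enhanced.

```python
import numpy as np

# Assumed per-class amplitude floors (the patent's "preset range").
PRESET_MIN_AMPLITUDE = {"plosive": 0.30, "fricative": 0.20, "nasal": 0.25}

def enhance_weak_consonants(segments):
    """segments: list of (consonant_class, samples) pairs; boost any segment
    whose peak amplitude falls below its class's preset floor."""
    out = []
    for cls, samples in segments:
        peak = float(np.abs(samples).max())
        floor = PRESET_MIN_AMPLITUDE.get(cls)
        if floor is not None and peak < floor:
            samples = samples * (floor / max(peak, 1e-9))  # boost to the floor
        out.append((cls, samples))
    return out

weak_plosive = 0.1 * np.sin(np.linspace(0, 20, 200))   # peak 0.1 < 0.30: enhanced
loud_nasal = 0.8 * np.sin(np.linspace(0, 20, 200))     # peak 0.8 >= 0.25: untouched
for cls, seg in enhance_weak_consonants([("plosive", weak_plosive),
                                         ("nasal", loud_nasal)]):
    print(cls, round(float(np.abs(seg).max()), 2))      # plosive 0.3, nasal 0.8
```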
In the embodiment of the invention, to further improve recognition accuracy, the recognition result can be displayed in graphic and text form for the user to select or confirm and output to an external device after the user's selection or confirmation, and/or broadcast to the user by voice with the user's feedback received in return. For example, when the user's voice instruction is to turn on the air conditioner but the voice chip cannot recognize it with certainty, several candidate results (turn on the air conditioner, turn on the air-conditioning fan, turn on the fan) can be sent to the user interaction module; the user confirms via the touch screen, and the command to turn on the air conditioner is then executed after confirmation.
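A minimal sketch of this confirm-before-execute interaction is given below; the candidate list and the console I/O are illustrative stand-ins for the touch-screen user interaction module.

```python
def confirm_and_execute(candidates, execute):
    """Show candidate recognition results and run only the confirmed one."""
    for i, c in enumerate(candidates, 1):
        print(f"{i}. {c}")
    choice = input("Select the intended command (number): ")
    if choice.isdigit() and 1 <= int(choice) <= len(candidates):
        execute(candidates[int(choice) - 1])

confirm_and_execute(
    ["turn on the air conditioner", "turn on the air-conditioning fan",
     "turn on the fan"],
    execute=lambda cmd: print("executing:", cmd))
```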
In a preferred embodiment of the present invention, the method further includes:
linearly analyzing the change in the user's voice from voice information collected at a plurality of preset times, and forming and storing a new voice model from the analysis result. For example, the language ability of an elderly user gradually declines; several periods can be preset, the change in the user's voice judged from the different pronunciations of the same voice command collected within a period, and the voice model updated to adapt.
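One way to read "linearly analyzing the voice change" is a linear trend fitted to a pronunciation feature collected at the preset times, as sketched below; the scalar feature (mean pitch of the same command) and the day-based timeline are assumptions, since the patent speaks only generally of linear analysis.

```python
import numpy as np

times = np.array([0.0, 30.0, 60.0, 90.0])          # days of collection
pitch = np.array([182.0, 178.0, 173.0, 171.0])     # mean F0 of the same command (Hz)

slope, intercept = np.polyfit(times, pitch, deg=1) # linear trend of the drift
expected_today = slope * 120.0 + intercept         # extrapolate to day 120
print(round(expected_today, 1))                    # updated model expectation
```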
The present invention also provides a speech recognition apparatus for implementing the above method, and as shown in fig. 3, the speech recognition apparatus includes:
the voice acquisition module 10 is used for intercepting voice information sent by a user; in the embodiment of the present invention, the voice collecting module 10 may be an intercepting device such as a microphone in an intelligent terminal such as a mobile phone, a tablet computer, or a robot, and is configured to collect voice information sent by a user.
The first processing module 20 is configured to perform denoising processing on the voice information and recognize a voice instruction of a user according to a pre-stored voice model; the first processing module 20 may be a voice processing chip, and when the voice information is collected, performs denoising processing on the voice information to obtain a voice instruction sent by a user.
The background sound monitoring module 30 is used for collecting background sound from the user's surrounding environment. The background sound monitoring module 30 may be one or more listening devices, such as microphones arranged at different positions, configured to collect the background sound emitted in the environment. After the voice command is obtained, the intelligent device, such as a mobile phone, tablet or robot, may wake up the background sound monitoring module 30 through its chip to detect and receive the background sound in the environment.
The second processing module 40 is configured to recognize the background sound according to a pre-stored background sound model and determine the user's location from the recognition result. For example, the background sound is analyzed by the voice chip to judge from the volume whether the user is outdoors or indoors, and further, from the volume or type of sound, whether the user is in a bedroom, a living room or a kitchen.
The output module 50 is configured to combine the voice instruction and the location information to form and output a final recognition result. When both the voice instruction and the location information are clear, the recognition result is given and output. In the embodiment of the invention, when the user's pronunciation is incomplete or unclear, the user's real intention is inferred with the help of the environment recognition result, thereby improving speech recognition accuracy.
In a preferred embodiment, the speech recognition apparatus further includes:
a model building module 60, used for training and modeling the user's voice information and the background sound, forming and storing a voice model and a background sound model. In the embodiment of the present invention, the model building module 60 collects the voices of people with pronunciation difficulties or impairments for training and modeling, so that the user's pronunciation can be correctly recognized in application. In addition, the model building module 60 collects and models indoor and outdoor background sounds to identify the user's environment; for example, background sounds of several bedroom environments can be collected at different time periods, trained, modeled and stored, and in actual application the background sound model is retrieved for comparison to determine the user's environment.
Referring to fig. 4, in one embodiment, the first processing module 20 includes:
a denoising unit 21, configured to perform denoising processing on the received voice information to obtain the user's voice information;
a voice instruction obtaining unit 22, configured to compare the user's voice information with a voice model to obtain the user's voice instruction;
and the second processing module 40 includes:
a position information obtaining unit 41, configured to perform denoising processing on the collected background sound and determine the user's position from the denoised background sound to obtain the position information.
Preferably, the voice instruction obtaining unit 22 is configured to:
acquire characteristic parameters of the plosives, fricatives and nasals in the user's voice information and compare them with corresponding preset models; and when the amplitude of a plosive, fricative or nasal is below a preset range, perform enhancement processing on that sound. In this way, the user's voice command can be recognized accurately even if the user's pronunciation is inaccurate.
In an embodiment of the present invention, the apparatus may further include:
an updating module 70, used for linearly analyzing the change in the user's voice from voice information collected at a plurality of preset times, and forming and storing a new voice model from the analysis result. For example, the language ability of an elderly user gradually declines; several periods can be preset, and the updating module 70 judges the change in the user's voice from the different pronunciations of the same voice command collected within a period and updates the voice model to adapt.
The present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the foregoing speech recognition method. The computer-readable storage medium provided by the present invention can store a program implementing the aforementioned speech recognition method and be carried and loaded on a computer device; such a computer device can be an intelligent terminal such as a mobile phone, tablet computer or service robot.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A method of speech recognition, the method comprising:
monitoring voice information sent by a user;
denoising the voice information and identifying a voice instruction of a user according to a pre-stored voice model;
collecting background sounds of the surrounding environment of a user;
identifying the background sound according to a pre-stored background sound model, and determining the position of a user according to an identification result, wherein the background sound model is trained and modeled based on the background sounds of a plurality of environments corresponding to different time periods;
and combining the voice command and the position information, and when the pronunciation of the user is not complete enough or clear enough, performing sentence completion on the voice command by using the position information to form a final recognition result and outputting the final recognition result.
2. The method of claim 1, wherein denoising the voice information and recognizing the user's voice command according to the pre-stored voice model comprises:
acquiring characteristic parameters of plosives, fricatives and nasals in the user's voice information and comparing the characteristic parameters with corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below a preset range, performing enhancement processing on that plosive, fricative or nasal.
3. The method of claim 1 or 2, further comprising:
and linearly analyzing the voice change of the user according to the collected voice information at a plurality of preset moments, and forming and storing a new voice model according to the analysis result.
4. The method according to claim 3, wherein the recognizing the background sound according to the pre-stored background sound model and determining the position of the user according to the recognition result comprises:
and comparing the collected sound emitted by the preset sound source and the background sound in the environment with the background sound model respectively, and determining the position of the user according to the comparison result.
5. The method of claim 4, further comprising: displaying the recognition result in graphic and text form for the user to select or confirm and outputting the recognition result to an external device after the user selects or confirms, and/or broadcasting the recognition result to the user by voice and receiving feedback information from the user.
6. A speech recognition apparatus, comprising:
the voice acquisition module is used for monitoring voice information sent by a user;
the first processing module is used for carrying out denoising processing on the voice information and identifying a voice instruction of a user according to a pre-stored voice model;
the background sound monitoring module is used for acquiring background sounds of the surrounding environment of the user;
the second processing module is used for identifying the background sound according to a pre-stored background sound model and determining the position of the user according to an identification result, wherein the background sound model is trained and modeled based on the background sounds of a plurality of environments corresponding to different time periods;
and the output module is used for combining the voice command and the position information, and when the pronunciation of the user is not complete enough or clear enough, the position information is used for completing the sentence of the voice command to form a final recognition result and outputting the final recognition result.
7. The speech recognition device of claim 6, wherein the speech acquisition module is configured to:
acquiring characteristic parameters of plosives, fricatives and nasals in the user's voice information and comparing the characteristic parameters with corresponding preset models;
and when the amplitude of a plosive, fricative or nasal is below a preset range, performing enhancement processing on that plosive, fricative or nasal.
8. The speech recognition device according to claim 6 or 7, further comprising:
and the updating module is used for linearly analyzing the voice change of the user according to the collected voice information at a plurality of preset moments, and forming and storing a new voice model according to the analysis result.
9. The speech recognition device of claim 6, wherein the second processing module is configured to:
compare the collected sound emitted by the preset sound source in the environment and the background sound with the background sound model respectively, and determine the position of the user according to the comparison result.
10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the speech recognition method of any one of claims 1 to 5.
CN201811644306.5A, filed 2018-12-29 (priority date 2018-12-29): Speech recognition method, device and computer readable storage medium. Status: Active. Granted publication: CN109545196B (en).

Priority Applications (1)

Application number: CN201811644306.5A
Priority date: 2018-12-29 | Filing date: 2018-12-29
Title: Speech recognition method, device and computer readable storage medium


Publications (2)

CN109545196A (en): published 2019-03-29
CN109545196B (en): published 2022-11-29 (granted)

Family

ID: 65831549

Family Applications (1)

Application number: CN201811644306.5A (Active)
Title: Speech recognition method, device and computer readable storage medium
Priority date: 2018-12-29 | Filing date: 2018-12-29

Country Status (1)

Country: CN | Publication: CN109545196B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party

CN109974225A * (priority 2019-04-09, published 2019-07-05, 珠海格力电器股份有限公司): Air conditioning control method, device, storage medium and air conditioner
CN110473547B * (priority 2019-07-12, published 2021-07-30, 云知声智能科技股份有限公司): Speech recognition method
CN110867184A * (priority 2019-10-23, published 2020-03-06, 张家港市祥隆五金厂): Voice intelligent terminal equipment

Citations (2)

* Cited by examiner, † Cited by third party

CN102918591A * (priority 2010-04-14, published 2013-02-06, 谷歌公司 / Google): Geotagged environmental audio for enhanced speech recognition accuracy
CN108877773A * (priority 2018-06-12, published 2018-11-23, 广东小天才科技有限公司): Speech recognition method and electronic device

Family Cites Families (7)

* Cited by examiner, † Cited by third party

US8762143B2 * (priority 2007-05-29, published 2014-06-24, AT&T Intellectual Property II, L.P.): Method and apparatus for identifying acoustic background environments based on time and speed to enhance automatic speech recognition
WO2014182453A2 * (priority 2013-05-06, published 2014-11-13, Motorola Mobility Llc): Method and apparatus for training a voice recognition model database
CN104143342B * (priority 2013-05-15, published 2016-08-17, 腾讯科技(深圳)有限公司): Voiced/unvoiced sound decision method, device and speech synthesis system
CN105448292B * (priority 2014-08-19, published 2019-03-12, 北京羽扇智信息科技有限公司): Scene-based real-time speech recognition system and method
CN105913039B * (priority 2016-04-26, published 2020-08-18, 北京光年无限科技有限公司): Interactive processing method and device for dialogue data based on vision and voice
CN106941506A * (priority 2017-05-17, published 2017-07-11, 北京京东尚科信息技术有限公司): Data processing method and device based on biological characteristics
CN107742517A * (priority 2017-10-10, published 2018-02-27, 广东中星电子有限公司): Abnormal sound detection method and device


Also Published As

CN109545196A (en): published 2019-03-29


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant