CN109545196A - Audio recognition method, device and computer readable storage medium - Google Patents

Audio recognition method, device and computer readable storage medium

Info

Publication number
CN109545196A
Authority
CN
China
Prior art keywords
user
sound
background sound
speech
model
Prior art date
Legal status
Granted
Application number
CN201811644306.5A
Other languages
Chinese (zh)
Other versions
CN109545196B (en)
Inventor
袁晖
Current Assignee
Shenzhen Comexe Ikang Science And Technology Co Ltd
Original Assignee
Shenzhen Comexe Ikang Science And Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Comexe Ikang Science And Technology Co Ltd filed Critical Shenzhen Comexe Ikang Science And Technology Co Ltd
Priority to CN201811644306.5A priority Critical patent/CN109545196B/en
Publication of CN109545196A publication Critical patent/CN109545196A/en
Application granted granted Critical
Publication of CN109545196B publication Critical patent/CN109545196B/en
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Manipulator (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech recognition method, the method comprising: listening for voice information uttered by a user; denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model; collecting the background sound of the user's surrounding environment; identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result; and combining the voice instruction with the location information to form and output a final recognition result. The invention further discloses a speech recognition device and a computer-readable storage medium. The invention improves the speech recognition accuracy of intelligent terminals.

Description

Audio recognition method, device and computer readable storage medium
Technical field
The present invention relates to the field of speech recognition, and more particularly to a speech recognition method, device and computer-readable storage medium.
Background technique
With the development of science and technology and the progress of computer technology, speech recognition has been applied to many fields of daily life and industry. The prior art offers a variety of speech recognition methods and devices for human-computer interaction, which have contributed greatly to the economic development of society. However, existing speech recognition technology is generally only able to recognize the pronunciation of unimpaired speakers; when a user's pronunciation is inaccurate or the user has a speech impairment, existing technology either fails to recognize the speech or recognizes it inaccurately. Take the elderly as an example: with advancing age, language-related disorders such as aphasia occur at a high rate among the elderly. A patient with aphasia may still be able to speak, read or write, yet has impaired language expression, while intelligence is not affected by the disorder. Existing speech recognition technology struggles to recognize the speech of people with aphasia, or its accuracy drops sharply, so related applications are hard to realize; for example, when speech recognition is applied to a companion robot, the robot can hardly fulfil its role because it has difficulty recognizing such speech.
In view of this, it is necessary to provide a speech recognition technology that improves the accuracy of speech recognition and extends the range of applications of speech recognition technology.
Summary of the invention
The main purpose of the present invention is to provide a speech recognition method intended to improve the accuracy of speech recognition and extend the range of applications of speech recognition technology.
To achieve the above goal, the present invention provides a speech recognition method, the method comprising:
listening for voice information uttered by a user;
denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
collecting the background sound of the user's surrounding environment;
identifying the background sound according to a pre-stored background sound model, and determining the user's location from the recognition result;
combining the voice instruction with the location information to form and output a final recognition result.
Preferably, denoising the voice information and identifying the user's voice instruction according to the pre-stored speech model comprises:
obtaining the plosive, fricative and nasal characteristic parameters in the user's voice information and comparing them with the corresponding preset models;
applying enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
Preferably, the above method further comprises:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
Preferably, identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
comparing the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determining the user's location from the comparison result.
Preferably, the above method may further comprise: displaying the recognition result in graphic and text form for the user to select or confirm, and outputting the recognition result to an external device after the user selects or confirms it; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
The present invention also provides a speech recognition device, comprising:
a voice collection module for listening for voice information uttered by a user;
a first processing module for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
a background sound listening module for collecting the background sound of the user's surrounding environment;
a second processing module for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result;
an output module for combining the voice instruction with the location information to form and output a final recognition result.
Preferably, the voice collection module is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models;
apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
Preferably, the above device further comprises:
an update module for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
The present invention further provides a computer-readable storage medium storing a computer program of computer-executable instructions which, when executed by a processor, implements the aforementioned speech recognition method.
By extracting the user's voice instruction and the background sound, the present invention combines the identification of the user's voice instruction with the recognition of the background sound in the environment; when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
Detailed description of the invention
Fig. 1 is a flow diagram of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of comparing the user's voice information with the speech model to obtain the user's voice instruction in the speech recognition method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 4 is a structural diagram of the first processing module and the second processing module of the speech recognition device according to an embodiment of the present invention.
Specific embodiment
The realization of the object, the functional characteristics and the advantages of the present invention are further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, the present invention provides a speech recognition method, the method comprising:
Step S10: listen for voice information uttered by a user. In the embodiment of the present invention, a voice listening device may be arranged in a smart device such as a mobile phone, tablet or robot to collect the voice information uttered by the user.
Step S20: denoise the voice information and identify the user's voice instruction according to a pre-stored speech model. When voice information is collected, it is denoised by a speech chip to obtain the voice instruction issued by the user.
Step S30: collect the background sound of the user's surrounding environment. After the voice instruction is obtained, a second voice listening device in the smart device such as a mobile phone, tablet or robot is woken up to detect and receive the background sound in the environment.
Step S40: identify the background sound according to a pre-stored background sound model, and determine the user's location from the recognition result. For example, the speech chip analyses the background sound and judges from the volume of the sound whether the user is outdoors or indoors; further, it may judge from the volume or type of the sound whether the user is in a bedroom, living room or kitchen.
Step S50: combine the voice instruction with the location information to form and output a final recognition result. When both the voice instruction and the location information are clear, the recognition result is produced and output. In the embodiment of the present invention, when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
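The following Python sketch shows how steps S10 to S50 fit together. It is a minimal outline, not the patented implementation: the listener callables and the two model objects are hypothetical placeholders for the listening devices and the pre-stored models described above.

```python
# Minimal outline of steps S10-S50; all names are hypothetical placeholders.

def recognize(listen_user, listen_background, speech_model, background_model):
    raw_voice = listen_user()                       # S10: listen for the user's voice

    voice = speech_model.denoise(raw_voice)         # S20: denoise the voice information...
    instruction = speech_model.identify(voice)      # ...and match the pre-stored speech model

    ambient = listen_background()                   # S30: wake the second listening device
                                                    #      and capture the background sound

    location = background_model.identify(ambient)   # S40: infer the location, e.g. "bedroom"

    return {"instruction": instruction,             # S50: combine both recognition results
            "location": location}                   #      into the final output
```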
The following application scenario further illustrates the detailed speech recognition scheme of the present invention:
Scenario one: an elderly user only needs to say "air conditioner" or "turn on the air conditioner" for the robot to turn on the air conditioner in the bedroom. The detailed process is as follows:
Step A: the user issues a voice command to the companion robot;
Step B: the first sound receiver of the companion robot receives the user's voice signal;
Step C: the microprocessor of the companion robot analyses the signal and obtains the first recognition result: turn on the air conditioner; at the same time it wakes up the second sound receiver to receive the background sound signal from the surrounding environment;
Step D: the microprocessor of the companion robot analyses the background sound and obtains the second recognition result: bedroom;
Step E: the microprocessor of the companion robot comprehensively analyses the two results and obtains the final recognition result: turn on the air conditioner in the bedroom;
Step F: the network device of the companion robot, according to the preset position information in the storage device, issues an operation command to the air conditioner in the bedroom, causing it to start and run (a sketch of this step follows the list).
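As a sketch of step F alone, the snippet below maps the combined recognition result onto a device address drawn from a table standing in for the robot's preset position information; the registry contents and addresses are invented for illustration, and the actual network call is elided.

```python
# Hypothetical registry standing in for the "preset position information"
# held in the robot's storage device; the addresses are invented.
DEVICE_REGISTRY = {
    ("air_conditioner", "bedroom"): "10.0.0.21",
    ("light", "living_room"): "10.0.0.35",
}

def dispatch(device, location, action):
    """Step F: issue the operation command to the device at the recognized location."""
    address = DEVICE_REGISTRY.get((device, location))
    if address is None:
        raise LookupError(f"no {device} registered in {location}")
    print(f"sending '{action}' to {device} at {address}")

dispatch("air_conditioner", "bedroom", "power_on")
```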
In the embodiment of the present invention, before all the steps are executed, the method may further comprise: training and modelling the user's voice information and the background sound to form and store a speech model and a background sound model. In the embodiment of the present invention, the voices of people with inaccurate pronunciation or speech impairments are collected, trained and modelled so that the user's pronunciation can be correctly recognized in application. In addition, indoor and outdoor background sounds are collected and modelled to identify the user's environment; for example, the background sound of several bedroom environments can be collected at different times of day, trained, modelled and stored, and in practical application the background sound model can be retrieved for comparison, thereby determining the user's environment.
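One possible realization of the background sound modelling, assuming MFCC features with one Gaussian mixture per room, is sketched below; the patent does not prescribe a model family, so librosa and scikit-learn are assumed dependencies chosen for illustration.

```python
# Sketch: one background sound model per environment, trained on recordings
# collected at different times of day. MFCC features and Gaussian mixtures
# are an assumed choice; the patent names no specific model family.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_model(wav_paths, n_components=8):
    frames = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=16000)
        frames.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)
    return GaussianMixture(n_components=n_components).fit(np.vstack(frames))

def identify_location(wav_path, models):
    """Score the ambient recording against every stored model and return
    the name of the best-matching environment."""
    y, sr = librosa.load(wav_path, sr=16000)
    X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    return max(models, key=lambda name: models[name].score(X))

# models = {"bedroom": train_background_model([...]),
#           "living_room": train_background_model([...])}
```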
It can be understood that the aforementioned step of denoising the voice information to identify the user's voice instruction comprises:
denoising the received voice information to obtain the user's voice information;
comparing the user's voice information with the speech model to obtain the user's voice instruction.
The aforementioned step of identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
denoising the collected background sound, determining the user's location from the denoised background sound, and obtaining the location information.
Considering that the background sound models of some environments may be very similar or identical, sound sources for identifying the environment can also be arranged in different environments in advance. The voice collection module collects sound in real time, and the speech chip compares the sound emitted by the collected preset sound source and the background sound in the environment respectively with the background sound model, determining the user's location from the comparison result. For example, a wind chime can mark whether the current environment is the living room or the kitchen; when the user issues a voice instruction in that environment, the speech chip can identify the location from the background sound emitted by that environmental sound source.
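A minimal sketch of this marker-based comparison, matching the ambient recording against stored templates by cross-correlation, follows; the template names and the 0.6 threshold are illustrative assumptions.

```python
# Match a pre-arranged acoustic marker (such as the wind chime above)
# against stored templates by cross-correlation.
import numpy as np
from scipy.signal import correlate

def match_marker(ambient, templates, threshold=0.6):
    best_name, best_score = None, threshold
    for name, template in templates.items():
        c = correlate(ambient, template, mode="valid")
        # bounded similarity in [0, 1] by the Cauchy-Schwarz inequality
        score = np.abs(c).max() / (np.linalg.norm(ambient) * np.linalg.norm(template))
        if score > best_score:
            best_name, best_score = name, score
    return best_name   # e.g. "living_room", or None if no marker is heard
```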
Specifically, the scheme of the present invention can be further understood through the following application scenario:
Scenario two: an elderly user utters the voice "turn on the light" to turn on the light of the current environment. The detailed process is as follows:
Step A1: the user issues a voice command to the companion robot: turn on the light;
Step B1: the first sound receiver of the companion robot receives the user's voice signal;
Step C1: the microprocessor of the companion robot retrieves the speech model, analyses the signal and obtains the first recognition result: turn on the light; at the same time it wakes up the second sound receiver to receive the background sound signal from the surrounding environment;
Step D1: since the user is between two environments (e.g. the living room and the kitchen), the microprocessor of the companion robot obtains the sounds emitted by the living-room and kitchen sound sources and, by analysing their differences, obtains the second recognition result: living room;
Step E1: the microprocessor of the companion robot comprehensively analyses the two results and obtains the final recognition result: turn on the light in the living room;
Step F1: the network device of the companion robot, according to the preset position information in the storage device, issues a command to the switch of the living-room light, causing it to execute the turn-on command.
The extraction and selection of acoustic features is an important link in speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, the purpose being to enable the pattern classifier to separate the patterns better.
Because of the time-varying characteristics of the speech signal, feature extraction must be carried out on a short segment of the speech signal, that is, by short-time analysis. Such a segment is regarded as a stationary analysis interval, normally called a frame, and the offset between frames usually takes 1/2 or 1/3 of the frame length. Pre-emphasis is usually applied to the signal to boost the high frequencies, and the signal is windowed to avoid the influence of the edges of the short-time speech segment.
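A short-time analysis sketch under these conventions is given below; the 25 ms frame length and the 0.97 pre-emphasis factor are common values assumed here, since the text only fixes the 1/2 (or 1/3) frame offset.

```python
# Pre-emphasis, framing with a hop of half the frame length, and a
# Hamming window, as described above.
import numpy as np

def frame_signal(y, sr, frame_ms=25, alpha=0.97):
    y = np.append(y[0], y[1:] - alpha * y[:-1])   # pre-emphasis boosts high frequencies
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len // 2                          # offset of 1/2 the frame length
    n_frames = 1 + (len(y) - frame_len) // hop
    starts = hop * np.arange(n_frames)[:, None]
    frames = y[starts + np.arange(frame_len)]     # shape: (n_frames, frame_len)
    return frames * np.hamming(frame_len)         # window softens the frame edges
```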
Some common acoustic features, with an illustrative computation after the list:
(1) Linear predictive coefficients (Linear Predictive Coefficient, LPC): linear prediction analysis starts from the human sound-production mechanism and, through the study of a short-tube cascade model of the vocal tract, holds that the system transfer function conforms to the form of an all-pole digital filter, so that the signal at moment n can be estimated from a linear combination of the signals at the preceding several moments.
(2) Cepstral coefficients: using homomorphic processing, take the discrete Fourier transform (DFT) of the speech signal, take the logarithm, and then take the inverse transform (iDFT) to obtain the cepstral coefficients.
(3) Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCCs) and perceptual linear prediction (Perceptual Linear Predictive, PLP): unlike acoustic features such as LPC derived from the study of the human sound-production mechanism, MFCC and PLP are acoustic features driven by and derived from research on the human auditory system.
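The snippet below computes the listed feature families on a stand-in signal; librosa is an assumed dependency, and PLP is omitted because it needs a dedicated implementation.

```python
# Illustrative computation of the features listed above.
import librosa
import numpy as np

sr = 16000
y = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # 1 s stand-in signal

lpc = librosa.lpc(y, order=12)                       # (1) linear predictor coefficients

cepstrum = np.fft.irfft(                             # (2) real cepstrum: inverse DFT of
    np.log(np.abs(np.fft.rfft(y)) + 1e-10))          #     the log-magnitude DFT

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (3) mel-frequency cepstral coefficients
```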
Chinese acoustic features: taking Mandarin Chinese as an example, the pronunciation of a character can be cut into two parts, the initial (shengmu) and the final (yunmu). During pronunciation, the transition from initial to final is gradual rather than instantaneous; the right-context-dependent initial/final model (Right-Context-Dependent Initial Final, RCDIF) is therefore used as the analysis method, which identifies the correct syllable (Syllable) more accurately.
Considering that the elderly and people with unclear articulation find it difficult to produce accurate pronunciation, the present invention divides pronunciation into the following four classes according to the distinguishing features of the initials and models each class:
Plosive (Plosive): after the lips close during pronunciation, the released airflow produces a sound similar to an explosion. Its vibration amplitude first drops to a minimum (representing the closing of the lips) and then rises steeply.
Fricative (Fricative): during pronunciation the tongue approaches the hard palate, forming a narrow channel; the airflow passing through causes turbulent friction, which produces the sound. Because the output airflow is steady during a fricative, the variation of the sound amplitude is smaller than that of a plosive.
Affricate (Affricate): the production model of this type combines the acoustic marks of the plosive and the fricative. Its main sounding mechanism is, like the fricative, the tongue approaching the hard palate so that the airflow passing through produces friction; but the channel is tighter, so the airflow escapes in an instant, producing a plosive-like feature.
Nasal (Nasal): during pronunciation the soft palate is lowered; the airflow expelled from the trachea is then blocked from entering the oral cavity and is diverted to the nasal cavity, so the nasal cavity and the oral cavity resonate.
Referring to Fig. 2, in one embodiment of the invention, when the user speaks, the plosive, fricative and nasal characteristic parameters in the user's voice information are obtained and compared with the corresponding preset models; when the vibration amplitude of a plosive, fricative or nasal is below the preset range, enhancement processing is applied to it. In this way, even if the user's pronunciation is inaccurate, the user's voice instruction can still be recognized accurately. For example, after the plosive, fricative and nasal characteristic parameters in the user's voice information are obtained, they are compared with the corresponding preset models; when the vibration amplitude of a plosive, fricative or nasal is within the preset range, analysis continues with the comparison and adjustment of the next characteristic parameter, until all parameters have been compared and adjusted.
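A sketch of this compare-and-enhance loop follows; the segment boundaries and class labels are assumed to come from the pre-stored models, and the preset amplitude floors are invented for illustration.

```python
# Boost consonant segments whose vibration amplitude falls below the
# preset range for their class, as in Fig. 2.
import numpy as np

PRESET_FLOOR = {"plosive": 0.30, "fricative": 0.15, "nasal": 0.20}  # illustrative values

def enhance_weak_consonants(signal, segments):
    """segments: list of (start_sample, end_sample, consonant_class)."""
    out = signal.copy()
    for start, end, cls in segments:
        amplitude = np.abs(out[start:end]).max()
        floor = PRESET_FLOOR[cls]
        if amplitude < floor:                               # below the preset range:
            out[start:end] *= floor / max(amplitude, 1e-6)  # enhance the segment
        # otherwise move on to the next characteristic parameter
    return out
```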
In the embodiment of the present invention, to further improve recognition accuracy, the recognition result can also be displayed in graphic and text form for the user to select or confirm, and output to an external device after the user selects or confirms it; and/or the recognition result can be broadcast to the user by voice and the user's feedback received. For example, when the voice instruction issued by the user is to turn on the air conditioner and the speech chip fails to recognize the instruction accurately, multiple candidate results (turn on the air conditioner, turn on the air-conditioner fan, turn on the fan) can be sent to the user interaction module; the user confirms via the touch screen, and the command to turn on the air conditioner is executed after confirmation.
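A minimal sketch of this confirmation fallback is shown below; the callbacks stand in for the touch-screen dialog and the command executor, and the candidate list mirrors the example in the text.

```python
def confirm_and_execute(candidates, ask_user, execute):
    # A single unambiguous result is executed directly; otherwise the
    # candidates are offered to the user for confirmation first.
    choice = candidates[0] if len(candidates) == 1 else ask_user(candidates)
    execute(choice)

confirm_and_execute(
    ["turn on the air conditioner", "turn on the air-conditioner fan", "turn on the fan"],
    ask_user=lambda options: options[0],   # stand-in for the touch-screen dialog
    execute=print,
)
```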
In a preferred embodiment of the present invention, the above method may further comprise:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it. For example, as the language ability of an elderly user gradually declines, multiple periods can be preset; the change in the elderly user's voice is judged from the different pronunciations of the same voice instruction collected within one period, and the speech model is updated to adapt to it.
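One way to realize the linear analysis is sketched below: fit a per-dimension linear trend to a feature vector of the same instruction collected at the preset moments and extrapolate it as the updated reference. Using a mean MFCC vector as that feature is an assumption, not something the text specifies.

```python
# Fit a linear drift to feature vectors of the same instruction collected
# at several preset moments and extrapolate the next reference.
import numpy as np

def updated_reference(timestamps, feature_vectors):
    t = np.asarray(timestamps, dtype=float)
    X = np.asarray(feature_vectors)            # one row per recording
    slope = np.polyfit(t, X, deg=1)[0]         # linear drift of each feature dimension
    return X[-1] + slope * (t[-1] - t[-2])     # one step ahead of the last sample
```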
The present invention also provides a speech recognition device for realizing the above method. Referring to Fig. 3, the speech recognition device comprises:
a voice collection module 10 for listening for voice information uttered by a user; in the embodiment of the present invention, the voice collection module 10 may be a listening device such as the microphone of an intelligent terminal such as a mobile phone, tablet computer or robot, used for collecting the voice information uttered by the user;
a first processing module 20 for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model; the first processing module 20 may be a speech processing chip which, when voice information is collected, denoises it to obtain the voice instruction issued by the user;
a background sound listening module 30 for collecting the background sound of the user's surrounding environment; the background sound listening module 30 comprises listening devices such as microphones arranged at different positions, used for collecting the background sound emitted in the environment; after the voice instruction is obtained, the smart device such as a mobile phone, tablet or robot can wake up the background sound listening module 30 through a chip to detect and receive the background sound in the environment;
a second processing module 40 for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result; for example, the speech chip analyses the background sound and judges from the volume of the sound whether the user is outdoors or indoors, and may further judge from the volume or type of the sound whether the user is in a bedroom, living room or kitchen;
an output module 50 for combining the voice instruction with the location information to form and output a final recognition result; when both the voice instruction and the location information are clear, the recognition result is produced and output. In the embodiment of the present invention, when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
In a preferred embodiment, the above speech recognition device further comprises:
a model building module 60 for training and modelling the user's voice information and the background sound to form and store a speech model and a background sound model. In the embodiment of the present invention, the model building module 60 collects, trains and models the voices of people with inaccurate pronunciation or speech impairments so that the user's pronunciation can be correctly recognized in application. In addition, the model building module 60 collects and models indoor and outdoor background sounds to identify the user's environment; for example, the background sound of several bedroom environments can be collected at different times of day, trained, modelled and stored, and in practical application the background sound model can be retrieved for comparison, thereby determining the user's environment.
Referring to Fig. 4, in one embodiment the first processing module 20 comprises:
a denoising unit 21 for denoising the received voice information to obtain the user's voice information;
a voice instruction acquisition unit 22 for comparing the user's voice information with the speech model to obtain the user's voice instruction.
The second processing module 40 comprises:
a location information acquisition unit 41 for denoising the collected background sound, determining the user's location from the denoised background sound, and obtaining the location information.
Preferably, the voice instruction acquisition unit 22 is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models; and apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range. In this way, even if the user's pronunciation is inaccurate, the user's voice instruction can still be recognized accurately.
In one embodiment of the invention, the above device may further comprise:
an update module 70 for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it. For example, as the language ability of an elderly user gradually declines, multiple periods can be preset; the update module 70 judges the change in the elderly user's voice from the different pronunciations of the same voice instruction collected within one period and updates the speech model to adapt to it.
The present invention further provides a computer-readable storage medium storing a computer program of computer-executable instructions which, when executed by a processor, implements the aforementioned speech recognition method. The computer-readable storage medium provided by the present invention can store a program for realizing the aforementioned speech recognition method and can be carried by and loaded on a computer device; such a computer device may be an intelligent terminal such as a mobile phone, tablet computer or service robot.
The above are only preferred embodiments of the present invention and are not intended to limit its scope. Any modification, equivalent replacement or improvement made within the spirit and scope of the present invention is contained within the protection scope of the present invention.

Claims (10)

1. A speech recognition method, characterized in that the method comprises:
listening for voice information uttered by a user;
denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
collecting the background sound of the user's surrounding environment;
identifying the background sound according to a pre-stored background sound model, and determining the user's location from the recognition result;
combining the voice instruction with the location information to form and output a final recognition result.
2. The method according to claim 1, characterized in that denoising the voice information and identifying the user's voice instruction according to the pre-stored speech model comprises:
obtaining the plosive, fricative and nasal characteristic parameters in the user's voice information and comparing them with the corresponding preset models;
applying enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
3. The method according to claim 1 or 2, characterized by further comprising:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
4. The method according to claim 3, characterized in that identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
comparing the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determining the user's location from the comparison result.
5. The method according to claim 4, characterized by further comprising: displaying the recognition result in graphic and text form for the user to select or confirm, and outputting the recognition result to an external device after the user selects or confirms it; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
6. A speech recognition device, characterized by comprising:
a voice collection module for listening for voice information uttered by a user;
a first processing module for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
a background sound listening module for collecting the background sound of the user's surrounding environment;
a second processing module for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result;
an output module for combining the voice instruction with the location information to form and output a final recognition result.
7. The speech recognition device according to claim 6, characterized in that the voice collection module is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models;
apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
8. The speech recognition device according to claim 6 or 7, characterized by further comprising:
an update module for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
9. The speech recognition device according to claim 6, characterized in that the first processing module is configured to:
compare the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determine the user's location from the comparison result.
10. A computer-readable storage medium, characterized in that a computer program of computer-executable instructions is stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, implement the speech recognition method according to any one of claims 1 to 5.
CN201811644306.5A 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium Active CN109545196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811644306.5A CN109545196B (en) 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN109545196A (en) 2019-03-29
CN109545196B CN109545196B (en) 2022-11-29

Family

ID=65831549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811644306.5A Active CN109545196B (en) 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109545196B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080300871A1 (en) * 2007-05-29 2008-12-04 At&T Corp. Method and apparatus for identifying acoustic background environments to enhance automatic speech recognition
CN102918591A (en) * 2010-04-14 2013-02-06 谷歌公司 Geotagged environmental audio for enhanced speech recognition accuracy
CN105580071A (en) * 2013-05-06 2016-05-11 谷歌技术控股有限责任公司 Method and apparatus for training a voice recognition model database
CN104143342A (en) * 2013-05-15 2014-11-12 腾讯科技(深圳)有限公司 Voiceless sound and voiced sound judging method and device and voice synthesizing system
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN105913039A (en) * 2016-04-26 2016-08-31 北京光年无限科技有限公司 Visual-and-vocal sense based dialogue data interactive processing method and apparatus
CN106941506A (en) * 2017-05-17 2017-07-11 北京京东尚科信息技术有限公司 Data processing method and device based on biological characteristic
CN107742517A (en) * 2017-10-10 2018-02-27 广东中星电子有限公司 A kind of detection method and device to abnormal sound
CN108877773A (en) * 2018-06-12 2018-11-23 广东小天才科技有限公司 Voice recognition method and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109974225A (en) * 2019-04-09 2019-07-05 珠海格力电器股份有限公司 Air conditioner control method and device, storage medium and air conditioner
CN110473547A (en) * 2019-07-12 2019-11-19 云知声智能科技股份有限公司 A kind of audio recognition method
CN110473547B (en) * 2019-07-12 2021-07-30 云知声智能科技股份有限公司 Speech recognition method
CN110867184A (en) * 2019-10-23 2020-03-06 张家港市祥隆五金厂 Voice intelligent terminal equipment

Also Published As

Publication number Publication date
CN109545196B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
US11875820B1 (en) Context driven device arbitration
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US9728188B1 (en) Methods and devices for ignoring similar audio being received by a system
Eyben et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing
WO2017084360A1 (en) Method and system for speech recognition
WO2020043123A1 (en) Named-entity recognition method, named-entity recognition apparatus and device, and medium
WO2019148586A1 (en) Method and device for speaker recognition during multi-person speech
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
CN109545196A (en) Audio recognition method, device and computer readable storage medium
US20190279644A1 (en) Speech processing device, speech processing method, and recording medium
WO2016173132A1 (en) Method and device for voice recognition, and user equipment
CN109524011A (en) A kind of refrigerator awakening method and device based on Application on Voiceprint Recognition
US20220238118A1 (en) Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription
US20120078625A1 (en) Waveform analysis of speech
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
CN112133277A (en) Sample generation method and device
CN110827853A (en) Voice feature information extraction method, terminal and readable storage medium
WO2020062679A1 (en) End-to-end speaker diarization method and system employing deep learning
WO2020052135A1 (en) Music recommendation method and apparatus, computing apparatus, and storage medium
Ronzhin et al. Speaker turn detection based on multimodal situation analysis
Sahoo et al. MFCC feature with optimized frequency range: An essential step for emotion recognition
JP2013182150A (en) Speech production section detector and computer program for speech production section detection
Nguyen et al. Vietnamese voice recognition for home automation using MFCC and DTW techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant