CN109545196A - Audio recognition method, device and computer readable storage medium - Google Patents

Audio recognition method, device and computer readable storage medium

Info

Publication number
CN109545196A
Authority
CN
China
Prior art keywords
user
sound
background sound
speech
model
Prior art date
Legal status
Granted
Application number
CN201811644306.5A
Other languages
Chinese (zh)
Other versions
CN109545196B (en)
Inventor
袁晖
Current Assignee
Shenzhen Comexe Ikang Science And Technology Co Ltd
Original Assignee
Shenzhen Comexe Ikang Science And Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Comexe Ikang Science And Technology Co Ltd filed Critical Shenzhen Comexe Ikang Science And Technology Co Ltd
Priority to CN201811644306.5A priority Critical patent/CN109545196B/en
Publication of CN109545196A publication Critical patent/CN109545196A/en
Application granted granted Critical
Publication of CN109545196B publication Critical patent/CN109545196B/en
Status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208: Noise filtering
    • G10L2015/0635: Training: updating or merging of old and new templates; mean values; weighting
    • G10L2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Manipulator (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a speech recognition method, the method comprising: listening for voice information uttered by a user; denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model; collecting the background sound of the user's surrounding environment; identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result; and combining the voice instruction with the location information to form and output a final recognition result. The invention further discloses a speech recognition device and a computer-readable storage medium. The invention improves the speech recognition accuracy of intelligent terminals.

Description

Audio recognition method, device and computer readable storage medium
Technical field
The present invention relates to the field of speech recognition, and more particularly to a speech recognition method, device and computer-readable storage medium.
Background technique
With the development of science and technology and the progress of computer technology, speech recognition has been applied to many fields of daily life and industry. The prior art offers a variety of speech recognition methods and devices for human-computer interaction, which have contributed greatly to the economic development of society. However, existing speech recognition technology is generally only able to recognize the pronunciation of unimpaired speakers; when a user's pronunciation is inaccurate or the user has a speech impairment, existing technology either fails to recognize the speech or recognizes it inaccurately. Take the elderly as an example: with advancing age, language-related disorders such as aphasia occur at a high rate among the elderly. A patient with aphasia may still be able to speak, read or write, yet has impaired language expression, while intelligence is not affected by the disorder. Existing speech recognition technology struggles to recognize the speech of people with aphasia, or its accuracy drops sharply, so related applications are hard to realize; for example, when speech recognition is applied to a companion robot, the robot can hardly fulfil its role because it has difficulty recognizing such speech.
In view of this, it is necessary to provide a speech recognition technology that improves the accuracy of speech recognition and extends the range of applications of speech recognition technology.
Summary of the invention
The main purpose of the present invention is to provide a speech recognition method intended to improve the accuracy of speech recognition and extend the range of applications of speech recognition technology.
To achieve the above goal, the present invention provides a speech recognition method, the method comprising:
listening for voice information uttered by a user;
denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
collecting the background sound of the user's surrounding environment;
identifying the background sound according to a pre-stored background sound model, and determining the user's location from the recognition result;
combining the voice instruction with the location information to form and output a final recognition result.
Preferably, denoising the voice information and identifying the user's voice instruction according to the pre-stored speech model comprises:
obtaining the plosive, fricative and nasal characteristic parameters in the user's voice information and comparing them with the corresponding preset models;
applying enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
Preferably, the above method further comprises:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
Preferably, identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
comparing the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determining the user's location from the comparison result.
Preferably, the above method may further comprise: displaying the recognition result in graphic and text form for the user to select or confirm, and outputting the recognition result to an external device after the user selects or confirms it; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
The present invention also provides a speech recognition device, comprising:
a voice collection module for listening for voice information uttered by a user;
a first processing module for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
a background sound listening module for collecting the background sound of the user's surrounding environment;
a second processing module for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result;
an output module for combining the voice instruction with the location information to form and output a final recognition result.
Preferably, the voice collection module is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models;
apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
Preferably, the above device further comprises:
an update module for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
The present invention further provides a computer-readable storage medium storing a computer program of computer-executable instructions which, when executed by a processor, implements the aforementioned speech recognition method.
By extracting the user's voice instruction and the background sound, the present invention combines the identification of the user's voice instruction with the recognition of the background sound in the environment; when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
Detailed description of the invention
Fig. 1 is a flow diagram of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flow diagram of the steps of comparing the user's voice information with the speech model to obtain the user's voice instruction in the speech recognition method according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 4 is a structural diagram of the first processing module and the second processing module of the speech recognition device according to an embodiment of the present invention.
Specific embodiment
The realization of the object, the functional characteristics and the advantages of the present invention are further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
Referring to Fig. 1, the present invention provides a speech recognition method, the method comprising:
Step S10: listen for voice information uttered by a user. In the embodiment of the present invention, a voice listening device may be arranged in a smart device such as a mobile phone, tablet or robot to collect the voice information uttered by the user.
Step S20: denoise the voice information and identify the user's voice instruction according to a pre-stored speech model. When voice information is collected, it is denoised by a speech chip to obtain the voice instruction issued by the user.
Step S30: collect the background sound of the user's surrounding environment. After the voice instruction is obtained, a second voice listening device in the smart device such as a mobile phone, tablet or robot is woken up to detect and receive the background sound in the environment.
Step S40: identify the background sound according to a pre-stored background sound model, and determine the user's location from the recognition result. For example, the speech chip analyses the background sound and judges from the volume of the sound whether the user is outdoors or indoors; further, it may judge from the volume or type of the sound whether the user is in a bedroom, living room or kitchen.
Step S50: combine the voice instruction with the location information to form and output a final recognition result. When both the voice instruction and the location information are clear, the recognition result is produced and output. In the embodiment of the present invention, when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
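The following Python sketch shows how steps S10 to S50 fit together. It is a minimal outline, not the patented implementation: the listener callables and the two model objects are hypothetical placeholders for the listening devices and the pre-stored models described above.

```python
# Minimal outline of steps S10-S50; all names are hypothetical placeholders.

def recognize(listen_user, listen_background, speech_model, background_model):
    raw_voice = listen_user()                       # S10: listen for the user's voice

    voice = speech_model.denoise(raw_voice)         # S20: denoise the voice information...
    instruction = speech_model.identify(voice)      # ...and match the pre-stored speech model

    ambient = listen_background()                   # S30: wake the second listening device
                                                    #      and capture the background sound

    location = background_model.identify(ambient)   # S40: infer the location, e.g. "bedroom"

    return {"instruction": instruction,             # S50: combine both recognition results
            "location": location}                   #      into the final output
```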
The following application scenario further illustrates the detailed speech recognition scheme of the present invention:
Scenario one: an elderly user only needs to say "air conditioner" or "turn on the air conditioner" for the robot to turn on the air conditioner in the bedroom. The detailed process is as follows:
Step A: the user issues a voice command to the companion robot;
Step B: the first sound receiver of the companion robot receives the user's voice signal;
Step C: the microprocessor of the companion robot analyses the signal and obtains the first recognition result: turn on the air conditioner; at the same time it wakes up the second sound receiver to receive the background sound signal from the surrounding environment;
Step D: the microprocessor of the companion robot analyses the background sound and obtains the second recognition result: bedroom;
Step E: the microprocessor of the companion robot comprehensively analyses the two results and obtains the final recognition result: turn on the air conditioner in the bedroom;
Step F: the network device of the companion robot, according to the preset position information in the storage device, issues an operation command to the air conditioner in the bedroom, causing it to start and run (a sketch of this step follows the list).
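As a sketch of step F alone, the snippet below maps the combined recognition result onto a device address drawn from a table standing in for the robot's preset position information; the registry contents and addresses are invented for illustration, and the actual network call is elided.

```python
# Hypothetical registry standing in for the "preset position information"
# held in the robot's storage device; the addresses are invented.
DEVICE_REGISTRY = {
    ("air_conditioner", "bedroom"): "10.0.0.21",
    ("light", "living_room"): "10.0.0.35",
}

def dispatch(device, location, action):
    """Step F: issue the operation command to the device at the recognized location."""
    address = DEVICE_REGISTRY.get((device, location))
    if address is None:
        raise LookupError(f"no {device} registered in {location}")
    print(f"sending '{action}' to {device} at {address}")

dispatch("air_conditioner", "bedroom", "power_on")
```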
In the embodiment of the present invention, before all the steps are executed, the method may further comprise: training and modelling the user's voice information and the background sound to form and store a speech model and a background sound model. In the embodiment of the present invention, the voices of people with inaccurate pronunciation or speech impairments are collected, trained and modelled so that the user's pronunciation can be correctly recognized in application. In addition, indoor and outdoor background sounds are collected and modelled to identify the user's environment; for example, the background sound of several bedroom environments can be collected at different times of day, trained, modelled and stored, and in practical application the background sound model can be retrieved for comparison, thereby determining the user's environment.
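One possible realization of the background sound modelling, assuming MFCC features with one Gaussian mixture per room, is sketched below; the patent does not prescribe a model family, so librosa and scikit-learn are assumed dependencies chosen for illustration.

```python
# Sketch: one background sound model per environment, trained on recordings
# collected at different times of day. MFCC features and Gaussian mixtures
# are an assumed choice; the patent names no specific model family.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_model(wav_paths, n_components=8):
    frames = []
    for path in wav_paths:
        y, sr = librosa.load(path, sr=16000)
        frames.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T)
    return GaussianMixture(n_components=n_components).fit(np.vstack(frames))

def identify_location(wav_path, models):
    """Score the ambient recording against every stored model and return
    the name of the best-matching environment."""
    y, sr = librosa.load(wav_path, sr=16000)
    X = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    return max(models, key=lambda name: models[name].score(X))

# models = {"bedroom": train_background_model([...]),
#           "living_room": train_background_model([...])}
```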
It can be understood that the aforementioned step of denoising the voice information to identify the user's voice instruction comprises:
denoising the received voice information to obtain the user's voice information;
comparing the user's voice information with the speech model to obtain the user's voice instruction.
The aforementioned step of identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
denoising the collected background sound, determining the user's location from the denoised background sound, and obtaining the location information.
Considering that the background sound models of some environments may be very similar or identical, sound sources for identifying the environment can also be arranged in different environments in advance. The voice collection module collects sound in real time, and the speech chip compares the sound emitted by the collected preset sound source and the background sound in the environment respectively with the background sound model, determining the user's location from the comparison result. For example, a wind chime can mark whether the current environment is the living room or the kitchen; when the user issues a voice instruction in that environment, the speech chip can identify the location from the background sound emitted by that environmental sound source.
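A minimal sketch of this marker-based comparison, matching the ambient recording against stored templates by cross-correlation, follows; the template names and the 0.6 threshold are illustrative assumptions.

```python
# Match a pre-arranged acoustic marker (such as the wind chime above)
# against stored templates by cross-correlation.
import numpy as np
from scipy.signal import correlate

def match_marker(ambient, templates, threshold=0.6):
    best_name, best_score = None, threshold
    for name, template in templates.items():
        c = correlate(ambient, template, mode="valid")
        # bounded similarity in [0, 1] by the Cauchy-Schwarz inequality
        score = np.abs(c).max() / (np.linalg.norm(ambient) * np.linalg.norm(template))
        if score > best_score:
            best_name, best_score = name, score
    return best_name   # e.g. "living_room", or None if no marker is heard
```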
Specifically, the scheme of the present invention can be further understood through the following application scenario:
Scenario two: an elderly user utters the voice "turn on the light" to turn on the light of the current environment. The detailed process is as follows:
Step A1: the user issues a voice command to the companion robot: turn on the light;
Step B1: the first sound receiver of the companion robot receives the user's voice signal;
Step C1: the microprocessor of the companion robot retrieves the speech model, analyses the signal and obtains the first recognition result: turn on the light; at the same time it wakes up the second sound receiver to receive the background sound signal from the surrounding environment;
Step D1: since the user is between two environments (e.g. the living room and the kitchen), the microprocessor of the companion robot obtains the sounds emitted by the living-room and kitchen sound sources and, by analysing their differences, obtains the second recognition result: living room;
Step E1: the microprocessor of the companion robot comprehensively analyses the two results and obtains the final recognition result: turn on the light in the living room;
Step F1: the network device of the companion robot, according to the preset position information in the storage device, issues a command to the switch of the living-room light, causing it to execute the turn-on command.
The extraction and selection of acoustic features is an important link in speech recognition. Acoustic feature extraction is both a process of substantial information compression and a process of signal deconvolution, the purpose being to enable the pattern classifier to separate the patterns better.
Because of the time-varying characteristics of the speech signal, feature extraction must be carried out on a short segment of the speech signal, that is, by short-time analysis. Such a segment is regarded as a stationary analysis interval, normally called a frame, and the offset between frames usually takes 1/2 or 1/3 of the frame length. Pre-emphasis is usually applied to the signal to boost the high frequencies, and the signal is windowed to avoid the influence of the edges of the short-time speech segment.
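A short-time analysis sketch under these conventions is given below; the 25 ms frame length and the 0.97 pre-emphasis factor are common values assumed here, since the text only fixes the 1/2 (or 1/3) frame offset.

```python
# Pre-emphasis, framing with a hop of half the frame length, and a
# Hamming window, as described above.
import numpy as np

def frame_signal(y, sr, frame_ms=25, alpha=0.97):
    y = np.append(y[0], y[1:] - alpha * y[:-1])   # pre-emphasis boosts high frequencies
    frame_len = int(sr * frame_ms / 1000)
    hop = frame_len // 2                          # offset of 1/2 the frame length
    n_frames = 1 + (len(y) - frame_len) // hop
    starts = hop * np.arange(n_frames)[:, None]
    frames = y[starts + np.arange(frame_len)]     # shape: (n_frames, frame_len)
    return frames * np.hamming(frame_len)         # window softens the frame edges
```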
Some common acoustic features, with an illustrative computation after the list:
(1) Linear predictive coefficients (Linear Predictive Coefficient, LPC): linear prediction analysis starts from the human sound-production mechanism and, through the study of a short-tube cascade model of the vocal tract, holds that the system transfer function conforms to the form of an all-pole digital filter, so that the signal at moment n can be estimated from a linear combination of the signals at the preceding several moments.
(2) Cepstral coefficients: using homomorphic processing, take the discrete Fourier transform (DFT) of the speech signal, take the logarithm, and then take the inverse transform (iDFT) to obtain the cepstral coefficients.
(3) Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCCs) and perceptual linear prediction (Perceptual Linear Predictive, PLP): unlike acoustic features such as LPC derived from the study of the human sound-production mechanism, MFCC and PLP are acoustic features driven by and derived from research on the human auditory system.
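The snippet below computes the listed feature families on a stand-in signal; librosa is an assumed dependency, and PLP is omitted because it needs a dedicated implementation.

```python
# Illustrative computation of the features listed above.
import librosa
import numpy as np

sr = 16000
y = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # 1 s stand-in signal

lpc = librosa.lpc(y, order=12)                       # (1) linear predictor coefficients

cepstrum = np.fft.irfft(                             # (2) real cepstrum: inverse DFT of
    np.log(np.abs(np.fft.rfft(y)) + 1e-10))          #     the log-magnitude DFT

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (3) mel-frequency cepstral coefficients
```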
Chinese acoustic features: taking Mandarin Chinese as an example, the pronunciation of a character can be cut into two parts, the initial (shengmu) and the final (yunmu). During pronunciation, the transition from initial to final is gradual rather than instantaneous; the right-context-dependent initial/final model (Right-Context-Dependent Initial Final, RCDIF) is therefore used as the analysis method, which identifies the correct syllable (Syllable) more accurately.
Considering that the elderly and people with unclear articulation find it difficult to produce accurate pronunciation, the present invention divides pronunciation into the following four classes according to the distinguishing features of the initials and models each class:
Plosive (Plosive): after the lips close during pronunciation, the released airflow produces a sound similar to an explosion. Its vibration amplitude first drops to a minimum (representing the closing of the lips) and then rises steeply.
Fricative (Fricative): during pronunciation the tongue approaches the hard palate, forming a narrow channel; the airflow passing through causes turbulent friction, which produces the sound. Because the output airflow is steady during a fricative, the variation of the sound amplitude is smaller than that of a plosive.
Affricate (Affricate): the production model of this type combines the acoustic marks of the plosive and the fricative. Its main sounding mechanism is, like the fricative, the tongue approaching the hard palate so that the airflow passing through produces friction; but the channel is tighter, so the airflow escapes in an instant, producing a plosive-like feature.
Nasal (Nasal): during pronunciation the soft palate is lowered; the airflow expelled from the trachea is then blocked from entering the oral cavity and is diverted to the nasal cavity, so the nasal cavity and the oral cavity resonate.
Referring to Fig. 2, in one embodiment of the invention, when the user speaks, the plosive, fricative and nasal characteristic parameters in the user's voice information are obtained and compared with the corresponding preset models; when the vibration amplitude of a plosive, fricative or nasal is below the preset range, enhancement processing is applied to it. In this way, even if the user's pronunciation is inaccurate, the user's voice instruction can still be recognized accurately. For example, after the plosive, fricative and nasal characteristic parameters in the user's voice information are obtained, they are compared with the corresponding preset models; when the vibration amplitude of a plosive, fricative or nasal is within the preset range, analysis continues with the comparison and adjustment of the next characteristic parameter, until all parameters have been compared and adjusted.
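A sketch of this compare-and-enhance loop follows; the segment boundaries and class labels are assumed to come from the pre-stored models, and the preset amplitude floors are invented for illustration.

```python
# Boost consonant segments whose vibration amplitude falls below the
# preset range for their class, as in Fig. 2.
import numpy as np

PRESET_FLOOR = {"plosive": 0.30, "fricative": 0.15, "nasal": 0.20}  # illustrative values

def enhance_weak_consonants(signal, segments):
    """segments: list of (start_sample, end_sample, consonant_class)."""
    out = signal.copy()
    for start, end, cls in segments:
        amplitude = np.abs(out[start:end]).max()
        floor = PRESET_FLOOR[cls]
        if amplitude < floor:                               # below the preset range:
            out[start:end] *= floor / max(amplitude, 1e-6)  # enhance the segment
        # otherwise move on to the next characteristic parameter
    return out
```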
In the embodiment of the present invention, to further improve recognition accuracy, the recognition result can also be displayed in graphic and text form for the user to select or confirm, and output to an external device after the user selects or confirms it; and/or the recognition result can be broadcast to the user by voice and the user's feedback received. For example, when the voice instruction issued by the user is to turn on the air conditioner and the speech chip fails to recognize the instruction accurately, multiple candidate results (turn on the air conditioner, turn on the air-conditioner fan, turn on the fan) can be sent to the user interaction module; the user confirms via the touch screen, and the command to turn on the air conditioner is executed after confirmation.
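A minimal sketch of this confirmation fallback is shown below; the callbacks stand in for the touch-screen dialog and the command executor, and the candidate list mirrors the example in the text.

```python
def confirm_and_execute(candidates, ask_user, execute):
    # A single unambiguous result is executed directly; otherwise the
    # candidates are offered to the user for confirmation first.
    choice = candidates[0] if len(candidates) == 1 else ask_user(candidates)
    execute(choice)

confirm_and_execute(
    ["turn on the air conditioner", "turn on the air-conditioner fan", "turn on the fan"],
    ask_user=lambda options: options[0],   # stand-in for the touch-screen dialog
    execute=print,
)
```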
In a preferred embodiment of the present invention, the above method may further comprise:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it. For example, as the language ability of an elderly user gradually declines, multiple periods can be preset; the change in the elderly user's voice is judged from the different pronunciations of the same voice instruction collected within one period, and the speech model is updated to adapt to it.
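One way to realize the linear analysis is sketched below: fit a per-dimension linear trend to a feature vector of the same instruction collected at the preset moments and extrapolate it as the updated reference. Using a mean MFCC vector as that feature is an assumption, not something the text specifies.

```python
# Fit a linear drift to feature vectors of the same instruction collected
# at several preset moments and extrapolate the next reference.
import numpy as np

def updated_reference(timestamps, feature_vectors):
    t = np.asarray(timestamps, dtype=float)
    X = np.asarray(feature_vectors)            # one row per recording
    slope = np.polyfit(t, X, deg=1)[0]         # linear drift of each feature dimension
    return X[-1] + slope * (t[-1] - t[-2])     # one step ahead of the last sample
```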
The present invention also provides a speech recognition device for realizing the above method. Referring to Fig. 3, the speech recognition device comprises:
a voice collection module 10 for listening for voice information uttered by a user; in the embodiment of the present invention, the voice collection module 10 may be a listening device such as the microphone of an intelligent terminal such as a mobile phone, tablet computer or robot, used for collecting the voice information uttered by the user;
a first processing module 20 for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model; the first processing module 20 may be a speech processing chip which, when voice information is collected, denoises it to obtain the voice instruction issued by the user;
a background sound listening module 30 for collecting the background sound of the user's surrounding environment; the background sound listening module 30 comprises listening devices such as microphones arranged at different positions, used for collecting the background sound emitted in the environment; after the voice instruction is obtained, the smart device such as a mobile phone, tablet or robot can wake up the background sound listening module 30 through a chip to detect and receive the background sound in the environment;
a second processing module 40 for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result; for example, the speech chip analyses the background sound and judges from the volume of the sound whether the user is outdoors or indoors, and may further judge from the volume or type of the sound whether the user is in a bedroom, living room or kitchen;
an output module 50 for combining the voice instruction with the location information to form and output a final recognition result; when both the voice instruction and the location information are clear, the recognition result is produced and output. In the embodiment of the present invention, when the user's pronunciation is incomplete or unclear, the recognition result of the environment is used to judge the user's true intention, thereby improving speech recognition accuracy.
In a preferred embodiment, the above speech recognition device further comprises:
a model building module 60 for training and modelling the user's voice information and the background sound to form and store a speech model and a background sound model. In the embodiment of the present invention, the model building module 60 collects, trains and models the voices of people with inaccurate pronunciation or speech impairments so that the user's pronunciation can be correctly recognized in application. In addition, the model building module 60 collects and models indoor and outdoor background sounds to identify the user's environment; for example, the background sound of several bedroom environments can be collected at different times of day, trained, modelled and stored, and in practical application the background sound model can be retrieved for comparison, thereby determining the user's environment.
Referring to Fig. 4, in one embodiment the first processing module 20 comprises:
a denoising unit 21 for denoising the received voice information to obtain the user's voice information;
a voice instruction acquisition unit 22 for comparing the user's voice information with the speech model to obtain the user's voice instruction.
The second processing module 40 comprises:
a location information acquisition unit 41 for denoising the collected background sound, determining the user's location from the denoised background sound, and obtaining the location information.
Preferably, the voice instruction acquisition unit 22 is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models; and apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range. In this way, even if the user's pronunciation is inaccurate, the user's voice instruction can still be recognized accurately.
In one embodiment of the invention, the above device may further comprise:
an update module 70 for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it. For example, as the language ability of an elderly user gradually declines, multiple periods can be preset; the update module 70 judges the change in the elderly user's voice from the different pronunciations of the same voice instruction collected within one period and updates the speech model to adapt to it.
The present invention further provides a computer-readable storage medium storing a computer program of computer-executable instructions which, when executed by a processor, implements the aforementioned speech recognition method. The computer-readable storage medium provided by the present invention can store a program for realizing the aforementioned speech recognition method and can be carried by and loaded on a computer device; such a computer device may be an intelligent terminal such as a mobile phone, tablet computer or service robot.
The above are only preferred embodiments of the present invention and are not intended to limit its scope. Any modification, equivalent replacement or improvement made within the spirit and scope of the present invention is contained within the protection scope of the present invention.

Claims (10)

1. A speech recognition method, characterized in that the method comprises:
listening for voice information uttered by a user;
denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
collecting the background sound of the user's surrounding environment;
identifying the background sound according to a pre-stored background sound model, and determining the user's location from the recognition result;
combining the voice instruction with the location information to form and output a final recognition result.
2. The method according to claim 1, characterized in that denoising the voice information and identifying the user's voice instruction according to the pre-stored speech model comprises:
obtaining the plosive, fricative and nasal characteristic parameters in the user's voice information and comparing them with the corresponding preset models;
applying enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
3. The method according to claim 1 or 2, characterized by further comprising:
linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
4. The method according to claim 3, characterized in that identifying the background sound according to the pre-stored background sound model and determining the user's location from the recognition result comprises:
comparing the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determining the user's location from the comparison result.
5. The method according to claim 4, characterized by further comprising: displaying the recognition result in graphic and text form for the user to select or confirm, and outputting the recognition result to an external device after the user selects or confirms it; and/or broadcasting the recognition result to the user by voice and receiving the user's feedback.
6. A speech recognition device, characterized by comprising:
a voice collection module for listening for voice information uttered by a user;
a first processing module for denoising the voice information and identifying the user's voice instruction according to a pre-stored speech model;
a background sound listening module for collecting the background sound of the user's surrounding environment;
a second processing module for identifying the background sound according to a pre-stored background sound model and determining the user's location from the recognition result;
an output module for combining the voice instruction with the location information to form and output a final recognition result.
7. The speech recognition device according to claim 6, characterized in that the voice collection module is configured to:
obtain the plosive, fricative and nasal characteristic parameters in the user's voice information and compare them with the corresponding preset models;
apply enhancement processing to a plosive, fricative or nasal when its vibration amplitude is below the preset range.
8. The speech recognition device according to claim 6 or 7, characterized by further comprising:
an update module for linearly analysing the change in the user's voice from the voice information collected at multiple preset moments, forming a new speech model from the analysis result and storing it.
9. The speech recognition device according to claim 6, characterized in that the first processing module is configured to:
compare the sound emitted by a collected preset sound source and the background sound in the environment respectively with the background sound model, and determine the user's location from the comparison result.
10. A computer-readable storage medium, characterized in that a computer program of computer-executable instructions is stored in the computer-readable storage medium, and the computer-executable instructions, when executed by a processor, implement the speech recognition method according to any one of claims 1 to 5.
CN201811644306.5A 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium Active CN109545196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811644306.5A CN109545196B (en) 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium


Publications (2)

Publication Number Publication Date
CN109545196A (en) 2019-03-29
CN109545196B CN109545196B (en) 2022-11-29

Family

ID=65831549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811644306.5A Active CN109545196B (en) 2018-12-29 2018-12-29 Speech recognition method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109545196B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080300871A1 (en) * 2007-05-29 2008-12-04 At&T Corp. Method and apparatus for identifying acoustic background environments to enhance automatic speech recognition
CN102918591A (en) * 2010-04-14 2013-02-06 谷歌公司 Geotagged environmental audio for enhanced speech recognition accuracy
CN105580071A (en) * 2013-05-06 2016-05-11 谷歌技术控股有限责任公司 Method and apparatus for training a voice recognition model database
CN104143342A (en) * 2013-05-15 2014-11-12 腾讯科技(深圳)有限公司 Voiceless sound and voiced sound judging method and device and voice synthesizing system
CN105448292A (en) * 2014-08-19 2016-03-30 北京羽扇智信息科技有限公司 Scene-based real-time voice recognition system and method
CN105913039A (en) * 2016-04-26 2016-08-31 北京光年无限科技有限公司 Visual-and-vocal sense based dialogue data interactive processing method and apparatus
CN106941506A (en) * 2017-05-17 2017-07-11 北京京东尚科信息技术有限公司 Data processing method and device based on biological characteristic
CN107742517A (en) * 2017-10-10 2018-02-27 广东中星电子有限公司 A kind of detection method and device to abnormal sound
CN108877773A (en) * 2018-06-12 2018-11-23 广东小天才科技有限公司 Voice recognition method and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109974225A (en) * 2019-04-09 2019-07-05 珠海格力电器股份有限公司 Air conditioner control method and device, storage medium and air conditioner
CN110473547A (en) * 2019-07-12 2019-11-19 云知声智能科技股份有限公司 A kind of audio recognition method
CN110473547B (en) * 2019-07-12 2021-07-30 云知声智能科技股份有限公司 Speech recognition method
CN110867184A (en) * 2019-10-23 2020-03-06 张家港市祥隆五金厂 Voice intelligent terminal equipment

Also Published As

Publication number Publication date
CN109545196B (en) 2022-11-29

Similar Documents

Publication Publication Date Title
US11875820B1 (en) Context driven device arbitration
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US9728188B1 (en) Methods and devices for ignoring similar audio being received by a system
Eyben et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing
WO2017084360A1 (en) Method and system for speech recognition
WO2020043123A1 (en) Named-entity recognition method, named-entity recognition apparatus and device, and medium
WO2019148586A1 (en) Method and device for speaker recognition during multi-person speech
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
KR101616112B1 (en) Speaker separation system and method using voice feature vectors
CN109545196A (en) Audio recognition method, device and computer readable storage medium
US20190279644A1 (en) Speech processing device, speech processing method, and recording medium
WO2016173132A1 (en) Method and device for voice recognition, and user equipment
CN109524011A (en) A kind of refrigerator awakening method and device based on Application on Voiceprint Recognition
US20220238118A1 (en) Apparatus for processing an audio signal for the generation of a multimedia file with speech transcription
US20120078625A1 (en) Waveform analysis of speech
CN109215634A (en) A kind of method and its system of more word voice control on-off systems
CN112133277A (en) Sample generation method and device
CN110827853A (en) Voice feature information extraction method, terminal and readable storage medium
WO2020062679A1 (en) End-to-end speaker diarization method and system employing deep learning
WO2020052135A1 (en) Music recommendation method and apparatus, computing apparatus, and storage medium
Ronzhin et al. Speaker turn detection based on multimodal situation analysis
Sahoo et al. MFCC feature with optimized frequency range: An essential step for emotion recognition
JP2013182150A (en) Speech production section detector and computer program for speech production section detection
Nguyen et al. Vietnamese voice recognition for home automation using MFCC and DTW techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant