CN109473111A

CN109473111A - Voice enabling apparatus and method

Info

Publication number: CN109473111A
Application number: CN201811644724.4A
Authority: CN
Inventors: 雷雄国; 涂长宇; 郑炜乔; 郭彭亮; 刘强; 何家锋; 徐瑞婷; 卢玉环
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2018-12-29
Filing date: 2018-12-29
Publication date: 2019-03-15
Anticipated expiration: 2038-12-29
Also published as: CN109473111B

Abstract

The present invention discloses a voice enabling apparatus comprising: a sound source acquisition module for acquiring audio data and outputting it to a speech processing module; the speech processing module, for processing the audio data to generate first audio data and second audio data; and a data transmission module for exchanging data with an external device and outputting the first audio data and the second audio data to the external device connected to it. The invention also discloses a method of voice-enabling a device using the apparatus. The apparatus and method can give voice interaction capability to a host device that has no speech recognition function, overcome the noise-handling problems of prior-art speech recognition, and improve speech recognition results, while reducing power consumption and not occupying host resources.

Description

Voice enabling apparatus and method

Technical field

The present invention relates to the technical field of voice interaction, and in particular to a voice enabling apparatus and method.

Background art

With the development of science and technology, smart devices are increasingly common, yet most smart devices currently on the market have no voice interaction capability. Devices that do offer voice interaction are mostly designed for near-field pickup or simple single-turn dialogue. Their noise handling and speech recognition accuracy during voice interaction are poor, and they cannot cancel the audio played by the host device itself, so far-field speech signal processing cannot be achieved.

On the other hand, voice interaction on most devices runs entirely on the host device, which affects power consumption and usually prevents low-power requirements from being met. Most front-end signal processing is also performed on the host device, which occupies considerable system resources and reduces system operating efficiency.

Summary of the invention

In view of the above problems, the present invention aims to provide a technical solution that enables far-field voice interaction for a host device, and in particular a solution that conveniently extends a host device with far-field voice interaction without changing the structure of the host device.

According to a first aspect of the invention, a voice enabling apparatus is provided, comprising:

a sound source acquisition module for acquiring audio data and outputting it to the speech processing module described below;

a speech processing module for processing the audio data to generate first audio data; and

a data transmission module for exchanging data with an external device and outputting the first audio data to the external device connected to it.

According to a second aspect of the invention, a method of voice enabling by means of a voice enabling apparatus is provided, comprising the following steps:

the voice enabling apparatus is connected to a host device through the data transmission module;

the voice enabling apparatus acquires audio data and processes the audio data to generate first audio data and second audio data;

the voice enabling apparatus outputs the first audio data and the second audio data to the host device.

The apparatus and method provided by the present invention can give voice interaction capability to a host device that has no speech recognition function. The apparatus communicates directly with the host device through the data transmission module to acquire and process audio information, so that the connected host device easily gains far-field voice interaction capability and its voice functions are extended with very little effort. In addition, the apparatus and method provided by the embodiments of the present invention perform front-end signal processing on the audio data themselves, which overcomes the prior-art problems of increased power consumption and resource occupation caused by performing front-end signal processing on the host device.

Brief description of the drawings

Fig. 1 is a functional block diagram of a voice enabling apparatus according to one embodiment of the present invention;

Fig. 2 is a functional block diagram of a voice enabling apparatus according to another embodiment of the present invention;

Fig. 3 is a flow chart of a method of voice enabling by means of a voice enabling apparatus according to one embodiment of the present invention.

Detailed description of embodiments

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that, where no conflict arises, the embodiments of the present application and the features in the embodiments may be combined with each other.

The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

In the present invention, terms such as "module", "device" and "system" refer to computer-related entities, such as hardware, a combination of hardware and software, software, or software in execution. Specifically, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable element, a thread of execution, a program, and/or a computer. An application or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be run from various computer-readable media. Elements may also communicate through local and/or remote processes by means of a signal having one or more data packets, for example a signal carrying data from one element that interacts with another element in a local system or a distributed system, and/or that interacts with other systems across a network such as the Internet.

Finally, it should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise" and "include" cover not only the listed elements but also other elements not explicitly listed, as well as elements inherent to the process, method, article or device. Without further limitation, an element defined by the phrase "comprising ..." does not exclude the presence of additional identical elements in the process, method, article or device that comprises the element.

The invention will now be described in further detail with reference to the accompanying drawings.

Fig. 1 schematically shows a functional block diagram of a voice enabling apparatus according to one embodiment of the present invention. As shown in Fig. 1, the voice enabling apparatus includes a sound source acquisition module 1, a speech processing module 2 and a data transmission module 3.

The sound source acquisition module 1 acquires audio data and outputs it to the speech processing module 2. By way of example, the module is implemented as multiple microphones, in particular movable directional microphones, which allows the sound source to be located. The user can issue voice interaction instructions, such as "I want to record", directly toward the module, thereby achieving far-field pickup. Because the microphones are movable, their orientation can be adjusted so that sound from the source direction is enhanced and noise from other angles is attenuated, which ensures audio quality.

The speech processing module 2 processes the audio data to generate first audio data and second audio data. The first audio data is the voice instruction issued by the user. The second audio data is the wake-up control signal, that is, content data related to the wake-up result; the apparatus can perform voice wake-up on the voice instruction issued by the user to obtain the wake-up control signal. In other implementation examples, the voice enabling apparatus may be configured without voice wake-up recognition, so that it generates only the first audio data, that is, it performs only front-end signal processing.

The data transmission module 3 exchanges data with an external device and outputs the first audio data and the second audio data to the external device connected to it, so that a host device without a voice interaction function can realize voice interaction based on the first audio data and the second audio data. The data transmission module 3 supports at least one of the USB protocol, the Bluetooth protocol and the WiFi protocol; by way of example, it may be implemented as a USB interface. Where the voice enabling apparatus performs only front-end signal processing, the data transmission module 3 outputs only the first audio data to the connected external device.

The sound source acquisition module 1 includes a first sound source acquisition component 101 and a second sound source acquisition component 102. The first sound source acquisition component 101 acquires sound source audio data; the second sound source acquisition component 102 acquires reference audio data. By way of example, the first sound source acquisition component 101 and the second sound source acquisition component 102 are implemented as two movable microphones that capture the incoming speech as 16 kHz/16-bit audio. To acquire sound source audio data, the user can speak directly toward the two movable microphones, and the first sound source acquisition component 101 records the sound source audio. The reference audio data is mainly the background sound of the connected host device. A movable microphone can be placed directly next to the sound outlet of the host device (for example its loudspeaker), or rotated toward the direction of a sound that is to be shielded, so that the audio played by the host device, or the sound from the shielded direction, is collected as the reference audio data. The two acquired audio signals are passed to the speech processing module 2.
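As a rough illustration of what 16 kHz/16-bit capture implies for the data link, the arithmetic can be sketched as follows. This is a minimal sketch, not part of the disclosed apparatus; the function names and the choice of a 10 ms frame are assumptions made only for the example.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz capture, as stated in the description
BITS_PER_SAMPLE = 16      # 16-bit samples, as stated in the description


def bytes_per_second(channels, sample_rate=SAMPLE_RATE_HZ, bits=BITS_PER_SAMPLE):
    """Raw PCM byte rate for the given number of channels."""
    return channels * sample_rate * bits // 8


def frame_bytes(frame_ms, channels):
    """Size in bytes of one frame of frame_ms milliseconds (frame length is an assumption)."""
    return bytes_per_second(channels) * frame_ms // 1000


if __name__ == "__main__":
    # One source microphone plus one reference microphone.
    print(bytes_per_second(2))
    print(frame_bytes(10, 2))
```

Two microphones at this format produce a modest raw data rate, which is consistent with the description's suggestion that a USB, Bluetooth or WiFi link can carry the audio.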

The speech processing module 2 includes a noise elimination unit 201 and a beamforming unit 203.

The noise elimination unit 201 denoises the sound source audio data according to the sound source audio data and the reference audio data, which improves the speech recognition result, yields more accurate recognition, and overcomes the background sound interference found in the prior art.

The beamforming unit 203 performs beamforming on the denoised sound source audio data, thereby filtering the denoised sound source audio data to obtain clean first audio data for output to the external device.

The noise elimination unit 201 mainly applies DSP (digital signal processing) noise reduction technology and includes an analog-to-digital conversion component 2011, an echo cancellation component 2012 and a digital-to-analog conversion component 2013. The analog-to-digital conversion component 2011 performs analog-to-digital conversion on the sound source audio data and the reference audio data; it contains a circuit capable of analog-to-digital conversion and generates digital signals in the manner known from the prior art. The echo cancellation component 2012 performs a subtraction on the digital signals generated by the analog-to-digital conversion component: the digital signal corresponding to the reference audio data is subtracted from the digital signal corresponding to the sound source audio data, and the result is the denoised sound source digital signal. The digital-to-analog conversion component 2013 performs digital-to-analog conversion on the denoised sound source digital signal to generate the denoised sound source audio data. Working together, these components produce audio data from which the reference sound has been removed.
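The subtraction step performed by the echo cancellation component can be sketched as below. This is a minimal sketch under stated assumptions: both signals are already digitized as 16-bit integer samples and are time-aligned, and the `gain` parameter and the clamping are illustrative additions not mentioned in the description. Practical echo cancellers usually rely on adaptive filtering rather than a plain sample-wise subtraction.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def subtract_reference(source, reference, gain=1.0):
    """Subtract the reference signal from the source signal, sample by sample.

    source, reference: equal-length sequences of 16-bit integer samples.
    gain: hypothetical scale factor applied to the reference (assumption).
    """
    if len(source) != len(reference):
        raise ValueError("source and reference must have the same length")
    denoised = []
    for s, r in zip(source, reference):
        d = s - int(round(gain * r))
        # Clamp so the result stays within the 16-bit sample range.
        d = max(INT16_MIN, min(INT16_MAX, d))
        denoised.append(d)
    return denoised


if __name__ == "__main__":
    mic = [100, 200, -300]     # user speech mixed with host playback
    ref = [10, 20, -30]        # host playback picked up near the loudspeaker
    print(subtract_reference(mic, ref))
```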

The beamforming unit 203 can be implemented with reference to the prior art, so its implementation is not described further here.

This embodiment can give voice interaction capability to host devices that lack it, and performs front-end signal processing such as denoising and filtering on the acquired voice instructions and content of the user, so that better speech recognition results are obtained. In addition, the apparatus of this embodiment is simply attached externally to a device to provide far-field pickup. Its design integrating multiple microphones facilitates sound source localization, so that sound from the source direction is enhanced and noise from other angles is attenuated, ensuring audio quality. For background sound emitted by the host device, a microphone can be placed close to the sound outlet of the host device so that the audio played by the host device, or sound from a direction to be shielded, is collected as the reference sound and echo cancellation is performed. Such interfering sources are thereby suppressed, which optimizes the audio.

In addition, functions such as front-end signal processing and voice wake-up are integrated into a hardware chip, so that the system resources of the host device are no longer occupied. In terms of power consumption, the speech algorithms can be substantially optimized on a dedicated speech chip, so that low-power requirements are met.

Fig. 2 is a functional block diagram of a voice enabling apparatus according to another embodiment of the present invention. As shown in Fig. 2, the speech processing module 2 of the voice enabling apparatus further includes a wake-up verification unit 202 and a second audio data generation unit 205.

The wake-up verification unit 202 performs wake-up recognition on the denoised sound source audio data and generates a wake-up control signal and a wake-up angle. Wake-up recognition parses the speech content of the denoised sound source audio, or its corresponding semantic interpretation, and recognizes from the semantics the wake-up word the user intends to express; it can be implemented with reference to the prior art. The wake-up angle is a parameter that the inventors add on top of the semantic parsing. It may be obtained as follows: at the sound acquisition stage, multiple microphones form a microphone acquisition array, and the data acquired by the microphones is delivered simultaneously to the wake-up verification unit 202. Using a wake-up speech algorithm, the unit determines the sound source point from the propagation delays and energy distribution of the audio received at the different microphones. Because every audio frame carries a sound localization, determining the sound source point during wake-up verification yields the sound localization result, which is output as the wake-up angle. Determining the delay and energy distribution of the audio with a speech algorithm can be achieved with the prior art.
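The description leaves the localization algorithm to the prior art. One common way to turn an inter-microphone delay into an angle is sketched below, assuming a two-microphone, far-field model: the delay is estimated by brute-force cross-correlation and converted to an arrival angle relative to the broadside of the microphone pair. The function names, the microphone spacing and the speed of sound are assumptions for illustration only and are not taken from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s at room temperature (assumption)


def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag in samples by which sig_b trails sig_a.

    The lag that maximizes the cross-correlation is chosen; a positive
    value means the sound reached microphone A first.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, a in enumerate(sig_a):
            j = i + lag
            if 0 <= j < len(sig_b):
                score += a * sig_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def wake_angle(lag_samples, sample_rate, mic_spacing_m):
    """Convert a delay in samples to an arrival angle in degrees (far-field model)."""
    tau = lag_samples / sample_rate
    x = SPEED_OF_SOUND * tau / mic_spacing_m
    x = max(-1.0, min(1.0, x))  # guard against delays beyond the physical limit
    return math.degrees(math.asin(x))


if __name__ == "__main__":
    mic_a = [0, 0, 1, 0, 0, 0, 0, 0]
    mic_b = [0, 0, 0, 0, 1, 0, 0, 0]   # same pulse, two samples later
    lag = estimate_delay(mic_a, mic_b, max_lag=3)
    print(lag, wake_angle(lag, 16_000, 0.05))
```

A real array would typically use more than two microphones and a more robust correlation measure, and could combine the delay estimate with the energy distribution mentioned in the description.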

Preferably, in this embodiment of the invention, the speech processing module further includes a first audio data generation unit 204. In this case, the beamforming unit 203 performs beamforming on the denoised sound source audio data and generates three audio streams, that is, three channels of 16 kHz audio output. The first audio data generation unit 204 processes the three audio streams generated by the beamforming unit 203 to produce the first audio data output. Which stream is taken as the first audio data depends on the wake-up angle indicated by sound source localization; this wake-up angle is output together with the wake-up result during wake-up processing.
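The selection of one beamformed stream by wake-up angle can be sketched as follows. This is a minimal sketch assuming that each of the three streams is associated with a nominal steering angle and that the stream whose angle is closest to the wake-up angle is chosen; the steering angles (0, 120 and 240 degrees) and the function name are hypothetical, since the description does not state how the beams are oriented.

```python
def select_beam(beams, wake_angle_deg):
    """Pick the beamformed stream whose steering angle is closest to the wake-up angle.

    beams: dict mapping steering angle in degrees to that beam's audio samples.
    Returns (steering_angle, samples).
    """
    def angular_distance(a, b):
        d = abs(a - b) % 360
        return min(d, 360 - d)   # wrap around so 350 degrees is close to 0 degrees

    best = min(beams, key=lambda angle: angular_distance(angle, wake_angle_deg))
    return best, beams[best]


if __name__ == "__main__":
    beams = {0: [1, 1], 120: [2, 2], 240: [3, 3]}   # hypothetical steering angles
    print(select_beam(beams, 100))
    print(select_beam(beams, 350))
```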

The second audio data contains the wake-up control signal. The signal is passed directly to the second audio data generation unit 205, which processes the wake-up control signal generated by the wake-up verification unit 202 (converting the digital signal to audio) and likewise generates audio, here at 48 kHz, as the second audio data output.

The first audio data and the second audio data are transmitted to the application layer of the host device through the driver of the data transmission module 3. Having obtained the two audio inputs, the application layer splits the first audio data into three audio parts A, B and C, stores them in a circular queue, and rolls back through the queue based on OneShot. It continuously monitors the second audio data for a wake-up signal. When a wake-up signal is detected, the application layer determines, according to the beamforming unit 203, which of the audio streams A, B and C the wake-up signal corresponds to and takes that stream as the recognition object, so that the recognition object is matched with the wake-up signal and voice interaction is realized.
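The host-side buffering described above can be sketched as follows. This is a minimal sketch that assumes the circular queue holds a fixed number of recent frames per stream and that a OneShot-style rollback means handing the recognizer the most recent buffered frames of the selected stream once a wake-up signal arrives. The class name, method names and frame counts are hypothetical and not taken from the patent.

```python
from collections import deque

STREAMS = ("A", "B", "C")


class HostAudioRouter:
    """Keeps a bounded history of the three audio streams on the host side."""

    def __init__(self, history_frames=50):
        # deque(maxlen=...) drops the oldest frame automatically, acting as a circular queue.
        self.queues = {name: deque(maxlen=history_frames) for name in STREAMS}

    def push(self, frame_a, frame_b, frame_c):
        """Store one new frame for each of the streams A, B and C."""
        for name, frame in zip(STREAMS, (frame_a, frame_b, frame_c)):
            self.queues[name].append(frame)

    def on_wake(self, stream, rollback_frames):
        """Return the latest rollback_frames frames of the stream matched to the wake-up signal."""
        history = list(self.queues[stream])
        count = min(rollback_frames, len(history))
        return history[len(history) - count:]


if __name__ == "__main__":
    router = HostAudioRouter(history_frames=3)
    for i in range(1, 5):
        router.push(f"a{i}", f"b{i}", f"c{i}")
    print(router.on_wake("B", 2))
```

Keeping a short history lets speech that immediately follows, or overlaps, the wake word be passed to recognition instead of being lost while the wake-up signal is being detected.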

This embodiment can give voice interaction capability to a host device that has no speech recognition function, overcomes the noise-handling problems of prior-art speech recognition, and improves speech recognition results. Moreover, functions such as front-end signal processing and voice wake-up are integrated into a hardware chip, so that the system resources of the host device are no longer occupied. In terms of power consumption, the speech algorithms can be substantially optimized on a dedicated speech chip, so that low-power requirements are met.

Fig. 3 schematically shows a flow chart of a voice enabling method using the voice enabling apparatus according to an embodiment of the present invention. As shown in Fig. 3, this embodiment includes the following steps:

Step S301: The voice enabling apparatus is connected to the host device through the data transmission module. The connection with the host device may be established via the USB protocol, the Bluetooth protocol, the WiFi protocol, or the like, so the apparatus supports multiple types of host device.

Step S302: The voice enabling apparatus acquires audio data and processes it to generate first audio data and second audio data. The audio data acquired by the voice enabling apparatus includes sound source audio data and reference audio data. Specifically, the sound source audio data is denoised according to the sound source audio data and the reference audio data, using DSP noise reduction technology. To simplify the denoising computation, the sound source audio data and the reference audio data are first each converted into digital signals, a subtraction is performed on the converted digital signals, and the digital signal obtained from the subtraction is converted back into an analog signal, yielding the denoised sound source audio data. This improves the quality of the voice interaction.

Beamforming is then performed on the denoised sound source audio data to generate the first audio data; at the same time, wake-up recognition is performed on the denoised sound source audio data to generate the second audio data. During beamforming, audio is also selected according to the wake-up angle. Specifically, because the sound source acquisition module 1 includes multiple microphones, the beamforming algorithm produces multiple audio streams, each being enhanced audio for a different angle. Which stream is output as the first audio data depends on the wake-up angle indicated by sound source localization; this wake-up angle is output together with the wake-up result during wake-up processing. For the specific implementation, refer to the operating principle of the apparatus of Fig. 2.

Step S303: The voice enabling apparatus outputs the first audio data and the second audio data to the host device. The data may be transmitted in the manner described for step S301; in a specific implementation, multiple interfaces adapted to multiple types of host device may be provided on the voice enabling apparatus.

This method can give voice interaction capability to a host device that has no speech recognition function, overcomes the noise-handling problems of prior-art speech recognition, improves speech recognition results, and reduces the power consumption of the host device without occupying its resources.

Taking a television set as the external host device, the voice enabling apparatus of the invention is applied to the television set to realize far-field pickup for the television as follows:

First, the user mounts the voice enabling apparatus on top of the television set, making sure that the microphone array of its main body faces the direction from which the user habitually speaks, that there is no major obstruction in between, and that it is kept level as far as possible. Next, the USB cable of the voice enabling apparatus is plugged into the port at the rear of the television set to provide power and signal transmission. Then the microphone array of the voice enabling apparatus is fixed near the loudspeaker of the television set, for example by adhesive. This completes the installation of the voice enabling apparatus.

In use, the voice enabling apparatus picks up the sound uttered by the user through a microphone (the first sound source acquisition component 101 described above), and picks up the sound produced by the television set itself through the microphone attached near the television loudspeaker (the second sound source acquisition component 102 described above). The voice enabling apparatus compares the two groups of acquired sound and filters out the self-produced sound, obtaining the instruction sound actively uttered by the user so that further signal processing can be carried out. For the subsequent processing, see the method described above.

Transmitting the audio externally in this way avoids the system debugging work that a software loopback at the system layer would require in order to pass audio, and also avoids the dependence of a hardware loopback on terminal-specific and system-specific adaptation work. At the same time, the real interference produced by the device's own sound is captured more faithfully, which avoids the problems caused when the played sound and the voice signal fall out of sync because of the power amplifier, loudspeaker and other stages of the playback chain.

From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes over the related art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.

Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present application and do not limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

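1. A voice enabling apparatus, characterized by comprising:

a sound source acquisition module for acquiring audio data and outputting it to the speech processing module described below;

a speech processing module for processing the audio data to generate first audio data; and

a data transmission module for exchanging data with an external device and outputting the first audio data to the external device connected to it.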

2. The apparatus according to claim 1, characterized in that the sound source acquisition module includes a first sound source acquisition component for acquiring sound source audio data, and

a second sound source acquisition component for acquiring reference audio data;

and the speech processing module includes

a noise elimination unit for denoising the sound source audio data according to the sound source audio data and the reference audio data; and

a beamforming unit for performing beamforming on the denoised sound source audio data to generate the first audio data output.

3. The apparatus according to claim 2, wherein the noise elimination unit includes

an analog-to-digital conversion component for performing analog-to-digital conversion on the sound source audio data and the reference audio data to generate digital signals;

an echo cancellation component for performing a subtraction on the digital signals generated by the analog-to-digital conversion component to obtain a denoised sound source digital signal; and

a digital-to-analog conversion component for performing digital-to-analog conversion on the denoised sound source digital signal to generate the denoised sound source audio data.

4. The apparatus according to claim 3, characterized in that the speech processing module also generates second audio data, which is output to the external device through the data transmission module, and the speech processing module further includes

a wake-up verification unit for performing wake-up recognition on the denoised sound source audio data to generate a wake-up control signal; and

a second audio data generation unit for processing the wake-up control signal generated by the wake-up verification unit to generate the second audio data output.

5. The apparatus according to any one of claims 2 to 4, characterized in that the first sound source acquisition component and the second sound source acquisition component are implemented as at least two movable microphones.

6. The apparatus according to claim 5, wherein the data transmission module supports at least one of the USB protocol, the WiFi protocol and the Bluetooth protocol.

7. A method of voice enabling by means of the voice enabling apparatus according to claim 4, characterized by comprising the following steps:

the voice enabling apparatus is connected to a host device through the data transmission module;

the voice enabling apparatus acquires audio data and processes the audio data to generate first audio data and second audio data;

the voice enabling apparatus outputs the first audio data and the second audio data to the host device.

8. The method according to claim 7, characterized in that the audio data acquired by the voice enabling apparatus includes sound source audio data and reference audio data, and the processing of the audio data by the voice enabling apparatus includes:

denoising the sound source audio data according to the sound source audio data and the reference audio data;

performing beamforming on the denoised sound source audio data to generate the first audio data;

performing wake-up recognition on the denoised sound source audio data to generate the second audio data.

9. The method according to claim 8, characterized in that the acquisition of sound source audio data by the voice enabling apparatus is implemented as follows:

the first sound source acquisition component of the voice enabling apparatus is arranged facing the direction from which the user habitually speaks, and the sound source audio is picked up by the first sound source acquisition component;

and the acquisition of reference audio data by the voice enabling apparatus is implemented as follows:

the second sound source acquisition component of the voice enabling apparatus is fixed near the loudspeaker of the host device, and the reference audio of the host device is picked up by the second sound source acquisition component.

10. The method according to claim 9, characterized in that denoising the sound source audio data according to the sound source audio data and the reference audio data includes:

converting the sound source audio data and the reference audio data into digital signals respectively;

performing a subtraction on the converted digital signals;

converting the digital signal obtained from the subtraction into an analog signal to obtain the denoised sound source audio data.