CN109326290A - Audio recognition method and device - Google Patents
- Publication number
- CN109326290A (application CN201811504208.1A)
- Authority
- CN
- China
- Prior art keywords
- user interface
- interface
- recognition result
- user
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/221—Announcement of recognition results
Abstract
The invention discloses a speech recognition method comprising the following steps: obtaining user interface content; registering the user interface content; and, when a user voice instruction is received, determining the recognition result of the instruction according to the registered content. The invention also discloses a speech recognition device. The method and device of the present invention can improve the precision of speech recognition during voice interaction and significantly enhance the user experience.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a speech recognition method and device.
Background art
As voice interaction technology matures, current products and methods based on voice interaction are prone to low recognition precision because of special cases such as polyphonic characters, multiple candidate results, and uncommon terms. For example, a user issues the voice instruction "Lotus Street" and expects it to be recognized as "Lotus Street", but the actual speech recognizer may return "Furong Street", which differs from the user's expectation and causes the problem of low recognition precision.
Summary of the invention
Through practice and experience, the inventor observed that a user's voice instructions are issued based on an application interface. The inventor further realized that, with the rapid development of voice interaction technology and the convenient services it provides, "what you see is what you can say" has become an irresistible trend. Under this trend, the operation of third-party applications is evolving from today's manual operation to operation by voice instruction, which will become mainstream. The inventor therefore conceived a new design to solve the above problem: register the content of the user interface of another application (such as an APP); a voice instruction issued against that interface can then be matched against the interface's word segments, and the matching result used as the recognition result. This improves the precision of speech recognition and enhances the user experience.
According to a first aspect of the present invention, a speech recognition method is provided, comprising the following steps:
obtaining user interface content;
registering the user interface content;
when a user voice instruction is received, determining the recognition result of the instruction according to the registered content.
According to a second aspect of the present invention, a speech recognition device is provided, comprising:
an interface content obtaining module for obtaining user interface content;
an interface content extraction module for registering the user interface content; and
a speech recognition module for determining, when a user voice instruction is received, the recognition result of the instruction according to the registered content.
According to a third aspect of the present invention, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can carry out the steps of the above method.
According to a fourth aspect of the present invention, a storage medium is provided, on which a computer program is stored; the program, when executed by a processor, implements the steps of the above method.
The method and device provided by the present invention can improve the precision of speech recognition during voice interaction and significantly enhance the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a speech recognition method according to another embodiment of the present invention;
Fig. 3 is a functional block diagram of a speech recognition device according to an embodiment of the present invention;
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on these embodiments, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
It should be noted that, in the absence of conflict, the embodiments of the present application and the features therein may be combined with each other.
The present invention may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. The present invention may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
In the present invention, terms such as "module", "device" and "system" refer to entities related to a computer, such as hardware, a combination of hardware and software, software, or software in execution. In particular, for example, an element may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. An application program or script running on a server, or the server itself, may also be an element. One or more elements may reside within a process and/or thread of execution, and an element may be localized on one computer and/or distributed between two or more computers, and may be operated through various computer-readable media. Elements may also communicate by local and/or remote processes according to a signal having one or more data packets, for example, a signal from data interacting with another element in a local system, in a distributed system, and/or across a network such as the internet.
Finally, it should be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, without necessarily requiring or implying any actual relationship or order between them. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, as well as elements inherent to the process, method, article or device concerned. In the absence of further restriction, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes that element.
The speech recognition method of the embodiments of the present invention can be applied to any terminal device provided with a voice function, for example, a smart phone, a tablet computer or a smart home device; the invention is not limited in this regard. It enables users to obtain prompt and accurate responses while using these terminal devices, improving the user experience.
The invention will now be described in further detail with reference to the accompanying drawings.
Fig. 1 schematically shows a flowchart of a speech recognition method according to an embodiment of the present invention. As shown in Fig. 1, the present embodiment includes the following steps:
Step S101: obtain user interface content. A user interface is the user interface of an APP installed on the terminal device; the content can be obtained through the API of each APP's user interface.
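As a rough illustration of step S101, the sketch below walks a hypothetical widget tree and collects every visible text label. The dict-based node format and the field names ("text", "children") are assumptions for illustration, not an actual platform API.

```python
# Hypothetical sketch of step S101: obtaining user interface content by
# walking an APP's widget tree. Node shape and field names are assumed.

def collect_ui_content(node):
    """Depth-first traversal gathering every text label on the interface."""
    texts = []
    if node.get("text"):
        texts.append(node["text"])
    for child in node.get("children", ()):
        texts.extend(collect_ui_content(child))
    return texts

# A toy interface with two tappable street names.
ui_tree = {
    "text": None,
    "children": [
        {"text": "Lotus Street"},
        {"text": "Furong Street"},
    ],
}
print(collect_ui_content(ui_tree))  # ['Lotus Street', 'Furong Street']
```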
Step S102: register the user interface content. Word segmentation is performed on the content of each user interface; word segmentation is a mature technique and may be implemented with reference to the prior art. For example, "hello little speed" contains "hello" and "little speed", extracted in units of phrases. The extracted segments are determined as the interface segments of that user interface. The interface segments are then registered in the recognition engine through the engine's registration interface.
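Step S102 can be sketched as follows. The whitespace tokenizer stands in for a real word-segmentation tool (Chinese interface text would need a dedicated segmenter), and the in-memory set stands in for the recognition engine's registration interface; both are assumptions for illustration.

```python
# Sketch of step S102: segment each user interface's content and register
# the resulting segments. The naive segmenter and the set-based "engine"
# are stand-ins only, not the patent's actual implementation.

registered_segments = set()

def segment(text):
    # Placeholder: real deployments would call a mature word-segmentation
    # library, especially for Chinese interface text.
    return text.split()

def register_interface_content(ui_texts):
    for text in ui_texts:
        for phrase in segment(text):
            registered_segments.add(phrase)

register_interface_content(["Lotus Street", "Now Playing"])
print(sorted(registered_segments))  # ['Lotus', 'Now', 'Playing', 'Street']
```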
Step S103: when a user voice instruction is received, determine the recognition result of the instruction according to the registered content. A specific implementation is as follows: perform speech recognition on the received user voice instruction (the recognition itself may follow the prior art) to obtain a first recognition result, where the first recognition result is the result produced by a traditional or existing speech recognition method. Match the first recognition result against the interface segments registered in the previous step by similarity. If the matching succeeds, the matched interface segment is used as the final recognition result. Illustratively, the user issues the voice instruction "Lotus Street", and existing speech recognition produces the first recognition result "Furong Street"; matching "Furong Street" against the registered interface segments by pronunciation finds the interface segment "Lotus Street", and in this case the final recognition result of the first recognition result is corrected to "Lotus Street". That is, when an interface segment with the same or similar pronunciation is matched, the interface segment is preferred as the recognition result. If the matching fails, the first recognition result produced by the existing speech recognition method is used as the final recognition result. Illustratively, if no similar interface segment is matched, the first recognition result "Furong Street" is used as the recognition result.
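The matching in step S103 can be sketched as below. A real system would compare pronunciations (e.g. pinyin for Chinese text); here `difflib` string similarity is used as a stand-in, and the 0.6 threshold is an arbitrary assumption.

```python
import difflib

# Sketch of step S103: correct the first ASR result against registered
# interface segments. String similarity stands in for a real
# pronunciation-based comparison; the 0.6 threshold is arbitrary.

def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def resolve(first_result, registered_segments, threshold=0.6):
    best = max(registered_segments,
               key=lambda seg: similarity(seg, first_result),
               default=None)
    if best is not None and similarity(best, first_result) >= threshold:
        return best          # prefer the interface segment on a match
    return first_result      # matching failed: keep the plain ASR output

segments = {"Lotus Street", "Now Playing"}
print(resolve("Lotos Street", segments))   # Lotus Street
print(resolve("Open settings", segments))  # Open settings
```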
In a scenario where a user conducts voice interaction on a user interface, the embodiment of the present invention can quickly and accurately match the user's speech recognition result, improving recognition precision and avoiding the problems caused by special cases such as polyphonic characters and rare words. It effectively realizes the goal of "what you see is what you can say", so that all user interface operations can be carried out through voice interaction, enriching the user's voice-interaction experience and making the operation of various user interfaces richer and friendlier.
Fig. 2 schematically shows a flowchart of a speech recognition method according to another embodiment of the present invention. As shown in Fig. 2, the present embodiment includes:
Step S201: obtain user interface content. For a specific implementation, refer to step S101.
Step S202: configure a user interface identifier for the obtained user interface content. A specific implementation is as follows: according to the APP to which the user interface belongs, configure for the obtained user interface content an identifier that uniquely identifies that user interface. The identifier may consist of a field identifying the APP and a field identifying the user interface. Illustratively, the playlist user interface of a music APP may be assigned the identifier MUSIC.PLAYLIST001, where MUSIC denotes the music APP, PLAYLIST denotes the playlist, and 001 denotes the first user interface of the playlist.
In a preferred embodiment, the field identifying the user interface may directly reuse the user interface ID of the music APP itself.
In some embodiments, the user interface identifier may also be an ID assigned to the interface by the terminal device's system or by the respective application software. As long as each user interface is effectively identified, the embodiment of the present invention imposes no specific limitation on the implementation.
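The identifier scheme of step S202 can be sketched as below, following the MUSIC.PLAYLIST001 example; the exact format (app field, interface field, zero-padded index) is only one possible convention, not mandated by the patent.

```python
# Sketch of step S202: composing a unique user interface identifier from a
# field identifying the APP and a field identifying the interface, per the
# MUSIC.PLAYLIST001 example. The zero-padded format is an assumption.

def make_ui_id(app: str, screen: str, index: int) -> str:
    return f"{app.upper()}.{screen.upper()}{index:03d}"

print(make_ui_id("music", "playlist", 1))  # MUSIC.PLAYLIST001
```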
Step S203: register the user interface content. The implementation is essentially the same as step S102, the difference being that, after word segmentation is performed on each user interface's content and the interface segments are determined, the interface segments are registered in the recognition engine together with an association to the corresponding user interface identifier; that is, the identifier of the user interface to which each interface segment belongs is also registered.
Step S204: when a user voice instruction is received, determine the recognition result of the instruction according to the registered content and the user interface identifier. A specific implementation is as follows: when a user voice instruction is received, obtain the user interface the user is currently on by calling the corresponding system interface, and determine its user interface identifier. Then, according to the identifier of the current user interface, obtain the registered interface segments of the current user interface. Perform speech recognition on the received user voice instruction to obtain a first recognition result. Match the first recognition result against the registered interface segments of the current user interface, and determine the final recognition result of the first recognition result according to the matching result.
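Steps S203 and S204 can be sketched together: segments are registered under a user interface identifier, and an incoming first recognition result is matched only against the segments of the current interface. `difflib` similarity again stands in for pronunciation matching, and all names and the 0.6 threshold are illustrative assumptions.

```python
import difflib

# Sketch of steps S203-S204: per-interface registration and matching.
registry = {}  # user interface identifier -> set of interface segments

def register(ui_id, segments):
    registry.setdefault(ui_id, set()).update(segments)

def recognize(first_result, current_ui_id, threshold=0.6):
    candidates = registry.get(current_ui_id, set())
    ratio = lambda seg: difflib.SequenceMatcher(None, seg, first_result).ratio()
    best = max(candidates, key=ratio, default=None)
    # Prefer the current interface's segment when it is close enough.
    if best is not None and ratio(best) >= threshold:
        return best
    return first_result

register("MUSIC.PLAYLIST001", {"Lotus Street", "Shuffle"})
print(recognize("Lotos Street", "MUSIC.PLAYLIST001"))  # Lotus Street
print(recognize("Lotos Street", "MAPS.HOME001"))       # Lotos Street
```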
According to the present embodiment, because user interface identifiers are available, the further check of the first recognition result during voice interaction is based on the current user interface. The method can therefore effectively tell whether a voice instruction was issued against the current user interface, match the recognition result more accurately, and achieve higher speech recognition precision.
Fig. 3 schematically shows a functional block diagram of a speech recognition device according to an embodiment of the present invention. As shown in Fig. 3, the speech recognition device includes: an interface content obtaining module 4, an interface content extraction module 5 and a speech recognition module 6.
The interface content obtaining module 4 is used to obtain user interface content; it may be implemented by connecting to the API of each APP's user interface and capturing the user interface content through that interface.
The interface content extraction module 5 is used to register the user interface content. It comprises a word segmentation extraction unit 501 and a segment registering unit 502. The word segmentation extraction unit 501 performs word segmentation on the user interface content and determines the interface segments. The segment registering unit 502 registers the interface segments in the recognition engine; it may register the interface segments alone, or register them together with the interface identifier. For a specific implementation, refer to the method described above.
The speech recognition module 6 is used to determine, when a user voice instruction is received, the recognition result of the instruction according to the registered content. It includes a recognition unit 601 and a recognition verification unit 602. The recognition unit 601 performs speech recognition on the received user voice instruction to obtain a first recognition result; an existing speech recognition module may be used. The recognition verification unit 602 matches the first recognition result against the registered interface segments and determines the final recognition result of the first recognition result according to the matching result. The first recognition result may be matched against all registered interface segments, or only against the registered interface segments of the current user interface. For a specific implementation, refer to the method described above.
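The modules of Fig. 3 can be wired together as in the sketch below. All class and method names are illustrative; the segmenter and ASR are stubbed out, and a case-insensitive exact match stands in for the verification unit's similarity matching.

```python
# Sketch of the Fig. 3 device: modules 4/5 ingest interface content,
# module 6 recognizes and verifies. Names and internals are assumed.

class SpeechRecognitionDevice:
    def __init__(self, asr, segmenter):
        self.asr = asr              # recognition unit 601 (existing ASR)
        self.segmenter = segmenter  # word segmentation extraction unit 501
        self.segments = set()       # segment registering unit 502's store

    def ingest_interface(self, ui_texts):
        # Interface content obtaining + extraction modules (4 and 5).
        for text in ui_texts:
            self.segments.update(self.segmenter(text))

    def recognize(self, audio):
        # Speech recognition module (6) with verification unit 602:
        # case-insensitive exact match as a stand-in for similarity.
        first_result = self.asr(audio)
        for seg in self.segments:
            if seg.lower() == first_result.lower():
                return seg
        return first_result

device = SpeechRecognitionDevice(asr=lambda audio: "lotus street",
                                 segmenter=lambda text: [text])
device.ingest_interface(["Lotus Street", "Shuffle"])
print(device.recognize(b"<pcm audio>"))  # Lotus Street
```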
For a voice instruction issued against a user interface, the device of the embodiment of the present invention performs speech recognition, further matches the result against the interface segments, and corrects the recognition result according to the matching result, thereby improving the precision of speech recognition. Moreover, because the device corrects results based on segments extracted from the user interface, it can effectively extend the interaction modes of various applications, so that voice interaction works in all of them, realizing "what you see is what you can say" and enhancing the user experience.
In some embodiments, the embodiment of the present invention provides a non-volatile computer-readable storage medium storing one or more programs containing executable instructions, which can be read and executed by an electronic device (including but not limited to a computer, a server or a network device) to carry out any of the above speech recognition methods of the present invention.
In some embodiments, the embodiment of the present invention also provides a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium; the computer program includes program instructions which, when executed by a computer, cause the computer to carry out any of the above speech recognition methods.
In some embodiments, the embodiment of the present invention also provides an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can carry out the speech recognition method.
In some embodiments, the embodiment of the present invention also provides a storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the speech recognition method.
The speech recognition device of the above embodiments of the present invention can be used to carry out the speech recognition method of the embodiments of the present invention, and accordingly achieves the technical effects achieved by that method; details are not repeated here. In the embodiments of the present invention, the relevant functional modules may be implemented by a hardware processor.
Fig. 4 is a schematic diagram of the hardware structure of an electronic device for carrying out the speech recognition method provided by another embodiment of the present application. As shown in Fig. 4, the device includes:
one or more processors 410 and a memory 420; one processor 410 is taken as the example in Fig. 4.
The device for carrying out the speech recognition method may also include an input means 430 and an output means 440. The processor 410, memory 420, input means 430 and output means 440 may be connected by a bus or in other ways; connection by a bus is taken as the example in Fig. 4.
As a non-volatile computer-readable storage medium, the memory 420 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the speech recognition method in the embodiments of the present application. By running the non-volatile software programs, instructions and modules stored in the memory 420, the processor 410 executes the various functional applications and data processing of the server, i.e., implements the speech recognition method of the above method embodiments.
The memory 420 may include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required for at least one function, and the data storage area may store data created according to the use of the speech recognition device, etc. In addition, the memory 420 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory 420 optionally includes memories remotely located relative to the processor 410; these remote memories may be connected to the speech recognition device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The input means 430 can receive input numeric or character information and generate signals related to the user settings and function control of the speech recognition device. The output means 440 may include a display device such as a display screen.
The above one or more modules are stored in the memory 420 and, when executed by the one or more processors 410, carry out the speech recognition method in any of the above method embodiments.
The above product can carry out the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects of carrying out the method. For technical details not described in detail in this embodiment, refer to the method provided by the embodiments of the present application.
The electronic devices of the embodiments of the present application exist in various forms, including but not limited to:
(1) Mobile communication devices: such devices are characterized by mobile communication functions, with voice and data communication as the main goal. This type of terminal includes smart phones (e.g. iPhone), multimedia phones, functional phones, low-end phones, etc.
(2) Ultra-mobile personal computer devices: such devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile internet access. This type of terminal includes PDA, MID and UMPC devices, e.g. iPad.
(3) Portable entertainment devices: such devices can display and play multimedia content. This type of device includes audio and video players (e.g. iPod), handheld devices, e-books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Servers: devices providing computing services. A server consists of a processor, hard disk, memory, system bus, etc.; its architecture is similar to that of a general-purpose computer, but because highly reliable services are required, the demands on processing capability, stability, reliability, security, scalability, manageability, etc. are higher.
(5) Other electronic devices with data interaction functions.
The device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units. They may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Through the above description of the embodiments, a person skilled in the art can clearly understand that each embodiment may be implemented by software plus a general hardware platform, or of course by hardware. Based on this understanding, the above technical solution, or the part of it that contributes beyond the related art, may essentially be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk or an optical disk, and includes instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to carry out the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person skilled in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of the technical features equivalently replaced; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (10)
1. A speech recognition method, characterized by comprising:
obtaining user interface content;
registering the user interface content;
when a user voice instruction is received, determining the recognition result of the instruction according to the registered content.
2. The method according to claim 1, characterized in that registering the user interface content comprises:
performing word segmentation on the user interface content to determine interface segments;
registering the interface segments in a recognition engine.
3. The method according to claim 2, characterized in that determining, when a user voice instruction is received, the recognition result of the instruction according to the registered content comprises:
performing speech recognition on the received user voice instruction to obtain a first recognition result;
matching the first recognition result against the registered interface segments, and determining the final recognition result of the first recognition result according to the matching result.
4. The method according to claim 1, characterized in that the method further comprises:
configuring a user interface identifier for the obtained user interface content;
and registering the user interface content comprises:
performing word segmentation on each user interface's content to determine interface segments;
registering the interface segments in a recognition engine and associating them with the corresponding user interface identifier.
5. The method according to claim 4, characterized in that determining, when a user voice instruction is received, the recognition result of the instruction according to the registered content comprises:
when a user voice instruction is received, obtaining the user interface the user is currently on;
obtaining the registered interface segments of the current user interface according to the identifier of the current user interface;
performing speech recognition on the received user voice instruction to obtain a first recognition result;
matching the first recognition result against the registered interface segments of the current user interface, and determining the final recognition result of the first recognition result according to the matching result.
6. A speech recognition device, comprising:
an interface content acquisition module configured to acquire user interface content;
an interface content extraction module configured to register the user interface content; and
a speech recognition module configured to, when a user voice instruction is received, determine a recognition result of the user voice instruction according to the registered content.
7. The device according to claim 6, wherein the interface content extraction module comprises:
a segment extraction unit configured to perform word-segmentation extraction on the user interface content to determine interface segments; and
a segment registration unit configured to register the interface segments in a recognition engine.
8. The device according to claim 7, wherein the speech recognition module comprises:
a recognition unit configured to perform speech recognition on the received user voice instruction to obtain a first recognition result; and
a recognition verification unit configured to match the first recognition result against the registered interface segments and determine a final recognition result of the first recognition result according to the matching result.
9. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the method according to any one of claims 1-5.
10. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811504208.1A CN109326290A (en) | 2018-12-10 | 2018-12-10 | Audio recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109326290A (en) | 2019-02-12
Family
ID=65256274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811504208.1A Pending CN109326290A (en) | 2018-12-10 | 2018-12-10 | Audio recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109326290A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102832A (en) * | 2020-09-18 | 2020-12-18 | 广州小鹏汽车科技有限公司 | Speech recognition method, speech recognition device, server and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105513594A (en) * | 2015-11-26 | 2016-04-20 | 许传平 | Voice control system |
US20170017464A1 (en) * | 2006-11-03 | 2017-01-19 | Philippe Roy | Layered contextual configuration management system and method and minimized input speech recognition user interface interactions experience |
CN106710598A (en) * | 2017-03-24 | 2017-05-24 | 上海与德科技有限公司 | Voice recognition method and device |
CN107154262A (en) * | 2017-05-31 | 2017-09-12 | 北京安云世纪科技有限公司 | Voice operation method, device and mobile terminal
CN107608652A (en) * | 2017-08-28 | 2018-01-19 | 三星电子(中国)研发中心 | Method and apparatus for voice control of a graphical interface
2018
- 2018-12-10 CN CN201811504208.1A patent/CN109326290A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170017464A1 (en) * | 2006-11-03 | 2017-01-19 | Philippe Roy | Layered contextual configuration management system and method and minimized input speech recognition user interface interactions experience |
CN105513594A (en) * | 2015-11-26 | 2016-04-20 | 许传平 | Voice control system |
CN106710598A (en) * | 2017-03-24 | 2017-05-24 | 上海与德科技有限公司 | Voice recognition method and device |
CN107154262A (en) * | 2017-05-31 | 2017-09-12 | 北京安云世纪科技有限公司 | Voice operation method, device and mobile terminal
CN107608652A (en) * | 2017-08-28 | 2018-01-19 | 三星电子(中国)研发中心 | Method and apparatus for voice control of a graphical interface
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112102832A (en) * | 2020-09-18 | 2020-12-18 | 广州小鹏汽车科技有限公司 | Speech recognition method, speech recognition device, server and computer-readable storage medium |
CN112102832B (en) * | 2020-09-18 | 2021-12-28 | 广州小鹏汽车科技有限公司 | Speech recognition method, speech recognition device, server and computer-readable storage medium |
Similar Documents
Publication | Title
---|---
KR102660922B1 (en) | Management layer for multiple intelligent personal assistant services
US11568876B2 | Method and device for user registration, and electronic device
US10043520B2 (en) | Multilevel speech recognition for candidate application group using first and second speech commands
CN109637548A (en) | Voice interaction method and device based on voiceprint recognition
US11188289B2 (en) | Identification of preferred communication devices according to a preference rule dependent on a trigger phrase spoken within a selected time from other command data
CN109471678A (en) | Voice midpoint controlling method and device based on image recognition
US20160240195A1 (en) | Information processing method and electronic device
US11610590B2 (en) | ASR training and adaptation
CN109741755A (en) | Voice wake-up word threshold management device and method for managing a voice wake-up word threshold
US11587568B2 (en) | Streaming action fulfillment based on partial hypotheses
CN109192212B (en) | Voice control method and device
WO2014117645A1 (en) | Information identification method and apparatus
WO2020119569A1 (en) | Voice interaction method, device and system
US20180190295A1 (en) | Voice recognition
JP7342286B2 (en) | Voice function jump method for human-machine interaction, electronic device and storage medium
US20170286049A1 (en) | Apparatus and method for recognizing voice commands
WO2017166651A1 (en) | Voice recognition model training method, speaker type recognition method and device
CN112652302B (en) | Voice control method, device, terminal and storage medium
CN111341315B (en) | Voice control method, device, computer equipment and storage medium
US20230133146A1 (en) | Method and apparatus for determining skill field of dialogue text
CN109559749A (en) | Combined decoding method and system for speech recognition system
CN109686370A (en) | Method and device for playing a "fight the landlord" card game based on voice control
WO2018094952A1 (en) | Content recommendation method and apparatus
CN109326290A (en) | Audio recognition method and device
CN110874176B (en) | Interaction method, storage medium, operating system and device
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| CB02 | Change of applicant information | Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant after: Sipic Technology Co.,Ltd. Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province. Applicant before: AI SPEECH Co.,Ltd.
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190212