CN108932943A - Command-word speech detection method, apparatus, device and storage medium - Google Patents

Command-word speech detection method, apparatus, device and storage medium Download PDF

Info

Publication number
CN108932943A
CN108932943A (application CN201810764304.3A)
Authority
CN
China
Prior art keywords
order word
word sound
target
starting point
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810764304.3A
Other languages
Chinese (zh)
Inventor
雷延强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Original Assignee
Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Shiyuan Electronics Thecnology Co Ltd filed Critical Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority to CN201810764304.3A priority Critical patent/CN108932943A/en
Publication of CN108932943A publication Critical patent/CN108932943A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/05 — Word boundary detection
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/08 — Speech classification or search
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 — Detection of presence or absence of voice signals
    • G10L25/87 — Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a command-word speech detection method, apparatus, device and storage medium. The method comprises: determining the target start point and target end point of preprocessed command-word speech using a deep neural network model, and determining an effective command-word speech segment from the target start point and the target end point; determining phoneme classification results within the effective command-word speech segment using the same deep neural network model, and determining the command-word output result from those phoneme classification results. Detection of command-word speech endpoints is more accurate, and determining the target start and end points adds no extra computational complexity.

Description

Command-word speech detection method, apparatus, device and storage medium
Technical field
The present invention relates to speech recognition technology, and in particular to a command-word speech detection method, apparatus, device and storage medium.
Background art
In a speech recognition system, the input audio signal contains both speech and ambient noise. Finding the speech segments within the input audio signal is called speech endpoint detection, start/end point detection, or voice activity detection (VAD). The purpose of VAD is to identify and eliminate long silent periods in the audio signal. How accurately the speech endpoints are detected directly affects the performance of the speech recognition system.
In the course of implementing the present invention, the inventor found at least the following problems in the prior art: speech endpoint detection is inaccurate, and it is limited in function, identifying only the start and end of speech.
Summary of the invention
Embodiments of the present invention provide a command-word speech detection method, apparatus, device and storage medium, so that command-word speech endpoints are recognized more accurately and determining the target start and end points adds no extra computational complexity.
In a first aspect, an embodiment of the invention provides a command-word speech detection method, comprising:
determining the target start point and target end point of preprocessed command-word speech using a deep neural network model, and determining an effective command-word speech segment from the target start point and the target end point;
determining phoneme classification results within the effective command-word speech segment using the deep neural network model, and determining the command-word output result from the phoneme classification results.
In a second aspect, an embodiment of the invention further provides a command-word speech endpoint detection apparatus, comprising:
a determining module, configured to determine the target start point and target end point of preprocessed command-word speech using a deep neural network model, and to determine an effective command-word speech segment from the target start point and the target end point;
an output module, configured to determine phoneme classification results within the effective command-word speech segment using the deep neural network model, and to determine the command-word output result from the phoneme classification results.
In a third aspect, an embodiment of the invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements any command-word speech detection method of the embodiments of the invention.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any command-word speech detection method described in the embodiments of the invention.
In embodiments of the invention, a deep neural network model determines the target start point and target end point of preprocessed command-word speech, and an effective command-word speech segment is determined from them; the same deep neural network model then determines phoneme classification results within that segment, from which the command-word output result is determined. Because a single DNN model both locates the target start and end points and classifies the phonemes in the effective segment, functionality is added without extra computational complexity, and recognition of the command-word speech endpoints (the target start and end points) is more accurate. In addition, a first speech-endpoint-detection method filters out large amounts of non-speech beforehand, avoiding the heavy computation of determining segment endpoints directly with the deep neural network, so the technical solution has a wider range of application.
Brief description of the drawings
Fig. 1a is a flowchart of a command-word speech detection method according to Embodiment 1 of the present invention;
Fig. 1b is a schematic diagram of a DNN structure applicable to embodiments of the present invention;
Fig. 2 is a flowchart of a command-word speech detection method according to Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a command-word speech detection apparatus according to Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a computer device according to Embodiment 4 of the present invention.
Detailed description of embodiments
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention rather than the entire structure.
Embodiment 1
Fig. 1a is a flowchart of a command-word speech detection method provided by Embodiment 1 of the present invention. This embodiment is applicable to waking a terminal device with a command word. The method may be performed by the command-word speech detection apparatus provided by an embodiment of the invention, which may be implemented in software and/or hardware. With reference to Fig. 1a, the method may specifically include the following steps:
S110: determine the target start point and target end point of preprocessed command-word speech using a deep neural network model, and determine an effective command-word speech segment from the target start point and the target end point.
A neural network embodies inference according to logical rules; its information processing is completed through the dynamic interaction among neurons. A deep neural network (DNN) usually refers to a multi-layer neural network whose working principle imitates the way the human brain reasons; with a DNN, processing is faster and recognition accuracy is higher. Optionally, the DNN model applied in embodiments of the invention is trained on large amounts of audio (noise as well as regular speech), so it tolerates more diverse noise. Judging non-speech more accurately benefits the subsequent command-word phoneme classification; that is, non-command words in the preprocessed command-word speech are recognized better, reducing the probability of misrecognition.
A command word may be an operating instruction spoken by a user to a device (a mobile phone, toy, home appliance, etc.); the device gives the corresponding feedback or starts voice interaction. Command words include "turn up the volume", "play the happy-birthday song", and so on. Command-word detection is a sub-field of speech recognition; command-word recognition is usually offline and demands little computation, so it is commonly used for terminal-device control, for example in device wake-up scenarios.
Specifically, the target start point and target end point of the preprocessed command-word speech are obtained by running the forward pass of the DNN; intermediate results of the forward pass are retained and can be reused for the phoneme classification output. Training the DNN is a multi-task process. Fig. 1b shows a schematic DNN structure: the input features may be Filter-Bank parameters or LPC (Linear Predictive Coding) parameters extracted from a frame of speech every 25 milliseconds. The technical solution provided by embodiments of the invention also realizes a two-class output; that is, the same DNN model produces both the endpoint-detection output and the phoneme classification output. Accordingly, the output layers of the DNN model in embodiments of the invention may include a VAD output layer and a phoneme output layer.
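As a concrete illustration of the shared-trunk, two-head structure sketched in Fig. 1b, the toy forward pass below feeds one frame's features through a shared hidden layer and then two softmax output heads: a VAD head (speech vs. non-speech) and a phoneme head. All names, layer sizes, and weights here are illustrative assumptions, not the patent's actual model:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def two_head_forward(frame_features, w_hidden, w_vad, w_phoneme):
    """Forward pass of a toy shared-trunk network with a 2-class VAD head
    and a phoneme head. Weights are plain nested lists (one row per output
    unit); biases are omitted for brevity."""
    # Shared hidden layer with ReLU activation.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, frame_features)))
              for row in w_hidden]
    # Head 1: speech vs. non-speech posterior.
    vad_probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in w_vad])
    # Head 2: per-frame phoneme posterior.
    phoneme_probs = softmax([sum(w * h for w, h in zip(row, hidden)) for row in w_phoneme])
    return vad_probs, phoneme_probs

vad, ph = two_head_forward(
    [1.0, 0.5],                              # one frame's (assumed) features
    [[1.0, 0.0], [0.0, 1.0]],                # shared hidden weights
    [[1.0, 0.0], [0.0, 1.0]],                # VAD head: 2 outputs
    [[1.0, 1.0], [1.0, -1.0], [0.0, 0.0]])   # phoneme head: 3 outputs
```

Because both heads read the same hidden activations, adding the second head reuses the trunk's intermediate results, which is the sense in which the extra output incurs no extra forward-pass cost.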
In a specific example, the target start point and target end point of the preprocessed command-word speech may be expressed in units of time, or by the command-word speech content corresponding to those times. Determining the effective command-word speech segment from the target start point and target end point may then be as follows: if the target start and end points are expressed as times, the command-word speech falling within the interval between the two times is taken as the effective command-word speech segment; if they are expressed as command-word speech content, the command-word speech between the two pieces of content is determined to be the effective command-word speech segment.
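The time-unit variant of segment extraction can be sketched as follows; the helper name and the 10 ms frame shift are assumptions for illustration, not values from the patent:

```python
def effective_segment(frames, start_ms, end_ms, frame_shift_ms=10):
    """Cut the effective command-word segment out of a per-frame sequence,
    given a target start and target end expressed in milliseconds and an
    assumed fixed frame shift."""
    first = start_ms // frame_shift_ms
    last = end_ms // frame_shift_ms
    return frames[first:last]

# 100 frames at a 10 ms shift → 1 s of audio; keep 200 ms–450 ms.
seg = effective_segment(list(range(100)), start_ms=200, end_ms=450)
# seg covers frames 20..44 (25 frames)
```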
S120: determine phoneme classification results within the effective command-word speech segment using the deep neural network model, and determine the command-word output result from the phoneme classification results.
A phoneme is the smallest unit composing a syllable, i.e., the smallest speech segment: the smallest linear phonetic unit divided from the perspective of sound quality, and a concretely existing physical phenomenon. The symbols of the International Phonetic Alphabet (formulated by the International Phonetic Association to uniformly transcribe the speech sounds of all languages) correspond one-to-one with the phonemes of human languages. For example, the Chinese syllable ā has only one phoneme, while dà (大, "big") has two.
Specifically, the DNN determines the phoneme classification results within the effective command-word speech segment; that is, the command word is classified according to phoneme categories, and the command-word output result is determined from those phoneme classification results. In one practical application scenario, this scheme is applied to voice wake-up of embedded devices such as smart speakers, and the embedded device responds according to the command-word output result. Embedded devices here typically refer to single-function devices with weak processing capability.
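A minimal sketch of turning per-frame phoneme posteriors into a command-word hypothesis: take the argmax phoneme of each frame and collapse consecutive repeats. This is a deliberate simplification for illustration; the keyword/filler decoding the patent mentions later is more elaborate:

```python
def greedy_phoneme_decode(frame_posteriors, phoneme_labels):
    """Greedy per-frame decode: pick the most probable phoneme for each
    frame and collapse runs of identical labels into one symbol."""
    seq = []
    for probs in frame_posteriors:
        best = phoneme_labels[max(range(len(probs)), key=probs.__getitem__)]
        if not seq or seq[-1] != best:
            seq.append(best)
    return seq

# Five frames of (assumed) posteriors over two phonemes 'a' and 'b':
posts = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8], [0.9, 0.1]]
# greedy_phoneme_decode(posts, ['a', 'b']) → ['a', 'b', 'a']
```

The decoded phoneme sequence could then be matched against the phoneme sequence of each registered command word.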
In embodiments of the invention, a deep neural network model determines the target start point and target end point of preprocessed command-word speech, and an effective command-word speech segment is determined from them; the same deep neural network model then determines phoneme classification results within that segment, from which the command-word output result is determined. Because a single DNN model both locates the target start and end points and classifies the phonemes in the effective segment, functionality is added without extra computational complexity, and recognition of the command-word speech endpoints (the target start and end points) is more accurate.
Optionally, determining the phoneme classification results within the effective command-word speech segment using the deep neural network model comprises: determining a phoneme classification result for each frame of command-word speech in the effective command-word speech segment using the deep neural network model. The per-frame phoneme classification results can also be used for command-word decoding and recognition, for example with keyword/filler decoding methods. Outputting a phoneme classification result for every frame helps improve the accuracy of the command-word output result.
Illustratively, the phoneme classification results include the probability that the command-word speech at the current moment belongs to each given phoneme. In a specific example, the current frame has some probability of belonging to phoneme a, some probability of belonging to phoneme b, and so on; the probabilities over all phonemes sum to 1, and the command-word output is then determined from these probabilities. This provides the basis for determining the command-word output.
On the basis of the above technical solution, determining the target start point and target end point of the preprocessed command-word speech comprises: if the speech probability within a continuous first preset period is greater than the non-speech probability within that period, determining the end time of the continuous first preset period to be the target start point, and/or determining the command-word speech node corresponding to that end time to be the target start point; and if the speech probability within a continuous second preset period is less than the non-speech probability within that period, determining the start time of the continuous second preset period to be the target end point, and/or determining the command-word speech node corresponding to the end time of the continuous second preset period to be the target end point.
Here, a speech probability is computed once every set interval, and a non-speech probability is computed at the same interval; the set interval may be, for example, 5 milliseconds. If the speech probability within a continuous first preset period (which may be, for example, 1 minute) is determined to be greater than the non-speech probability within that period, the end time of the continuous first preset period is determined to be the target start point, and/or the command-word speech node corresponding to that end time is determined to be the target start point.
If the speech probability within a continuous second preset period (which may also be, for example, 1 minute) is determined to be less than the non-speech probability within that period, the start time of the continuous second preset period is determined to be the target end point, and/or the command-word speech node corresponding to the end time of the continuous second preset period is determined to be the target end point.
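The consecutive-window decision rule above can be sketched as follows, with per-frame (speech, non-speech) probability pairs standing in for the DNN's VAD output. The window length, expressed here in frames rather than time, and the function name are assumptions for illustration:

```python
def find_target_points(vad_frames, window):
    """Scan per-frame (speech_prob, nonspeech_prob) pairs. The target start
    is declared at the end of the first run of `window` consecutive frames
    where speech probability exceeds non-speech probability; the target end
    at the start of the first subsequent run of `window` consecutive frames
    of the reverse. Returns (start_index, end_index) or None."""
    start = end = None
    run = 0
    for i, (p_sp, p_ns) in enumerate(vad_frames):
        if start is None:
            run = run + 1 if p_sp > p_ns else 0
            if run >= window:
                start = i  # end of the continuous speech window
                run = 0
        else:
            run = run + 1 if p_sp < p_ns else 0
            if run >= window:
                end = i - window + 1  # start of the continuous non-speech window
                break
    return (start, end) if start is not None and end is not None else None

# 2 non-speech frames, 3 speech frames, 3 non-speech frames; window of 3:
frames = [(0.2, 0.8)] * 2 + [(0.9, 0.1)] * 3 + [(0.1, 0.9)] * 3
# find_target_points(frames, 3) → (4, 5)
```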
Embodiment 2
Fig. 2 is a flowchart of a command-word speech detection method provided by Embodiment 2 of the present invention; this embodiment is realized on the basis of the above embodiment. With reference to Fig. 2, the method may specifically include the following steps:
S210: preprocess the command-word speech, where the preprocessing includes applying a first speech-endpoint-detection method to determine a first start point of the command-word speech, and determining the preprocessed command-word speech from the first start point.
Specifically, the first speech-endpoint-detection method may be statistics-based, or may be an energy-based double-threshold comparison. In general, determining the first start point with the first speech-endpoint-detection method does not cut off the command word inside the command-word speech segment, but some non-command-word content may remain in front of the command word. In a specific example, the non-command-word content may include noise and the like.
Applying the first speech-endpoint-detection method filters out large amounts of non-speech and avoids the heavy computation of determining command-word speech-segment endpoints directly with the deep neural network: the computation of a DNN is usually much larger than that of traditional VAD methods (such as energy-based discrimination), so directly applying a DNN is unsuitable for devices with weak processing capability. The technical solution provided by the invention therefore has a wider range of application.
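A minimal sketch of the energy-based double-threshold first pass mentioned above, under assumed threshold values: a high threshold must be crossed and held for a few frames to declare speech, and the start is then extended backwards while frame energy stays above a low threshold:

```python
def double_threshold_start(frames, high, low, min_run=3):
    """Classic energy-based double-threshold first pass over per-frame
    energies. A candidate start is a frame whose energy reaches `high` and
    stays there for `min_run` frames; the start is then pushed backwards
    while energy remains above `low`. Returns the start index or None."""
    for i, e in enumerate(frames):
        if e >= high and all(f >= high for f in frames[i:i + min_run]):
            j = i
            while j > 0 and frames[j - 1] >= low:
                j -= 1  # extend backwards through the low-energy onset
            return j
    return None

# Per-frame energies: quiet, a soft onset at index 2, loud speech, quiet.
energies = [0.1, 0.2, 0.6, 3.0, 3.5, 3.2, 0.2]
# double_threshold_start(energies, high=1.0, low=0.5) → 2
```

A conservative first pass like this deliberately keeps some non-speech before the command word; the DNN's target start point then refines it, as described above.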
S220: determine the target start point and target end point of the preprocessed command-word speech using a deep neural network model, and determine an effective command-word speech segment from the target start point and the target end point.
S230: determine phoneme classification results within the effective command-word speech segment using the deep neural network model, and determine the command-word output result from the phoneme classification results.
In this embodiment of the invention, the command-word speech is preprocessed by applying a first speech-endpoint-detection method to determine a first start point of the command-word speech, from which the preprocessed command-word speech is determined. This filters out large amounts of non-speech, avoids the heavy computation of determining command-word speech-segment endpoints directly with a deep neural network, and provides the basis for performing the detection with the DNN.
Embodiment 3
Fig. 3 is a schematic structural diagram of a command-word speech detection apparatus provided by Embodiment 3 of the present invention; the apparatus is adapted to perform the command-word speech detection method of the embodiments of the invention. As shown in Fig. 3, the apparatus may specifically include:
a determining module 310, configured to determine the target start point and target end point of preprocessed command-word speech using a deep neural network model, and to determine an effective command-word speech segment from the target start point and the target end point;
an output module 320, configured to determine phoneme classification results within the effective command-word speech segment using the deep neural network model, and to determine the command-word output result from the phoneme classification results.
The apparatus further includes:
a preprocessing module, configured to preprocess the command-word speech before the target start point and target end point of the preprocessed command-word speech are determined using the deep neural network model, where the preprocessing includes applying a first speech-endpoint-detection method to determine a first start point of the command-word speech and determining the preprocessed command-word speech from the first start point.
Further, the output module 320 is specifically configured to: determine a phoneme classification result for each frame of command-word speech in the effective command-word speech segment using the deep neural network model.
Further, the phoneme classification results include the probability that the command-word speech at the current moment belongs to each given phoneme.
Further, the determining module 310 is specifically configured to: if the speech probability within a continuous first preset period is greater than the non-speech probability within that period, determine the end time of the continuous first preset period to be the target start point, and/or determine the command-word speech node corresponding to that end time to be the target start point; and if the speech probability within a continuous second preset period is less than the non-speech probability within that period, determine the start time of the continuous second preset period to be the target end point, and/or determine the command-word speech node corresponding to the end time of the continuous second preset period to be the target end point.
The command-word speech detection apparatus provided by embodiments of the invention can perform the command-word speech detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects.
Embodiment 4
Fig. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention. Fig. 4 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the invention. The computer device 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of embodiments of the invention.
As shown in Fig. 4, the computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically comprises a variety of computer-system-readable media. These media may be any available media accessible by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random-access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 may read and write non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly called a "hard drive"). Although not shown in Fig. 4, a disk drive for reading and writing removable non-volatile magnetic disks (e.g., "floppy disks") and an optical drive for reading and writing removable non-volatile optical discs (e.g., CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data-media interfaces. The system memory 28 may include at least one program product having a set of (e.g., at least one) program modules configured to perform the functions of the embodiments of the invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods in the embodiments described in the invention.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, or display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card or modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through input/output (I/O) interfaces 22. Moreover, the computer device 12 may also communicate through the network adapter 20 with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet). As shown, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 4, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk-drive arrays, RAID systems, tape drives, and data-backup storage systems.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the command-word speech detection method provided by embodiments of the invention.
That is, when executing the program, the processing unit implements: determining the target start point and target end point of preprocessed command-word speech using a deep neural network model, and determining an effective command-word speech segment from the target start point and the target end point; determining phoneme classification results within the effective command-word speech segment using the deep neural network model, and determining the command-word output result from the phoneme classification results.
Embodiment 5
Embodiment 5 of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the command-word speech detection method provided by any embodiment of the application.
That is, when executed by a processor, the program implements: determining the target start point and target end point of preprocessed command-word speech using a deep neural network model, and determining an effective command-word speech segment from the target start point and the target end point; determining phoneme classification results within the effective command-word speech segment using the deep neural network model, and determining the command-word output result from the phoneme classification results.
Any combination of one or more computer-readable media may be employed. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact-disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program for use by, or in connection with, an instruction-execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave and carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction-execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted over any suitable medium, including, but not limited to, wireless, wireline, optical cable, RF, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the invention is determined by the scope of the appended claims.

Claims (10)

1. A command-word speech detection method, comprising:
determining a target starting point and a target end point of pre-processed command-word speech using a deep neural network model, and determining an effective command-word speech segment according to the target starting point and the target end point; and
determining phoneme classification results within the effective command-word speech segment using the deep neural network model, and determining a command-word output result according to the phoneme classification results.
2. The method according to claim 1, wherein before determining the target starting point and the target end point of the pre-processed command-word speech using the deep neural network model, the method further comprises:
pre-processing command-word speech, wherein the pre-processing comprises determining a first starting point of the command-word speech using a first voice endpoint detection method, and determining the pre-processed command-word speech according to the first starting point.
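The pre-processing in claim 2 above locates a coarse first starting point with a conventional ("first") voice endpoint detection method before the deep neural network stage runs. The claim does not fix which detector is used; the sketch below assumes a simple short-time-energy detector, with the frame length and energy ratio as purely illustrative parameters.

```python
import numpy as np

def first_start_point(samples, frame_len=160, energy_ratio=0.1):
    """Hedged sketch of the claimed pre-processing: find a coarse first
    starting point (here via short-time energy, one common endpoint
    detector; the patent does not specify the method) and keep only the
    audio from that point on as the pre-processed command-word speech."""
    # Split the signal into fixed-length frames (trailing remainder dropped).
    frames = samples[: len(samples) // frame_len * frame_len].reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)        # short-time energy per frame
    thresh = energy_ratio * energy.max()       # relative threshold (assumption)
    active = np.flatnonzero(energy > thresh)
    first = int(active[0]) * frame_len if active.size else 0
    return samples[first:], first              # pre-processed speech, first starting point
```

In the claimed method, the output of this coarse stage is what the deep neural network then refines into the target starting point and target end point.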
3. The method according to claim 1, wherein determining the phoneme classification results within the effective command-word speech segment using the deep neural network model comprises:
determining a phoneme classification result for each frame of command-word speech in the effective command-word speech segment using the deep neural network model.
4. The method according to claim 3, wherein the phoneme classification results comprise:
a probability that the command-word speech at the current time belongs to a set phoneme.
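Claims 3-4 above describe per-frame phoneme classification results expressed as the probability that the current frame belongs to a set phoneme. A neural network classifier typically produces such probabilities with a softmax over its output scores; the sketch below assumes that arrangement (the network producing the scores is not given by the patent).

```python
import numpy as np

def phoneme_posteriors(logits):
    """Per-frame softmax turning raw network scores into the probability
    that the command-word speech at the current time belongs to each set
    phoneme. `logits` has shape (frames, num_phonemes); the deep neural
    network producing it is assumed, not specified in the text."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponential
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)         # rows sum to 1
```

Each output row is then a probability distribution over the set phonemes for one frame, which is the form the claimed decision step consumes.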
5. The method according to any one of claims 1-4, wherein determining the target starting point and the target end point of the pre-processed command-word speech comprises:
if the speech probability within a continuous first preset time is greater than the non-speech probability within the continuous first preset time, determining that the end time of the continuous first preset time is the target starting point, and/or determining that the command-word speech node corresponding to the end of the continuous first preset time is the target starting point; and
if the speech probability within a continuous second preset time is less than the non-speech probability within the continuous second preset time, determining that the start time of the continuous second preset time is the target end point, and/or determining that the command-word speech node corresponding to the end of the continuous second preset time is the target end point.
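The start/end rule in claim 5 above compares speech and non-speech probabilities over continuous preset times. Since the two probabilities sum to one per frame, "speech probability greater than non-speech probability" reduces to a per-frame probability above 0.5. The sketch below implements the rule with frame-count windows `n1`/`n2` standing in for the first and second preset times; those window lengths are illustrative assumptions.

```python
import numpy as np

def target_endpoints(p_speech, n1=3, n2=3):
    """Sketch of the claimed rule. The target starting point is taken as
    the end of the first run of n1 consecutive frames where speech
    outweighs non-speech; the target end point is the start of the first
    later run of n2 consecutive frames where non-speech outweighs speech."""
    is_speech = p_speech > 0.5   # speech probability > non-speech probability
    start = end = None
    run = 0
    for i, s in enumerate(is_speech):
        run = run + 1 if s else 0
        if run == n1:
            start = i + 1        # end time of the continuous first preset time
            break
    if start is None:
        return None, None        # no command-word speech detected
    run = 0
    for i in range(start, len(is_speech)):
        run = run + 1 if not is_speech[i] else 0
        if run == n2:
            end = i - n2 + 1     # start time of the continuous second preset time
            break
    return start, end
```

Requiring a full run of consecutive frames before committing to a start or end point is what makes the rule robust to isolated frames misclassified by the model.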
6. A command-word speech endpoint detection apparatus, comprising:
a determining module, configured to determine a target starting point and a target end point of pre-processed command-word speech using a deep neural network model, and to determine an effective command-word speech segment according to the target starting point and the target end point; and
an output module, configured to determine phoneme classification results within the effective command-word speech segment using the deep neural network model, and to determine a command-word output result according to the phoneme classification results.
7. The apparatus according to claim 6, further comprising:
a pre-processing module, configured to pre-process command-word speech before the target starting point and the target end point of the pre-processed command-word speech are determined using the deep neural network model, wherein the pre-processing comprises determining a first starting point of the command-word speech using a first voice endpoint detection method, and determining the pre-processed command-word speech according to the first starting point.
8. The apparatus according to claim 6, wherein the output module is specifically configured to:
determine a phoneme classification result for each frame of command-word speech in the effective command-word speech segment using the deep neural network model.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1-5.
10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-5.
CN201810764304.3A 2018-07-12 2018-07-12 Order word sound detection method, device, equipment and storage medium Pending CN108932943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810764304.3A CN108932943A (en) 2018-07-12 2018-07-12 Order word sound detection method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN108932943A true CN108932943A (en) 2018-12-04

Family

ID=64447564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810764304.3A Pending CN108932943A (en) 2018-07-12 2018-07-12 Order word sound detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108932943A (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10254477A (en) * 1997-03-10 1998-09-25 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Phonemic boundary detector and speech recognition device
CN101656070A (en) * 2008-08-22 2010-02-24 展讯通信(上海)有限公司 Voice detection method
CN104067314A (en) * 2014-05-23 2014-09-24 中国科学院自动化研究所 Human-shaped image segmentation method
US20150126252A1 (en) * 2008-04-08 2015-05-07 Lg Electronics Inc. Mobile terminal and menu control method thereof
CN105118502A (en) * 2015-07-14 2015-12-02 百度在线网络技术(北京)有限公司 Endpoint detection method and system for a speech recognition system
CN106297828A (en) * 2016-08-12 2017-01-04 苏州驰声信息科技有限公司 Mispronunciation detection method and device based on deep learning
CN106297773A (en) * 2015-05-29 2017-01-04 中国科学院声学研究所 A neural network acoustic model training method
CN106611598A (en) * 2016-12-28 2017-05-03 上海智臻智能网络科技股份有限公司 VAD dynamic parameter adjusting method and device
CN106875936A (en) * 2017-04-18 2017-06-20 广州视源电子科技股份有限公司 Speech recognition method and device
CN107644638A (en) * 2017-10-17 2018-01-30 北京智能管家科技有限公司 Speech recognition method, device, terminal and computer-readable storage medium
CN108010515A (en) * 2017-11-21 2018-05-08 清华大学 Voice endpoint detection and wake-up method and device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Guoguo Chen, Carolina Parada: "Small-footprint keyword spotting using deep neural networks", International Conference on Acoustics, Speech and Signal Processing (ICASSP) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110992929A (en) * 2019-11-26 2020-04-10 苏宁云计算有限公司 Voice keyword detection method, device and system based on neural network
CN111599350A (en) * 2020-04-07 2020-08-28 云知声智能科技股份有限公司 Command word customization identification method and system
CN111599350B (en) * 2020-04-07 2023-02-28 云知声智能科技股份有限公司 Command word customization identification method and system
CN116884399A (en) * 2023-09-06 2023-10-13 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition
CN116884399B (en) * 2023-09-06 2023-12-08 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition

Similar Documents

Publication Publication Date Title
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN110444193B (en) Method and device for recognizing voice keywords
US11281945B1 (en) Multimodal dimensional emotion recognition method
US10878807B2 (en) System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system
CN109036405A (en) Voice interactive method, device, equipment and storage medium
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN107622770A Voice wake-up method and device
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
CN109885713A (en) Facial expression image recommended method and device based on voice mood identification
CN107134279A Voice wake-up method, device, terminal and storage medium
US11790921B2 (en) Speaker separation based on real-time latent speaker state characterization
CN111312245A (en) Voice response method, device and storage medium
CN110097870A (en) Method of speech processing, device, equipment and storage medium
CN108932943A (en) Order word sound detection method, device, equipment and storage medium
US20190066669A1 (en) Graphical data selection and presentation of digital content
CN110995943B (en) Multi-user streaming voice recognition method, system, device and medium
US10950221B2 (en) Keyword confirmation method and apparatus
CN114127849A (en) Speech emotion recognition method and device
CN113160854A (en) Voice interaction system, related method, device and equipment
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN112863496B (en) Voice endpoint detection method and device
CN115512698B (en) Speech semantic analysis method
CN115098765A (en) Information pushing method, device and equipment based on deep learning and storage medium
CN113920996A (en) Voice interaction processing method and device, electronic equipment and storage medium
CN113506565A (en) Speech recognition method, speech recognition device, computer-readable storage medium and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181204