CN108932943A - Command word speech detection method, apparatus, device and storage medium - Google Patents
Command word speech detection method, apparatus, device and storage medium
- Publication number
- CN108932943A CN108932943A CN201810764304.3A CN201810764304A CN108932943A CN 108932943 A CN108932943 A CN 108932943A CN 201810764304 A CN201810764304 A CN 201810764304A CN 108932943 A CN108932943 A CN 108932943A
- Authority
- CN
- China
- Prior art keywords
- command word
- speech
- target
- starting point
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
- G10L15/05—Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Abstract
Embodiments of the present invention disclose a command word speech detection method, apparatus, device, and storage medium. The method comprises: determining the target starting point and target end point of preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point; determining phoneme classification results within the valid command word speech segment using the same deep neural network model, and determining a command word output result according to the phoneme classification results. Detection of the command word speech endpoints is more accurate, and no additional computational complexity is introduced when determining the target starting point and target end point.
Description
Technical field
The present invention relates to speech recognition technology, and in particular to a command word speech detection method, apparatus, device, and storage medium.
Background technique
In a speech recognition system, the input audio signal contains both speech and ambient noise. Finding the speech segments within the input audio signal is referred to as speech endpoint detection, start/end point detection, or voice activity detection (VAD). The purpose of VAD is to identify and eliminate prolonged silent periods from the audio signal. Whether endpoint detection is accurate directly affects the performance of the speech recognition system.
In implementing the present invention, the inventors found at least the following problems in the prior art: speech endpoint detection is inaccurate, and its function is limited in that it only identifies the start and end of speech.
Summary of the invention
Embodiments of the present invention provide a command word speech detection method, apparatus, device, and storage medium, so that recognition of the command word speech endpoints is more accurate and no additional computational complexity is introduced when determining the target starting point and target end point.
In a first aspect, an embodiment of the present invention provides a command word speech detection method, comprising:
determining the target starting point and target end point of preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point;
determining phoneme classification results within the valid command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
In a second aspect, an embodiment of the present invention further provides a command word speech endpoint detection apparatus, comprising:
a determining module, configured to determine the target starting point and target end point of preprocessed command word speech using a deep neural network model, and to determine a valid command word speech segment according to the target starting point and the target end point;
an output module, configured to determine phoneme classification results within the valid command word speech segment using the deep neural network model, and to determine a command word output result according to the phoneme classification results.
In a third aspect, an embodiment of the present invention further provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the command word speech detection method according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the command word speech detection method according to any embodiment of the present invention.
In the embodiments of the present invention, the target starting point and target end point of preprocessed command word speech are determined using a deep neural network model, and a valid command word speech segment is determined according to the target starting point and the target end point; phoneme classification results within the valid command word speech segment are determined using the same deep neural network model, and a command word output result is determined according to the phoneme classification results. Because a single DNN model determines both the target starting point and target end point of the preprocessed command word speech and the phoneme classification results within the valid command word speech segment, functionality is added without adding computational complexity, and recognition of the command word speech endpoints (target starting point and target end point) is more accurate. In addition, a first speech endpoint detection method filters out a large amount of non-speech, avoiding the heavy computational load incurred by determining the command word speech segment endpoints directly with a deep neural network. The technical solution provided by the present invention is therefore applicable to a wider range of scenarios.
Detailed description of the invention
Fig. 1a is a flowchart of a command word speech detection method in Embodiment 1 of the present invention;
Fig. 1b is a schematic diagram of a DNN structure applicable to embodiments of the present invention;
Fig. 2 is a flowchart of a command word speech detection method in Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of a command word speech detection apparatus in Embodiment 3 of the present invention;
Fig. 4 is a schematic structural diagram of a computer device in Embodiment 4 of the present invention.
Specific embodiment
The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are used only to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment one
Fig. 1a is a flowchart of a command word speech detection method provided by Embodiment 1 of the present invention. This embodiment is applicable, for example, to waking up a terminal device with a command word. The method may be executed by the command word speech detection apparatus provided by an embodiment of the present invention, which may be implemented in software and/or hardware. With reference to Fig. 1a, the method may specifically comprise the following steps:
S110: determining the target starting point and target end point of preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point.
A neural network algorithm embodies a process of inference according to logic rules; information processing is typically accomplished through the dynamic interaction among neurons. A deep neural network (DNN) usually refers to a multi-layer neural network whose working principle simulates the way the human brain thinks; with a DNN, processing is faster and recognition accuracy is higher. Optionally, the DNN model applied in the embodiments of the present invention is trained on a large amount of speech (noise, regular speech, and so on), so it tolerates more diverse noise. That is, its more accurate judgment of non-speech benefits the subsequent phoneme classification of the command word: it recognizes non-command speech within the preprocessed command word speech better, reducing the probability of misrecognition.
A command word may be an operation instruction spoken by a user to a device (a mobile phone, a toy, a household appliance, and so on); the device gives corresponding feedback and starts voice interaction. Command words may be, for example, "turn up the volume" or "play the happy birthday song". Command word detection is a subdivision of speech recognition; command word recognition is usually performed offline, requires relatively little computation, and is commonly used for terminal device control, for example in scenarios of waking up a terminal device.
Specifically, the target starting point and target end point of the preprocessed command word speech are obtained by the forward computation of the DNN, and intermediate results of the forward computation are retained; these intermediate results can be reused for outputting the phoneme classification results. Training of the DNN is a multi-task process. Fig. 1b shows a schematic diagram of the DNN structure. The input feature may be a Filter-Bank parameter or an LPC (Linear Predictive Coding) parameter taken from one frame of speech every 25 milliseconds. The technical solution provided by embodiments of the present invention can thus realize two classification outputs: the same DNN model realizes both the endpoint detection output and the phoneme classification output, wherein the output layers of the DNN model in embodiments of the present invention may include a VAD output layer and a phoneme output layer.
In a specific example, the target starting point and target end point of the preprocessed command word speech may be represented as time values, or as the command word speech content corresponding to those time values. Determining the valid command word speech segment according to the target starting point and the target end point may specifically be: if the target starting point and target end point are represented as times, the command word speech segment corresponding to the period between the two times is recorded as the valid command word speech segment; if they are represented as command word speech content, the command word speech segment between the two pieces of content is determined to be the valid command word speech segment.
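For the time-value representation, cutting the valid segment out of the audio reduces to converting the two times into sample indices. A minimal sketch, assuming PCM samples and times in seconds (both assumptions, since the patent does not fix a representation):

```python
def extract_valid_segment(samples, sample_rate, target_start_s, target_end_s):
    """Cut the valid command word speech segment out of the audio, given the
    target starting point and target end point as times in seconds."""
    start = int(round(target_start_s * sample_rate))
    end = int(round(target_end_s * sample_rate))
    return samples[start:end]

audio = list(range(16000))  # one second of dummy 16 kHz samples
segment = extract_valid_segment(audio, 16000, 0.25, 0.75)
# the 0.5 s between the two endpoints -> 8000 samples
```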
S120: determining the phoneme classification results within the valid command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
A phoneme is the smallest unit that composes a syllable, or the smallest speech fragment; it is the smallest linear phonetic unit divided from the perspective of sound quality, and phonemes are concrete physical phenomena. The symbols of the International Phonetic Alphabet (formulated by the International Phonetic Association to uniformly notate the speech of all languages) correspond one-to-one with the phonemes of all human languages. For example, the Chinese syllable ā has only one phoneme, while dà has two.
Specifically, the phoneme classification results within the valid command word speech segment are determined using the DNN; that is, the command word is classified according to phoneme criteria, and the command word output result is determined according to the phoneme classification results. In a practical application scenario, this solution is applied to voice wake-up of embedded devices, such as smart speakers; the embedded device responds according to the command word output result. An embedded device typically refers to a single-function device with relatively weak processing capability.
In the embodiments of the present invention, the target starting point and target end point of preprocessed command word speech are determined using a deep neural network model, and a valid command word speech segment is determined according to the target starting point and the target end point; phoneme classification results within the valid command word speech segment are determined using the same deep neural network model, and a command word output result is determined according to the phoneme classification results. Because a single DNN model determines both the endpoints and the phoneme classification results, functionality is added without adding computational complexity, and recognition of the command word speech endpoints (target starting point and target end point) is more accurate.
Optionally, determining the phoneme classification results within the valid command word speech segment using the deep neural network model comprises: for each frame of command word speech within the valid command word speech segment, determining a phoneme classification result using the deep neural network model. The per-frame phoneme classification results can also be used for decoding the command word, for example using a keyword/filler command word recognition method. Outputting a phoneme classification result for each frame of command word speech helps improve the accuracy of the command word output result.
Illustratively, the phoneme classification results include the probability that the command word speech at the current time belongs to a given phoneme. In a specific example, the command word speech at the current time has a probability of belonging to phoneme a, a probability of belonging to phoneme b, and so on, wherein the probabilities of the command word speech at the current time belonging to all phonemes sum to 1; the command word output is then determined according to these probabilities. This provides the basis for determining the command word output.
On the basis of the above technical solution, determining the target starting point and target end point of the preprocessed command word speech comprises: if the speech probability within a continuous first preset time is greater than the non-speech probability within the continuous first preset time, determining the end time of the continuous first preset time to be the target starting point, and/or determining the command word speech node corresponding to the end time of the continuous first preset time to be the target starting point; if the speech probability within a continuous second preset time is less than the non-speech probability within the continuous second preset time, determining the start time of the continuous second preset time to be the target end point, and/or determining the command word speech node corresponding to the start time of the continuous second preset time to be the target end point.
Herein, a speech probability (the probability that speech occurs) and a non-speech probability (the probability that non-speech occurs) are each computed at every set interval; the set interval may be, for example, 5 milliseconds. If it is determined that the speech probability within the continuous first preset time, which may be, for example, 1 minute, is greater than the non-speech probability within that time, then the end time of the continuous first preset time is determined to be the target starting point, and/or the command word speech node corresponding to that end time is determined to be the target starting point.
If it is determined that the speech probability within the continuous second preset time, which may likewise be, for example, 1 minute, is less than the non-speech probability within that time, then the start time of the continuous second preset time is determined to be the target end point, and/or the command word speech node corresponding to that start time is determined to be the target end point.
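The rule above can be sketched on per-frame speech probabilities: the target starting point is the end time of the first run in which speech stays more probable than non-speech for the whole first preset time, and the target end point is the start time of the first subsequent run in which non-speech stays more probable for the whole second preset time. The frame length, preset durations, and the use of 1 - p as the non-speech probability are assumptions for illustration.

```python
def find_endpoints(speech_probs, frame_s, first_preset_s, second_preset_s):
    """Scan per-frame speech probabilities (non-speech probability taken
    as 1 - p) and return (target_start_time, target_end_time) in seconds."""
    need_start = round(first_preset_s / frame_s)   # frames in first preset time
    need_end = round(second_preset_s / frame_s)    # frames in second preset time
    start = end = None
    run = 0
    for i, p in enumerate(speech_probs):
        if start is None:
            run = run + 1 if p > 0.5 else 0        # speech prob > non-speech prob
            if run == need_start:
                start = (i + 1) * frame_s          # end time of the run
                run = 0
        else:
            run = run + 1 if p < 0.5 else 0        # speech prob < non-speech prob
            if run == need_end:
                end = (i + 1 - need_end) * frame_s  # start time of the run
                break
    return start, end

# 5 ms frames, 20 ms preset times: 4 frames of silence, 10 of speech, 6 of silence.
probs = [0.1] * 4 + [0.9] * 10 + [0.1] * 6
s, e = find_endpoints(probs, 0.005, 0.02, 0.02)
```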
Embodiment two
Fig. 2 is a flowchart of a command word speech detection method provided by Embodiment 2 of the present invention; this embodiment is realized on the basis of the above embodiment. With reference to Fig. 2, the method may specifically comprise the following steps:
S210: preprocessing the command word speech, wherein the preprocessing comprises applying a first speech endpoint detection method to determine a first starting point of the command word speech, and determining the preprocessed command word speech according to the first starting point.
Specifically, the first speech endpoint detection method may be a statistics-based detection method or an energy-based double-threshold comparison. In general, determining the first starting point with the first speech endpoint detection method does not cut away the command word within the command word speech segment, but some non-command speech may still be carried ahead of the command word. In a specific example, the non-command speech may include noise and the like.
Applying the first speech endpoint detection method filters out a large amount of non-speech, avoiding the heavy computational load incurred by determining the command word speech segment endpoints directly with a deep neural network: the computation of a DNN is usually much larger than that of a traditional VAD method (such as energy-based discrimination), so applying the DNN directly is unsuitable for devices with weaker processing capability. The technical solution provided by the present invention is therefore applicable to a wider range of scenarios.
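The energy-based double-threshold comparison mentioned above can be sketched as follows: scan frame energies, and once a frame clearly exceeds a high threshold, back off to the earliest adjacent frame still above a low threshold, taking that as the first starting point. The thresholds, frame size, and back-off strategy are assumptions for illustration; the patent does not specify them.

```python
def first_start_point(frames, low_thr, high_thr):
    """Energy-based double-threshold first pass over a list of frames
    (each a list of samples). Returns the index of the first-start frame,
    or None if the high threshold is never reached."""
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    for i, e in enumerate(energies):
        if e >= high_thr:                    # clear speech onset found
            j = i
            while j > 0 and energies[j - 1] >= low_thr:
                j -= 1                       # back off through moderate energy
            return j
    return None

# Dummy frames: silence, then moderate energy, then clear speech.
frames = [[0.01] * 80, [0.05] * 80, [0.3] * 80, [0.5] * 80]
idx = first_start_point(frames, low_thr=0.001, high_thr=0.05)
```

A cheap first pass like this is exactly what lets the heavier DNN stage run only on audio that plausibly contains the command word.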
S220: determining the target starting point and target end point of the preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point.
S230: determining the phoneme classification results within the valid command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
In this embodiment of the present invention, the command word speech is preprocessed, wherein the preprocessing comprises applying a first speech endpoint detection method to determine a first starting point of the command word speech and determining the preprocessed command word speech according to the first starting point. Applying the first speech endpoint detection method filters out a large amount of non-speech, avoids the heavy computational load of determining command word speech segment endpoints directly with a deep neural network, and provides the foundation for performing detection with the DNN.
Embodiment three
Fig. 3 is a schematic structural diagram of a command word speech detection apparatus provided by Embodiment 3 of the present invention; the apparatus is adapted to execute the command word speech detection method provided by embodiments of the present invention. As shown in Fig. 3, the apparatus may specifically comprise:
a determining module 310, configured to determine the target starting point and target end point of preprocessed command word speech using a deep neural network model, and to determine a valid command word speech segment according to the target starting point and the target end point;
an output module 320, configured to determine the phoneme classification results within the valid command word speech segment using the deep neural network model, and to determine a command word output result according to the phoneme classification results.
Further, the apparatus also comprises:
a preprocessing module, configured to preprocess the command word speech before the target starting point and target end point of the preprocessed command word speech are determined using the deep neural network model, wherein the preprocessing comprises applying a first speech endpoint detection method to determine a first starting point of the command word speech and determining the preprocessed command word speech according to the first starting point.
Further, the output module 320 is specifically configured to:
for each frame of command word speech within the valid command word speech segment, determine a phoneme classification result using the deep neural network model.
Further, the phoneme classification results include the probability that the command word speech at the current time belongs to a given phoneme.
Further, the determining module 310 is specifically configured to:
if the speech probability within a continuous first preset time is greater than the non-speech probability within the continuous first preset time, determine the end time of the continuous first preset time to be the target starting point, and/or determine the command word speech node corresponding to the end time of the continuous first preset time to be the target starting point;
if the speech probability within a continuous second preset time is less than the non-speech probability within the continuous second preset time, determine the start time of the continuous second preset time to be the target end point, and/or determine the command word speech node corresponding to the start time of the continuous second preset time to be the target end point.
The command word speech detection apparatus provided by the embodiments of the present invention can execute the command word speech detection method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
Example IV
Fig. 4 is a schematic structural diagram of a computer device provided by Embodiment 4 of the present invention. Fig. 4 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present invention. The computer device 12 shown in Fig. 4 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in Fig. 4, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically comprises a variety of computer-system-readable media. These media may be any available media accessible by the computer device 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly referred to as a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading and writing removable non-volatile magnetic disks (such as "floppy disks") and an optical disk drive for reading and writing removable non-volatile optical disks (such as CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The system memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally carry out the functions and/or methods in the embodiments described in the present invention.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, and so on), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, a modem, and so on) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 4, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example implementing the command word speech detection method provided by the embodiments of the present invention:
that is, when executing the program, the processing unit implements: determining the target starting point and target end point of preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point; determining the phoneme classification results within the valid command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
Embodiment five
Embodiment 5 of the present invention provides a computer-readable storage medium on which a computer program is stored; the program, when executed by a processor, implements the command word speech detection method provided by all of the inventive embodiments of this application:
that is, when executed by a processor, the program implements: determining the target starting point and target end point of preprocessed command word speech using a deep neural network model, and determining a valid command word speech segment according to the target starting point and the target end point; determining the phoneme classification results within the valid command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
Any combination of one or more computer-readable media may be employed. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted over any suitable medium, including, but not limited to, wireless links, electrical wires, optical cables, RF, or any suitable combination of the above.
Computer program code for carrying out operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the internet via an internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will appreciate that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in further detail through the above embodiments, it is not limited to the above embodiments; it may also include further equivalent embodiments without departing from the inventive concept, and the scope of the present invention is determined by the scope of the appended claims.
Claims (10)
1. A command word speech detection method, comprising:
determining a target starting point and a target end point of preprocessed command word speech using a deep neural network model, and determining an effective command word speech segment according to the target starting point and the target end point;
determining phoneme classification results within the effective command word speech segment using the deep neural network model, and determining a command word output result according to the phoneme classification results.
2. The method according to claim 1, wherein before determining the target starting point and the target end point of the preprocessed command word speech using the deep neural network model, the method further comprises:
preprocessing the command word speech, wherein the preprocessing comprises determining a first starting point of the command word speech using a first voice endpoint detection method, and determining the preprocessed command word speech according to the first starting point.
3. The method according to claim 1, wherein determining the phoneme classification results within the effective command word speech segment using the deep neural network model comprises:
determining a phoneme classification result for each frame of command word speech in the effective command word speech segment using the deep neural network model.
4. The method according to claim 3, wherein the phoneme classification results comprise:
the probability that the command word speech at the current moment belongs to a set phoneme.
5. The method according to any one of claims 1-4, wherein determining the target starting point and the target end point of the preprocessed command word speech comprises:
if the speech probability within a continuous first preset time is greater than the non-speech probability within the continuous first preset time, determining that the end moment of the continuous first preset time is the target starting point, and/or determining that the command word speech node corresponding to the end of the continuous first preset time is the target starting point;
if the speech probability within a continuous second preset time is less than the non-speech probability within the continuous second preset time, determining that the start moment of the continuous second preset time is the target end point, and/or determining that the command word speech node corresponding to the end of the continuous second preset time is the target end point.
6. A command word speech endpoint detection device, comprising:
a determining module, configured to determine a target starting point and a target end point of preprocessed command word speech using a deep neural network model, and to determine an effective command word speech segment according to the target starting point and the target end point;
an output module, configured to determine phoneme classification results within the effective command word speech segment using the deep neural network model, and to determine a command word output result according to the phoneme classification results.
7. The device according to claim 6, further comprising:
a preprocessing module, configured to preprocess the command word speech before the target starting point and the target end point of the preprocessed command word speech are determined using the deep neural network model, wherein the preprocessing comprises determining a first starting point of the command word speech using a first voice endpoint detection method, and determining the preprocessed command word speech according to the first starting point.
8. The device according to claim 6, wherein the output module is specifically configured to:
determine a phoneme classification result for each frame of command word speech in the effective command word speech segment using the deep neural network model.
9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1-5 when executing the program.
10. A computer-readable storage medium on which a computer program is stored, wherein the program implements the method according to any one of claims 1-5 when executed by a processor.
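Claims 3-4 specify per-frame phoneme posteriors, but the claims leave open how the phoneme classification results are reduced to a command word output result. One common reduction, described in the cited Chen & Parada keyword-spotting paper, smooths the posteriors over a short window and combines the per-phoneme maxima into a single confidence score that is then thresholded. The sketch below illustrates that reduction; it is not the patented method, and all names and window sizes are hypothetical.

```python
import numpy as np

def keyword_confidence(posteriors, keyword_phones, w_smooth=3):
    """posteriors: (T, P) matrix of per-frame phoneme posteriors.
    Smooth each phoneme track over the last `w_smooth` frames, then take
    the geometric mean of the best smoothed posterior of each phoneme in
    the command word. A threshold on this score yields the final
    command word output decision."""
    smoothed = np.empty_like(posteriors, dtype=float)
    for t in range(len(posteriors)):
        lo = max(0, t - w_smooth + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)   # causal average
    peaks = [smoothed[:, p].max() for p in keyword_phones]
    return float(np.prod(peaks) ** (1.0 / len(keyword_phones)))
```

A segment in which the command word's phonemes peak in sequence scores well above a segment of uniformly uncertain posteriors, which is what makes a fixed threshold usable.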
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810764304.3A CN108932943A (en) | 2018-07-12 | 2018-07-12 | Order word sound detection method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108932943A true CN108932943A (en) | 2018-12-04 |
Family
ID=64447564
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810764304.3A Pending CN108932943A (en) | 2018-07-12 | 2018-07-12 | Order word sound detection method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108932943A (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10254477A (en) * | 1997-03-10 | 1998-09-25 | Atr Onsei Honyaku Tsushin Kenkyusho:Kk | Phonemic boundary detector and speech recognition device |
CN101656070A (en) * | 2008-08-22 | 2010-02-24 | 展讯通信(上海)有限公司 | Voice detection method |
CN104067314A (en) * | 2014-05-23 | 2014-09-24 | 中国科学院自动化研究所 | Human-shaped image segmentation method |
US20150126252A1 (en) * | 2008-04-08 | 2015-05-07 | Lg Electronics Inc. | Mobile terminal and menu control method thereof |
CN105118502A (en) * | 2015-07-14 | 2015-12-02 | 百度在线网络技术(北京)有限公司 | End point detection method and system of voice identification system |
CN106297828A (en) * | 2016-08-12 | 2017-01-04 | 苏州驰声信息科技有限公司 | The detection method of a kind of mistake utterance detection based on degree of depth study and device |
CN106297773A (en) * | 2015-05-29 | 2017-01-04 | 中国科学院声学研究所 | A kind of neutral net acoustic training model method |
CN106611598A (en) * | 2016-12-28 | 2017-05-03 | 上海智臻智能网络科技股份有限公司 | VAD dynamic parameter adjusting method and device |
CN106875936A (en) * | 2017-04-18 | 2017-06-20 | 广州视源电子科技股份有限公司 | Audio recognition method and device |
CN107644638A (en) * | 2017-10-17 | 2018-01-30 | 北京智能管家科技有限公司 | Audio recognition method, device, terminal and computer-readable recording medium |
CN108010515A (en) * | 2017-11-21 | 2018-05-08 | 清华大学 | A kind of speech terminals detection and awakening method and device |
2018-07-12 CN CN201810764304.3A patent/CN108932943A/en active Pending
Non-Patent Citations (1)
Title |
---|
GUOGUO CHEN, CAROLINA PARADA: "Small-footprint keyword spotting using deep neural networks", International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
CN111599350A (en) * | 2020-04-07 | 2020-08-28 | 云知声智能科技股份有限公司 | Command word customization identification method and system |
CN111599350B (en) * | 2020-04-07 | 2023-02-28 | 云知声智能科技股份有限公司 | Command word customization identification method and system |
CN116884399A (en) * | 2023-09-06 | 2023-10-13 | 深圳市友杰智新科技有限公司 | Method, device, equipment and medium for reducing voice misrecognition |
CN116884399B (en) * | 2023-09-06 | 2023-12-08 | 深圳市友杰智新科技有限公司 | Method, device, equipment and medium for reducing voice misrecognition |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021093449A1 (en) | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium | |
CN110444193B (en) | Method and device for recognizing voice keywords | |
US11281945B1 (en) | Multimodal dimensional emotion recognition method | |
US10878807B2 (en) | System and method for implementing a vocal user interface by combining a speech to text system and a speech to intent system | |
CN109036405A (en) | Voice interactive method, device, equipment and storage medium | |
WO2020253509A1 (en) | Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium | |
CN107622770A (en) | voice awakening method and device | |
CN111081280B (en) | Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method | |
CN109885713A (en) | Facial expression image recommended method and device based on voice mood identification | |
CN107134279A (en) | A kind of voice awakening method, device, terminal and storage medium | |
US11790921B2 (en) | Speaker separation based on real-time latent speaker state characterization | |
CN111312245A (en) | Voice response method, device and storage medium | |
CN110097870A (en) | Method of speech processing, device, equipment and storage medium | |
CN108932943A (en) | Order word sound detection method, device, equipment and storage medium | |
US20190066669A1 (en) | Graphical data selection and presentation of digital content | |
CN110995943B (en) | Multi-user streaming voice recognition method, system, device and medium | |
US10950221B2 (en) | Keyword confirmation method and apparatus | |
CN114127849A (en) | Speech emotion recognition method and device | |
CN113160854A (en) | Voice interaction system, related method, device and equipment | |
CN113611316A (en) | Man-machine interaction method, device, equipment and storage medium | |
CN112863496B (en) | Voice endpoint detection method and device | |
CN115512698B (en) | Speech semantic analysis method | |
CN115098765A (en) | Information pushing method, device and equipment based on deep learning and storage medium | |
CN113920996A (en) | Voice interaction processing method and device, electronic equipment and storage medium | |
CN113506565A (en) | Speech recognition method, speech recognition device, computer-readable storage medium and processor |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20181204 |