CN115457938A - Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words - Google Patents

Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words

Info

Publication number
CN115457938A
CN115457938A (application CN202211145889.3A)
Authority
CN
China
Prior art keywords: target, decoding, word, frame, feature vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211145889.3A
Other languages
Chinese (zh)
Inventor
王宝俊
吴人杰
方瑞东
林聚财
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202211145889.3A
Publication of CN115457938A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems
    • G10L 19/00 - Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/008 - Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention provide a method, an apparatus, a storage medium, and an electronic apparatus for recognizing wake-up words. The method includes: performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result; decoding the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result; and determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result. The method and apparatus solve the problem in the related art that crosstalk between wake-up words leads to low wake-up word recognition accuracy.

Description

Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words
Technical Field
Embodiments of the invention relate to the field of voice wake-up, and in particular to a method, an apparatus, a storage medium, and an electronic apparatus for recognizing wake-up words.
Background
In recent years, with the rapid development of information technology, speech recognition technologies have greatly facilitated and enriched people's lives. Smart home devices, video-conferencing equipment, household appliances, and the like are commonly equipped with a voice wake-up function: the user speaks a wake-up word to wake the device and then begins human-machine voice interaction with it. Voice wake-up is therefore an important link in voice interaction.
In current wake-up word recognition, multiple wake-up words are often defined, trained simultaneously, and classified within the same model. As a result, crosstalk exists between the wake-up words, the accuracy of wake-up word recognition is low, and the false wake-up rate of the device increases. The related art therefore suffers from low wake-up word recognition accuracy caused by crosstalk between wake-up words.
No effective solution has yet been proposed for this problem in the related art.
Disclosure of Invention
Embodiments of the present invention provide a method, an apparatus, a storage medium, and an electronic apparatus for recognizing wake-up words, so as to at least solve the problem in the related art that crosstalk between wake-up words leads to low wake-up word recognition accuracy.
According to an embodiment of the present invention, a method of recognizing wake-up words is provided, including: performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result; decoding the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result; and determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
In an exemplary embodiment, processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result includes: inputting the multi-frame acoustic feature vectors into the deep neural network and classifying each frame of acoustic feature vector through the deep neural network to obtain the phoneme posterior feature vector corresponding to each frame, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector.
In an exemplary embodiment, decoding the multi-frame acoustic feature vectors through the decoding graph to obtain a target decoding result includes: decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path; and determining the target path as the target decoding result.
In an exemplary embodiment, decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path includes: determining the target path among the multiple paths of the decoding graph through a token passing algorithm.
In an exemplary embodiment, recognizing the wake-up word to be recognized in the target speech signal according to the target processing result and the target decoding result includes: when the target path contains a wake-up word to be recognized, determining, among the phoneme posterior feature vectors corresponding to each frame of acoustic feature vector, the target phoneme posterior feature vector corresponding to the wake-up word to be recognized, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector and the target decoding result includes the target path; and recognizing the wake-up word to be recognized through the target phoneme posterior feature vector.
In an exemplary embodiment, recognizing the wake-up word to be recognized through the target phoneme posterior feature vector includes: determining a target distance between the target phoneme posterior feature vector and a preset standard template; and recognizing the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In an exemplary embodiment, recognizing the wake-up word to be recognized according to the relationship between the target distance and the preset standard distance includes: when the difference between the target distance and the preset standard distance is smaller than or equal to a preset threshold, determining the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
According to another embodiment of the present invention, an apparatus for recognizing wake-up words is also provided, including: an extraction module configured to perform feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors; a processing module configured to process the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result; a decoding module configured to decode the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result; and a determining module configured to determine the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, the target decoding result is obtained by decoding the acoustic feature vectors and is then verified against the multi-frame phoneme posterior feature vectors corresponding to those acoustic feature vectors. Rather than taking the decoding result directly as the recognition result, the recognition result of the wake-up word in the target speech signal is determined only after the target decoding result has been verified against the target processing result. This avoids the problem of a wake-up word in the target speech signal being recognized as another wake-up word due to crosstalk between wake-up words, and improves the accuracy of wake-up word recognition.
Drawings
Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the method of recognizing wake-up words according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method of recognizing wake-up words according to an embodiment of the invention;
FIG. 3 is a flowchart of a method of recognizing wake-up words according to a specific embodiment of the present invention;
Fig. 4 is a block diagram of an apparatus for recognizing wake-up words according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in this application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, Fig. 1 is a block diagram of the hardware structure of a mobile terminal running the method of recognizing wake-up words according to an embodiment of the present invention. As shown in Fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microcontroller (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data; the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. Those skilled in the art will understand that the structure shown in Fig. 1 is only an illustration and does not limit the structure of the mobile terminal. For example, the mobile terminal may include more or fewer components than shown in Fig. 1, or have a different configuration.
The memory 104 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the method of recognizing wake-up words in the embodiment of the present invention. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, thereby implementing the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of such a network include a wireless network provided by the communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can connect to other network devices through a base station so as to communicate with the internet. In another example, the transmission device 106 may be a radio frequency (RF) module used to communicate with the internet wirelessly.
In this embodiment, a method of recognizing wake-up words is provided. Fig. 2 is a flowchart of the method according to an embodiment of the present invention; as shown in Fig. 2, the flow includes the following steps:
Step S202: performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors;
Step S204: processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result;
Step S206: decoding the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result;
Step S208: determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
In the technical solution provided in step S202, the target speech signal is a speech signal acquired by a speech acquisition device. Before recognizing the wake-up word in the target speech signal, effective acoustic features need to be extracted from the waveform of the speech signal; feature extraction is critical to the accuracy of the subsequent wake-up word recognition system.
The features used for the acoustic feature vectors may include at least one of: MFCC (Mel-Frequency Cepstral Coefficients), FBANK (filter bank features), PLP (Perceptual Linear Prediction), PCEN (Per-Channel Energy Normalization), and the like.
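As an illustration only, the following minimal sketch extracts frame-level FBANK features with the librosa library; the sampling rate, window and hop sizes, and feature dimension are assumptions chosen for the sketch, not values fixed by this application.

```python
# A minimal sketch of acoustic feature extraction (not the patented method itself);
# sampling rate, frame length, hop size, and n_mels are illustrative assumptions.
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Return a (num_frames, feat_dim) matrix of frame-level acoustic features."""
    signal, sr = librosa.load(wav_path, sr=sr)
    # 25 ms windows with a 10 ms hop are common choices for speech.
    n_fft = int(0.025 * sr)
    hop = int(0.010 * sr)
    # FBANK (log-mel) features; MFCC, PLP, or PCEN could be used instead.
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=40)
    fbank = librosa.power_to_db(mel)          # (n_mels, num_frames)
    return fbank.T                            # one acoustic feature vector per frame
```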
In the technical solution provided in step S204, the deep neural network may be the deep neural network of a deep neural network-hidden Markov model (DNN-HMM) architecture. The multi-frame acoustic feature vectors are input into the deep neural network to obtain, for each frame, the posterior feature vector over the K predefined phoneme classes, i.e., the target processing result.
In the technical solution provided in step S206, the decoding graph is an HCLG decoding graph, a large resource graph constructed from a language model, a dictionary (lexicon), context-dependent phonemes, and HMMs. The multi-frame acoustic feature vectors are decoded on the HCLG graph to obtain the target decoding result.
The HCLG decoding graph contains multiple paths, and the decoding process selects one or more optimal paths from them as the target decoding result. Each path carries a corresponding word-level result; for example, if the optimal path decoded on the HCLG graph is path 1, the word-level result on path 1 might be "please turn on the device".
In the technical solution provided in step S208, it is determined whether a wake-up word appears in the target decoding result. If no wake-up word is found, it is determined that the target speech signal contains no wake-up word, and the device is not woken. If a wake-up word is found, the target processing result and the target decoding result are used together to determine whether the wake-up word decoded from the decoding graph is indeed contained in the target speech signal. For example, if the decoding result on the HCLG graph is "please turn on the device", the result contains the wake-up word "turn on"; the target processing result and the target decoding result are then combined to confirm that the target speech signal contains "turn on", and after confirmation "turn on" is taken as the wake-up word recognized in the target speech signal.
Through the above steps, the target decoding result is obtained by decoding the acoustic feature vectors and is verified against the multi-frame phoneme posterior feature vectors corresponding to those acoustic feature vectors. The target decoding result is not used directly as the recognition result; instead, the recognition result of the wake-up word in the target speech signal is determined only after the target decoding result has been verified against the target processing result. The phoneme posterior features are obtained by processing the acoustic feature vectors, and the decoding result obtained from the multi-frame acoustic feature vectors is matched against these phoneme posterior features to judge the accuracy of the decoding result. This reduces decoding errors caused by crosstalk between wake-up words, avoids a wake-up word in the target speech signal being recognized as another wake-up word, and improves the accuracy of wake-up word recognition.
In an optional embodiment, processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result includes: inputting the multi-frame acoustic feature vectors into the deep neural network and classifying each frame of acoustic feature vector through the deep neural network to obtain the phoneme posterior feature vector corresponding to each frame, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector.
In this embodiment, K phoneme classes are predefined, and phoneme modeling is performed for both wake-up words and non-wake-up words. Fine-grained modeling is used for wake-up words, i.e., one wake-up word comprises multiple phonemes; coarse modeling is used for non-wake-up words, i.e., whole-word modeling, with one non-wake-up word corresponding to a single modeling unit. A small sketch of such a unit inventory follows.
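As an illustration only, the sketch below shows what such a mixed-granularity unit inventory might look like; the words and unit names are hypothetical examples, not units defined by this application.

```python
# Hypothetical modeling-unit inventory: wake-up words are broken into phonemes
# (fine-grained), while each non-wake-up word maps to one whole-word unit.
WAKE_WORD_UNITS = {
    "turn on": ["t", "er", "n", "aa", "n"],   # one wake-up word -> several phonemes
}
NON_WAKE_WORD_UNITS = {
    "please": ["WORD_please"],                # one non-wake-up word -> one unit
    "device": ["WORD_device"],
}
# The union of all units defines the K classes the DNN classifies each frame into.
K = len({u for units in (*WAKE_WORD_UNITS.values(), *NON_WAKE_WORD_UNITS.values())
         for u in units})
```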
The input of the deep neural network is one frame of acoustic feature vector, and its output is the phoneme posterior feature vector for that frame. For example, suppose feature extraction on the target speech signal yields the multi-frame acoustic feature vectors u_o = {o_1, o_2, ..., o_n}, where n is the number of frames. Each frame of acoustic feature vector is input into the deep neural network in turn to obtain the phoneme posterior feature vector of the corresponding frame, so the multi-frame acoustic feature vectors correspond to the multi-frame phoneme posterior feature vectors {PG_o1, PG_o2, ..., PG_on}: the first frame o_1 corresponds to PG_o1, the second frame o_2 corresponds to PG_o2, and so on, until the n-th frame o_n corresponds to PG_on.
A phoneme posterior feature vector PG_o represents the posterior probability distribution of the corresponding acoustic feature vector o over the K predefined phoneme classes {C_1, C_2, ..., C_K}, expressed as:
PG_o = {p(C_1|o), p(C_2|o), ..., p(C_K|o)}
where p(C_1|o) is the posterior probability of the feature vector o on the class-1 phoneme, p(C_2|o) is the posterior probability on the class-2 phoneme, and so on, up to p(C_K|o), the posterior probability on the class-K phoneme.
The target processing result comprises the multi-frame phoneme posterior feature vectors, each frame of which corresponds one-to-one to a frame of the multi-frame acoustic feature vectors.
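For illustration, a minimal sketch of such a frame-level phoneme classifier is given below in PyTorch; the feedforward topology, layer sizes, and K = 64 classes are assumptions made for the sketch, since the application does not prescribe a specific network structure.

```python
# Minimal sketch of a frame-level phoneme-posterior network (illustrative only;
# the application does not prescribe this topology or these dimensions).
import torch
import torch.nn as nn

class PhonemePosteriorNet(nn.Module):
    def __init__(self, feat_dim: int = 40, num_phonemes: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, num_phonemes),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (num_frames, feat_dim) -> posteriors: (num_frames, K)."""
        return torch.softmax(self.net(frames), dim=-1)

# Usage: each row of `posteriors` is PG_o = {p(C_1|o), ..., p(C_K|o)} for one frame.
model = PhonemePosteriorNet()
frames = torch.randn(10, 40)        # 10 frames of 40-dim acoustic features
posteriors = model(frames)          # each row sums to 1 over the K phoneme classes
```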
In an optional embodiment, decoding the multi-frame acoustic feature vectors through the decoding graph to obtain a target decoding result includes: decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path; and determining the target path as the target decoding result.
In this embodiment, the HCLG decoding graph contains multiple paths, and the output labels along each path together form an output sentence or word. After the HCLG decoding graph is built, an optimal path is sought in it according to the multi-frame acoustic feature vectors, such that the cost of the output label sequence on that path, given the speech to be recognized, is as low as possible. The output label sequence read off the optimal path is the word-level recognition result; this process is decoding. Several optimal paths can also be found, and their recognition results are called an N-best list.
In an optional embodiment, decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path includes: determining the target path among the multiple paths of the decoding graph through a token passing algorithm.
In this embodiment, the token passing algorithm places a token on each start node (one token per start node; if there are multiple start nodes, each receives one token). The multi-frame acoustic feature vectors are decoded frame by frame: after the first frame (the first acoustic feature vector) is decoded, the token on the start node is passed to the next node according to the decoded information and the transfer cost is computed; the second frame is then decoded and the token is passed from the current node to the next node, with the transfer cost accumulated; all acoustic feature vectors are decoded in turn in this way. After the last frame is decoded, the state node where each token sits is examined, the path the token has traversed is traced back, and the transfer cost computed and accumulated at each transfer is read off. If a state node has multiple outgoing transitions, the token is copied and passed along each of them. At the last frame, the transfer costs of all tokens in the decoding graph are compared, and the optimal path or paths are selected by transfer cost: the lower the transfer cost, the better the corresponding path.
The optimal path is globally optimal, and a globally optimal path is necessarily locally optimal: if a path is globally optimal, it is also the locally optimal path through every state it traverses. Therefore, when multiple tokens are passed to the same state node, only the optimal token (the one with the smallest cumulative cost) is kept.
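The following is a minimal sketch of token passing over a decoding graph, using per-frame negative log posteriors as transfer costs; the graph representation and cost function are simplified assumptions for illustration and omit the language-model costs, pruning, and word-label bookkeeping of a full HCLG decoder.

```python
# Minimal token-passing sketch (illustrative; a real HCLG decoder also folds in
# language-model and transition costs, beam pruning, and word-label backtracking).
import math

def token_passing(posteriors, arcs, start_state=0):
    """posteriors: list of dicts {phoneme_label: p(C|o)}, one dict per frame.
    arcs: dict state -> list of (next_state, phoneme_label).
    Returns {state: (cumulative_cost, path)} of surviving tokens after the last frame."""
    tokens = {start_state: (0.0, [])}            # state -> (cumulative cost, path)
    for frame in posteriors:
        new_tokens = {}
        for state, (cost, path) in tokens.items():
            for nxt, label in arcs.get(state, []):
                # transfer cost: negative log posterior of this arc's phoneme
                step = -math.log(max(frame.get(label, 0.0), 1e-10))
                cand = (cost + step, path + [(nxt, label)])
                # keep only the best token arriving at each state node
                if nxt not in new_tokens or cand[0] < new_tokens[nxt][0]:
                    new_tokens[nxt] = cand
        tokens = new_tokens
    return tokens  # pick the token with the smallest cost and trace back its path
```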
In an optional embodiment, recognizing the wake-up word to be recognized in the target speech signal according to the target processing result and the target decoding result includes: when the target path contains a wake-up word to be recognized, determining, among the phoneme posterior feature vectors corresponding to each frame of acoustic feature vector, the target phoneme posterior feature vector corresponding to the wake-up word to be recognized, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector and the target decoding result includes the target path; and recognizing the wake-up word to be recognized through the target phoneme posterior feature vector.
In this embodiment, it is determined whether the output sentence or word corresponding to the target path contains a wake-up word. When it does (i.e., there is a wake-up word to be recognized), the one or more frames of phoneme posterior feature vectors corresponding to that wake-up word are located among the per-frame phoneme posterior feature vectors and determined as the target phoneme posterior feature vector, and whether the wake-up word is taken as the result of this recognition is decided according to the target phoneme posterior feature vector.
The target phoneme posterior feature vector comprises one frame or multiple frames of phoneme posterior feature vectors.
For example, suppose the sentence output on the target path is "please turn on the device", where "please" and "the device" are non-wake-up words and "turn on" is a wake-up word. In this case the target path contains a wake-up word to be recognized, so the phoneme posterior feature vectors corresponding to "turn on" are located among the multi-frame phoneme posterior feature vectors in the target processing result and determined as the target phoneme posterior feature vector. Suppose classifying the acoustic feature vectors yields 10 frames of phoneme posterior feature vectors {PG_o1, PG_o2, ..., PG_o10}, where "please" corresponds to the first frame PG_o1, "turn on" corresponds to frames 2 through 8, PG_o2, ..., PG_o8, and "the device" corresponds to frames 9 and 10, PG_o9 and PG_o10. Then {PG_o2, ..., PG_o8} is determined as the target phoneme posterior feature vector.
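Continuing the example, slicing the target frames out of the posterior matrix is direct; the frame indices below come from the hypothetical alignment in the example above.

```python
# Illustrative only: with the 10-frame example above, "turn on" spans frames
# 2-8 (1-based), so its posterior vectors are rows 1..7 of the posterior matrix.
import numpy as np

posteriors = np.random.rand(10, 64)                   # stand-in for {PG_o1, ..., PG_o10}
posteriors /= posteriors.sum(axis=1, keepdims=True)   # rows as probability vectors
target_phoneme_posteriors = posteriors[1:8]           # PG_o2 ... PG_o8 (0-based slicing)
```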
In an optional embodiment, recognizing the wake-up word to be recognized through the target phoneme posterior feature vector includes: determining a target distance between the target phoneme posterior feature vector and a preset standard template; and recognizing the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In this embodiment, a corresponding phoneme posterior feature vector sequence (i.e., a standard template) is preset for each wake-up word. The target distance between the standard template corresponding to the wake-up word to be recognized and the target phoneme posterior feature vector sequence is calculated, and whether the wake-up word to be recognized is taken as this recognition's result is decided from the relationship between the target distance and the preset standard distance.
It should be noted that a dynamic time warping (DTW) algorithm is used to calculate the target distance between the standard template and the target phoneme posterior feature vector sequence.
Suppose the phoneme posterior feature vector sequence corresponding to the standard template is u_x = {x_1, x_2, ..., x_n} and the target phoneme posterior feature vector sequence is u_y = {y_1, y_2, ..., y_m}, where n and m are the numbers of frames of the two sequences. A distance matrix D is built in which the element D(i, j) is the distance between the i-th frame vector of the standard template and the j-th frame vector of the target phoneme posterior feature vector sequence, measured with the negative-log inner product:
D(i, j) = -lg(x_i · y_j)
Let φ denote one possible alignment between u_x and u_y:
φ(k) = (i_k, j_k), k = 1, 2, ..., T
where T is the length of the alignment and k indexes it; at step k, the i_k-th frame vector of u_x is aligned with the j_k-th frame vector of u_y.
An optimal alignment φ' is found in the matrix D, and the minimum cumulative distortion value corresponding to φ' is:
Dist_φ'(u_x, u_y) = min over φ of Σ_{k=1..T} D(i_k, j_k)
This minimum cumulative distortion value is determined as the target distance between the target phoneme posterior feature vector sequence and the preset standard template.
In an optional embodiment, recognizing the wake-up word to be recognized according to the relationship between the target distance and the preset standard distance includes: when the difference between the target distance and the preset standard distance is smaller than or equal to a preset threshold, determining the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
In this embodiment, the preset standard distance is calculated from the positive-sample data of a wake-up word test set and the wake-up word's standard template. The test set contains multiple test samples, which may differ in intonation and speaking rate; calculating the distance between the posterior feature vector sequence of each test sample and the standard template therefore yields multiple matching distances, from which the standard distance is obtained. The standard distance may be the average of the matching distances, or may be determined in another way; this is not limited here.
Taking the wake-up word "turn on" as an example, suppose the test set contains 10 "turn on" test samples spoken with different accents and speaking rates. The posterior feature vector sequence of each test sample is obtained; the standard template is the posterior feature vector sequence obtained from speech spoken in Mandarin at a standard speaking rate. Distance matching between each test sample's posterior feature vector sequence and the standard template yields the standard distance.
When the difference between the calculated target distance and the standard distance is smaller than or equal to the preset threshold, the wake-up word to be recognized in the target decoding result is determined as the final recognition result, i.e., the recognition result of the wake-up word in the target speech signal. For example, the wake-up word to be recognized in the target decoding result "please turn on the device" is "turn on"; when the difference between the calculated target distance and the standard distance is smaller than or equal to the preset threshold, "turn on" is determined as the recognition result, i.e., it is determined that the target speech signal contains the wake-up word "turn on", and the device then performs the corresponding operation, namely turning on.
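A minimal sketch of this calibration-and-decision step follows, reusing the dtw_distance sketch above; taking the mean matching distance as the standard distance is one of the options the text allows, and the threshold value shown is purely illustrative.

```python
# Calibration: take the mean DTW distance of positive test samples to the
# standard template as the standard distance (one option the text allows).
# Decision: wake up if |target distance - standard distance| <= threshold.
import numpy as np

def standard_distance(template: np.ndarray, positive_samples: list) -> float:
    return float(np.mean([dtw_distance(template, s) for s in positive_samples]))

def is_wake_word(target_seq: np.ndarray, template: np.ndarray,
                 std_dist: float, threshold: float = 0.5) -> bool:
    # threshold = 0.5 is purely illustrative; in practice it is set manually/tuned
    return abs(dtw_distance(template, target_seq) - std_dist) <= threshold
```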
It is to be understood that the above-described embodiments are only a few, but not all, embodiments of the present invention.
The present invention will now be described in detail with reference to a specific embodiment.
Fig. 3 is a flowchart of a method of recognizing wake-up words according to a specific embodiment of the present invention. As shown in Fig. 3, the flow includes the following steps:
S301: acquiring a speech signal and extracting multi-frame acoustic feature vectors;
A speech signal is acquired through the speech acquisition device on the equipment, and acoustic features are extracted from it.
S302: classifying the multi-frame acoustic feature vectors using a deep neural network to obtain multi-frame phoneme posterior feature vectors;
The multi-frame acoustic feature vectors are classified using a trained model. The trained model is a deep neural network-hidden Markov model (DNN-HMM) framework whose modeling units are phonemes.
The input of the deep neural network is one frame of acoustic features and its output is a phoneme posterior feature vector. For an input acoustic feature vector o, the output is the posterior probability distribution of the feature vector over the K predefined classes {C_1, C_2, ..., C_K}:
PG_o = (p(C_1|o), p(C_2|o), ..., p(C_K|o))
where p(C_i|o) is the posterior probability of the feature vector o on the i-th class, and a class may be defined as any kind of phonetic unit, such as a phoneme. The present application uses posterior features at the phoneme level.
With the trained DNN-HMM model, the DNN yields phoneme posterior features frame by frame.
S303: constructing an HCLG decoding graph, and performing time sequence decoding to obtain an optimal path;
in the DNN-HMM model, decoding of an acoustic feature sequence is realized based on an HCLG decoding graph, and a list of recognition results (N-best list) is obtained by finding an optimal plurality of paths in the decoding graph. The construction of the decoding graph depends on a large resource graph formed by a language model, a dictionary, context phonemes and an HMM. The token passing algorithm is carried out according to frames, when the token passing is finished when the last frame is executed, the token in the termination state is checked, the optimal one or more tokens are taken, and the paths corresponding to the tokens can be taken out or traced back according to the information on the optimal one or more tokens, so that the identification result is obtained. The path accumulates the likelihood values of the acoustic model and the language model, and the path with the highest likelihood value is assumed to be decoded to wake upCumulative likelihood value of several frames of a word is P h
S304: performing phoneme posterior feature matching according to whether the awakening word is decoded in the optimal path or not, and calculating a negative logarithm inner product;
and determining whether to match the posterior probability of the phoneme according to the time sequence decoding result of the HMM. And once the awakening words are decoded in the third step, carrying out phoneme posterior probability matching according to the frames corresponding to the awakening words on the path.
The phoneme posterior probability is the phoneme classification probability output by the DNN. Temporal matching uses a dynamic time warping algorithm to compute the distance between the phoneme posterior probability sequence corresponding to the wake-up word and a standard template, with an inner-product metric used as the distance between the two sequences.
The sequence matching of the phoneme feature vectors uses the following dynamic time warping algorithm. Given the feature vector sequences of two speech segments, u_x = {x_1, x_2, ..., x_n} and u_y = {y_1, y_2, ..., y_m}, where n and m are the numbers of frames of the two sequences, a distance matrix D is built from a defined distance between speech-frame feature vectors. Let φ denote a possible alignment between u_x and u_y, φ(k) = (i_k, j_k), k = 1, 2, ..., T, where T is the length of the alignment and k indexes it; at step k the i_k-th frame vector of u_x is aligned with the j_k-th frame vector of u_y.
An optimal alignment φ' is found in the matrix D that minimizes the cumulative distortion value Dist_φ(u_x, u_y):
Dist_φ'(u_x, u_y) = min over φ of Σ_{k=1..T} D(i_k, j_k)
In this embodiment, a speech frame is represented by its phoneme posterior feature, and the negative-log inner-product metric is used:
D(i, j) = -lg(x_i · y_j)
Finally, the distance between the sequence to be matched and the template sequence is recorded as P_d, taken as the minimum cumulative distortion value Dist_φ'(u_x, u_y).
S305: performing distance matching by using positive sample data of the awakening word test set and a standard template of the awakening word to obtain a distance threshold;
DTW distance matching is carried out on positive sample data of the awakening word test set and a standard template of the awakening word, and a distance threshold (namely standard distance) can be calculated
Figure BDA0003855467640000133
The calculation of the DTW distance can be done using known techniques.
S306: comparing the difference between the distance between the sequence to be matched and the template sequence and the distance threshold value to obtain an awakening result;
comparing the distance between the current sequence to be matched and the standard template sequence, and calculating
Figure BDA0003855467640000134
If the difference value is smaller than a certain threshold value set by people, the current awakening word is judged to be the identified awakening word, and an awakening result is output.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method according to the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention or portions thereof contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, an apparatus for recognizing wake-up words is also provided. Fig. 4 is a block diagram of the structure of the apparatus according to an embodiment of the present invention; as shown in Fig. 4, the apparatus includes:
an extraction module 402 configured to perform feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors;
a processing module 404 configured to process the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result;
a decoding module 406 configured to decode the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result;
a determining module 408 configured to determine the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result.
In an optional embodiment, the apparatus is further configured to input the multi-frame acoustic feature vectors into the deep neural network and classify each frame of acoustic feature vector through the deep neural network to obtain the phoneme posterior feature vector corresponding to each frame, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector.
In an optional embodiment, the apparatus is further configured to decode the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path, and to determine the target path as the target decoding result.
In an optional embodiment, the apparatus is further configured to determine the target path among the multiple paths of the decoding graph through a token passing algorithm.
In an optional embodiment, the apparatus is further configured to: when the target path contains a wake-up word to be recognized, determine, among the phoneme posterior feature vectors corresponding to each frame of acoustic feature vector, the target phoneme posterior feature vector corresponding to the wake-up word to be recognized, where the target processing result includes the phoneme posterior feature vector corresponding to each frame of acoustic feature vector and the target decoding result includes the target path; and recognize the wake-up word to be recognized through the target phoneme posterior feature vector.
In an optional embodiment, the apparatus is further configured to determine a target distance between the target phoneme posterior feature vector and a preset standard template, and to recognize the wake-up word to be recognized according to the relationship between the target distance and a preset standard distance.
In an optional embodiment, the apparatus is further configured to determine the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal when the difference between the target distance and the preset standard distance is smaller than or equal to a preset threshold.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented in a general purpose computing device, they may be centralized in a single computing device or distributed across a network of multiple computing devices, and they may be implemented in program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, and in some cases, the steps shown or described may be executed in an order different from that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps therein may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method of recognizing wake-up words, comprising:
performing feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors;
processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result;
decoding the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result; and
determining a recognition result of a wake-up word in the target speech signal according to the target processing result and the target decoding result.
2. The method of claim 1, wherein processing the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result comprises:
inputting the multi-frame acoustic feature vectors into the deep neural network, and classifying each frame of acoustic feature vector through the deep neural network to obtain the phoneme posterior feature vector corresponding to each frame of acoustic feature vector, wherein the target processing result comprises the phoneme posterior feature vector corresponding to each frame of acoustic feature vector.
3. The method of claim 1, wherein decoding the multi-frame acoustic feature vectors through the decoding graph to obtain a target decoding result comprises:
decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path; and
determining the target path as the target decoding result.
4. The method of claim 3, wherein decoding the multi-frame acoustic feature vectors through multiple paths in the decoding graph to obtain a target path comprises:
determining the target path among the multiple paths of the decoding graph through a token passing algorithm.
5. The method of any one of claims 1 to 4, wherein determining the recognition result of the wake-up word in the target speech signal according to the target processing result and the target decoding result comprises:
when the target path contains a wake-up word to be recognized, determining, among the phoneme posterior feature vectors corresponding to each frame of acoustic feature vector, a target phoneme posterior feature vector corresponding to the wake-up word to be recognized, wherein the target processing result comprises the phoneme posterior feature vector corresponding to each frame of acoustic feature vector, and the target decoding result comprises the target path; and
recognizing the wake-up word to be recognized through the target phoneme posterior feature vector.
6. The method of claim 5, wherein recognizing the wake-up word to be recognized through the target phoneme posterior feature vector comprises:
determining a target distance between the target phoneme posterior feature vector and a preset standard template; and
recognizing the wake-up word to be recognized according to a relationship between the target distance and a preset standard distance.
7. The method of claim 6, wherein recognizing the wake-up word to be recognized according to the relationship between the target distance and the preset standard distance comprises:
when a difference between the target distance and the preset standard distance is smaller than or equal to a preset threshold, determining the wake-up word to be recognized as the recognition result of the wake-up word in the target speech signal.
8. An apparatus for recognizing wake-up words, comprising:
an extraction module configured to perform feature extraction on a target speech signal to obtain multi-frame acoustic feature vectors;
a processing module configured to process the multi-frame acoustic feature vectors through a deep neural network to obtain a target processing result;
a decoding module configured to decode the multi-frame acoustic feature vectors through a decoding graph to obtain a target decoding result; and
a determining module configured to determine a recognition result of a wake-up word in the target speech signal according to the target processing result and the target decoding result.
9. A computer-readable storage medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method as claimed in any one of claims 1 to 7 when executing the computer program.
CN202211145889.3A 2022-09-20 2022-09-20 Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words Pending CN115457938A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211145889.3A CN115457938A (en) 2022-09-20 2022-09-20 Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211145889.3A CN115457938A (en) 2022-09-20 2022-09-20 Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words

Publications (1)

Publication Number Publication Date
CN115457938A 2022-12-09

Family

ID=84305295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211145889.3A Pending CN115457938A (en) 2022-09-20 2022-09-20 Method, apparatus, storage medium and electronic apparatus for recognizing wake-up words

Country Status (1)

Country Link
CN (1) CN115457938A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168703A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice recognition method, device, system, computer equipment and storage medium
CN116168687A (en) * 2023-04-24 2023-05-26 北京探境科技有限公司 Voice data processing method and device, computer equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination