CN114927128A - Voice keyword detection method and device, electronic equipment and readable storage medium - Google Patents

Voice keyword detection method and device, electronic equipment and readable storage medium Download PDF

Info

Publication number
CN114927128A
CN114927128A
Authority
CN
China
Prior art keywords
voice
syllable
keyword
target keyword
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210424846.2A
Other languages
Chinese (zh)
Inventor
王东 (Dong Wang)
李蓝天 (Lantian Li)
杜文强 (Wenqiang Du)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202210424846.2A priority Critical patent/CN114927128A/en
Publication of CN114927128A publication Critical patent/CN114927128A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for detecting a voice keyword, electronic equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a voice segment to be detected and a target keyword, wherein the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables; extracting the voice characteristic of each syllable based on each syllable and the voice fragment, and calculating the correlation between each syllable and the voice fragment according to the voice characteristic of each syllable and the basic voice mode of each syllable to obtain a correlation matrix between the target keyword and the voice fragment; searching the best matching path between the target keyword and the voice fragment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice fragment; and if the matching probability is greater than or equal to a preset threshold value, judging that the target keyword is contained in the voice segment.

Description

Voice keyword detection method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for detecting a speech keyword, an electronic device, and a readable storage medium.
Background
Detecting specific keywords or phrases in speech has a wide range of application scenarios. For example, in the field of smart appliances, keyword detection is used for device voice wake-up and voice commands; in live-stream review, keyword detection is used for early warning of pornographic, violent, profane and abusive language; in multimedia data archiving, keyword detection is used for audio and video search.
Existing keyword detection methods generally fall into the following types:
First: large-scale continuous speech recognition. The most direct approach is to convert the audio into text using large-scale continuous speech recognition and then detect keywords in the text. This approach has two drawbacks: (1) large-scale continuous speech recognition consumes too many computing resources, is unsuitable for large-scale online detection, and cannot run on devices with low computing power; (2) it is difficult to detect words that do not appear in the vocabulary.
Second: partial decoding. Keywords are detected by designing a small decoding graph containing the keyword and filler components. Because the decoding graph is designed around the target keywords and is small, the amount of computation is low and the method can run on embedded devices. At the same time, the decoding graph is easy to design and generate, so detection of arbitrary keywords can be supported. The problem with this approach is that the path weights have to be re-tuned for different keywords, and it is less robust to noise and overlapping speech.
Third: end-to-end model methods. The basic scheme of the end-to-end model method is: given a speech segment, a neural network directly judges whether a specified keyword is contained in it, outputting 1 if so and 0 otherwise. The biggest drawback of this approach is that the network is keyword-specific, so replacing a keyword requires retraining. Moreover, training the network for each keyword requires a large number of speech segments containing the target keyword, which consumes too many resources.
Disclosure of Invention
The invention provides a voice keyword detection method and device, an electronic device and a readable storage medium, which address the defects of the prior art, namely the large amount of computation, the need to retrain for each target keyword, and the poor robustness to interference in the speech segment, and realize efficient and accurate detection of the target keyword.
The invention provides a method for detecting a voice keyword, which comprises the following steps:
acquiring a voice segment to be detected and a target keyword, wherein the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables;
extracting the voice characteristic of each syllable based on each syllable and the voice fragment, and calculating the correlation between each syllable and the voice fragment according to the voice characteristic of each syllable and the basic voice mode of each syllable to obtain a correlation matrix between the target keyword and the voice fragment;
searching the best matching path between the target keyword and the voice segment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice segment;
and if the matching probability is greater than or equal to a preset threshold value, judging that the target keyword is contained in the voice segment.
According to the method for detecting the voice keyword provided by the invention, the extracting the voice feature of each syllable based on each syllable and the voice segment specifically comprises the following steps:
obtaining a masking mode of each syllable;
masking the speech vector of each frame in the speech segment based on a masking pattern for each of the syllables;
extracting the voice characteristics corresponding to each syllable.
According to the method for detecting the voice keyword provided by the invention, the correlation between each syllable and the voice segment is calculated according to the voice feature of each syllable and the basic voice mode of each syllable, so as to obtain the correlation matrix between the target keyword and the voice segment, and the method specifically comprises the following steps:
acquiring a basic voice mode of each syllable;
performing dot product operation between the basic voice mode of the single syllable and the voice characteristics of the single syllable and the voice fragment to obtain the correlation degree between the single syllable and the voice fragment;
and calculating the correlation degree between each syllable and the voice segment to obtain a correlation degree matrix between the target keyword and the voice segment.
According to the method for detecting the voice keyword provided by the invention, the calculating of the matching probability of the target keyword and the voice segment specifically comprises the following steps:
calculating the average matching score of the best matching path according to the best matching path;
and acquiring the matching probability of the target keyword and the voice fragment according to the average matching score.
According to the method for detecting the voice keyword provided by the invention, the calculating of the average matching score of the best matching path according to the best matching path specifically comprises the following steps:
acquiring the frame number corresponding to the optimal matching path;
calculating the accumulated value of the correlation degree of each syllable in the optimal matching path and each frame of voice vector;
and dividing the accumulated value by the frame number to obtain the average matching score of the optimal matching path.
The method for detecting the voice keywords further comprises the following steps:
and if the matching probability is smaller than the preset threshold value, judging that the target keyword is not included in the voice fragment.
The invention also provides a voice keyword detection device, which comprises:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice segment to be detected and a target keyword, the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables;
the first calculation module is used for extracting the voice characteristics of each syllable based on each syllable and the voice fragment, calculating the correlation between each syllable and the voice fragment according to the voice characteristics of each syllable and the basic voice mode of each syllable, and obtaining a correlation matrix between the target keyword and the voice fragment;
the second calculation module is used for searching the best matching path between the target keyword and the voice segment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice segment;
and the result judging module is used for judging that the voice segment comprises the target keyword if the matching probability is greater than or equal to a preset threshold value.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, wherein when the processor executes the program, the processor realizes the detection method of the voice keyword.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method for detecting a speech keyword as in any one of the above.
The invention also provides a computer program product comprising a computer program, wherein the computer program is used for realizing the method for detecting the voice keywords as mentioned in any one of the above when being executed by a processor.
The invention provides a method and a device for detecting a voice keyword, electronic equipment and a readable storage medium, wherein the detection method comprises the following steps: acquiring a voice segment to be detected and a target keyword, wherein the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables;
extracting the voice characteristic of each syllable based on each syllable and the voice fragment, and calculating the correlation between each syllable and the voice fragment according to the voice characteristic of each syllable and the basic voice mode of each syllable to obtain a correlation matrix between the target keyword and the voice fragment;
searching the optimal matching path between the target keyword and the voice fragment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice fragment;
and if the matching probability is greater than or equal to a preset threshold value, it is determined that the speech segment contains the target keyword. Through the above steps it is determined whether the speech segment to be detected contains the target keyword. The scheme provided by the invention learns a mask for each syllable and captures the corresponding spectral components from the speech segment through the mask. Therefore, even in the presence of noise interference or pronunciation aliasing, the target keyword can still be captured from the speech segment, giving strong robustness to interference.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a method for detecting a speech keyword according to the present invention;
FIG. 2 is a second flowchart illustrating a method for detecting a speech keyword according to the present invention;
FIG. 3 is a third schematic flowchart of a method for detecting a speech keyword according to the present invention;
FIG. 4 is a fourth schematic flowchart of a method for detecting a speech keyword according to the present invention;
FIG. 5 is a schematic structural diagram of a device for detecting a speech keyword according to the present invention;
FIG. 6 is a schematic diagram of an electronic device provided by the present invention;
FIG. 7 is a diagram illustrating the masking of a speech segment according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method for detecting a speech keyword according to the present invention is described below with reference to fig. 1 to 4.
Fig. 1 is a schematic flow chart of a method for detecting a speech keyword according to an embodiment of the present invention.
Compared with the defects in the prior art, the technical scheme of the invention does not limit specific keywords, and can accurately capture the target keywords under the conditions of noise interference and pronunciation aliasing in the voice fragments.
As shown in fig. 1, an embodiment of the present invention provides a method for detecting a speech keyword, including the following steps:
101. Acquire a speech segment X to be detected and a target keyword Q, where the speech segment X is a sequence of t frame speech vectors x, i.e. X = [x_1, x_2, ..., x_t], and the target keyword Q is a sequence of n syllables q, i.e. Q = [q_1, q_2, ..., q_n], where t and n are positive integers.
First, a fixed-length speech segment X containing t frame speech vectors x is given, X = [x_1, x_2, ..., x_t]. It may be, for example, a segment captured for voice wake-up or voice commands on a smart appliance, a segment from a live-stream video, or a segment from archived multimedia data. In the present invention, the log energy spectrum (LPS) of each frame is used as the speech vector x of that frame.
Given a target keyword Q, any target keyword Q can be decomposed into a sequence of syllables q, i.e. Q = [q_1, q_2, ..., q_n]. Syllables are the units that make up a speech sequence and are also the most natural structural unit of spoken language.
The object of the present invention is to detect whether the target keyword Q is contained in the speech segment X.
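As a non-limiting illustration of this data representation, the following sketch (in Python with NumPy) shows one possible way of turning a waveform into the LPS frame sequence X and a keyword into its syllable sequence Q. The frame length, hop size, FFT size, keyword and pinyin lexicon are assumptions made for illustration only and are not part of the invention.

import numpy as np

def log_power_spectrum(wave, frame_len=400, hop=160, n_fft=512):
    # split the waveform into frames and take the log power spectrum (LPS)
    # of each frame; each row of the result is one frame speech vector x
    n_frames = 1 + max(0, (len(wave) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([wave[k * hop: k * hop + frame_len] * window
                       for k in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(spec + 1e-10)                 # shape (t, n_fft // 2 + 1)

# hypothetical syllable lexicon mapping a keyword to its syllable sequence Q
LEXICON = {"ni hao xiao zhi": ["ni3", "hao3", "xiao3", "zhi4"]}

wave = np.random.randn(16000).astype(np.float32)   # stand-in for 1 s of 16 kHz audio
X = log_power_spectrum(wave)                        # speech segment X = [x_1, ..., x_t]
Q = LEXICON["ni hao xiao zhi"]                      # target keyword Q = [q_1, ..., q_n]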
102. Extract the speech feature of each syllable q based on the syllable q and the speech segment X, and calculate the correlation between each syllable q and the speech segment X according to the speech feature of each syllable q and the basic speech pattern b(q) of each syllable q, obtaining a correlation matrix between the target keyword Q and the speech segment X.
103. Search the best matching path p between the target keyword Q and the speech segment X based on the correlation matrix R, and from it calculate the matching probability of the target keyword Q and the speech segment X. That is, the speech feature v is compared with the basic speech pattern b(q) of syllable q to obtain the correlation between each frame speech vector x and the syllable q; the best matching path is then solved with a dynamic programming algorithm that takes the temporal order of x and q into account.
104. If the matching probability is greater than or equal to a preset threshold, it is determined that the speech segment X contains the target keyword Q; if the matching probability is smaller than the preset threshold, it is determined that the target keyword Q is not contained in the speech segment X.
Specifically, as shown in fig. 2, the process of extracting the speech feature of each syllable q of the target keyword in step 102 includes the following steps:
201. Obtain the masking pattern m(q) of each syllable q.
202. Mask each frame speech vector x in the speech segment X based on the masking pattern m(q) of each syllable q.
203. Extract the speech feature corresponding to each syllable q.
Here the speech feature of each syllable q is v = m(q) ⊙ X, i.e. the mask pattern of each syllable q is multiplied element-wise with each frame speech vector, where ⊙ denotes the Hadamard product.
Further, as shown in fig. 7, for a syllable q in a speech segment X, the segment is divided into two parts: by alignment, the position of the syllable q in the speech segment is obtained, and the part of the segment outside this position is masked. For example, for syllable A in the figure, all positions other than A are masked.
Concretely, for the original signal x = [A P P L E], x_mask = x · [0 1 0 0 0], where the position of the 1 is the position of syllable P; by element-wise multiplication, x_mask = [0 P 0 0 0].
The description above is at the signal level; the actual masking operation is performed at the feature level, e.g. MFCC/fbank features are first extracted from the signal x and the mask is then applied to them.
Specifically, the invention provides a keyword processing method called "voice carving". Its core idea is to learn the pronunciation characteristics of each syllable q (or phoneme) in a specific context and to formalize these characteristics as a mask pattern m(q).
Voice carving defines a mask m(q) for each syllable q (or context-dependent syllable). The mask can be learned by, but is not limited to, a neural network. m(q) is the carving knife in the voice carving mentioned above: it carves the syllable q of interest out of the mixed audio, focusing only on the characteristics of syllable q while the features of the other syllables are masked. The mask m(q) is used to extract the speech features associated with each syllable q from each frame speech vector x in the original speech segment X.
The input speech segment X is masked based on m(q) and the speech feature v corresponding to q is extracted, specifically as follows: m(q) is a one-dimensional vector of the same length as x, and each of its elements lies between 0 and 1. The speech feature extraction process is as follows (this speech feature is the feature obtained for syllable q on the speech segment X):
v_q = m(q) ⊙ X    (1)
where ⊙ denotes element-wise multiplication; the formula states that the speech feature of a single syllable q is the product of the mask m(q) of this syllable q and each frame speech vector x in the speech segment X.
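A minimal sketch of the masking operation of Eq. (1), assuming the mask m(q) has already been learned (Python/NumPy; the toy dimensions are for illustration only):

import numpy as np

def syllable_feature(X, mask_q):
    # Eq. (1): v_q = m(q) ⊙ X, applied to every frame of the speech segment
    # X: (t, d) frame speech vectors, mask_q: (d,) with elements in [0, 1]
    return X * mask_q

# toy example: t = 3 frames, d = 4 spectral dimensions, mask focusing on dims 1 and 2
X = np.arange(12, dtype=float).reshape(3, 4)
m_q = np.array([0.0, 1.0, 0.8, 0.0])
v_q = syllable_feature(X, m_q)   # dimensions 0 and 3 are masked to zero in every frame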
Further, as shown in fig. 3, calculating in step 102 the correlation between each syllable q and the speech segment X according to the speech feature of each syllable q and the basic speech pattern b(q) of each syllable q, to obtain a correlation matrix between the target keyword Q and the speech segment X, specifically includes the following steps:
301. Obtain the basic speech pattern b(q) of each syllable q. Specifically, a basic speech pattern b(q) is defined for each syllable q (this basic speech pattern can be understood as a word vector of the syllable q).
In the invention, the correlation r between the speech segment X and a syllable q is defined as the correlation between the speech feature v_q of the syllable q and the basic speech pattern b(q) of the syllable. The correlation r can be calculated by a simple dot product or by a more complex neural network; in step 302 the dot-product metric is taken as an example.
302. Perform a dot-product operation between the basic speech pattern b(q) of the single syllable q and the speech feature v_q of the single syllable q on the speech segment X, to obtain the correlation r between the single syllable q and the speech segment X.
Expressed as a formula:
r = b(q) · v_q    (2)
303. Calculate the correlation between each syllable q and the speech segment X to obtain a correlation matrix R between the target keyword Q and the speech segment X. Specifically, considering that there are n syllables in Q, a correlation matrix R between the speech segment X and the keyword Q can be calculated:
R = b(Q) v(Q)    (3)
where b(Q) denotes the basic speech patterns of the syllables in the target keyword Q, and v(Q) denotes the speech features extracted for each syllable of the target keyword Q from the frame speech vectors x of the speech segment X.
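The dot-product correlation of Eqs. (2) and (3) can be sketched as follows (Python/NumPy; the masks and basic speech patterns are random stand-ins for learned parameters, and the dimensions are illustrative assumptions):

import numpy as np

def correlation_matrix(X, masks, base_patterns):
    # R[i, j] = b(q_i) · (m(q_i) ⊙ x_j): correlation between syllable q_i and frame x_j
    n = len(masks)
    t = X.shape[0]
    R = np.zeros((n, t))
    for i in range(n):
        v_q = X * masks[i]             # masked features of syllable q_i, shape (t, d)
        R[i] = v_q @ base_patterns[i]  # dot product with the basic speech pattern b(q_i)
    return R

# toy example: keyword of n = 3 syllables, segment of t = 4 frames, d = 8 dimensions
rng = np.random.default_rng(0)
X = rng.random((4, 8))
masks = rng.random((3, 8))           # stand-ins for the learned masks m(q)
base_patterns = rng.random((3, 8))   # stand-ins for the basic speech patterns b(q)
R = correlation_matrix(X, masks, base_patterns)   # shape (3, 4): one row per syllable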
The following table gives an example of a speech segment X - keyword Q correlation matrix R, where the speech segment X contains 4 frames of speech in total and the keyword Q contains 3 syllables A, B, C. The value in each cell of the table is the corresponding correlation r.
Table 1: speech-syllable correlation matrix example
[Table 1 appears as an image in the original publication: a 3 × 4 matrix of correlations r between the syllables A, B, C (rows) and the 4 speech frames (columns); the shaded cells mark the best matching path, whose correlations are 0.8, 0.7, 0.8 and 0.7.]
In step 103, the best path p is first defined to satisfy the following constraints: (1) p must start at q_1 and finish at q_n; (2) the speech frame index (1, 2, ..., t) corresponding to each step of p must increase, and the syllable index (1, 2, ..., n) must not decrease. |p| denotes the length of the path p. Note that, since Q may be contained in a sub-segment of X, p does not necessarily enter at the first speech frame, nor does it necessarily finish at the t-th speech frame. Therefore n ≤ |p| ≤ t.
To determine p, a time-sequential search algorithm is employed. First, path expansion onto a syllable is started only when the correlation r continuously reaches a certain level, to ensure that the speech covered by the best matching path p really starts at q_1; second, a Viterbi algorithm is adopted during the search to guarantee search efficiency; third, at the end of the search, the best match p and the corresponding best average path matching score s are selected from among all paths that finish at q_n.
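The following is a simplified sketch of such a search (Python/NumPy). It maximises the accumulated correlation under the two constraints above and then ranks the finished paths by their average score; the condition for starting a path expansion is approximated here by only extending a q_1 hypothesis while its accumulated correlation stays positive, which is an assumption of this sketch rather than the exact criterion of the invention.

import numpy as np

def best_match_path(R):
    # R: (n syllables, t frames) correlation matrix
    n, t = R.shape
    NEG = -np.inf
    acc = np.full((n, t), NEG)                 # best accumulated correlation ending at (i, j)
    length = np.zeros((n, t), dtype=int)       # number of frames on that path
    back = np.full((n, t, 2), -1, dtype=int)   # back-pointer (previous syllable, previous frame)

    for j in range(t):
        for i in range(n):
            if i == 0:
                stay = acc[0, j - 1] if j > 0 else NEG
                if stay > 0:                   # extend an existing q_1 hypothesis
                    acc[0, j] = stay + R[0, j]
                    length[0, j] = length[0, j - 1] + 1
                    back[0, j] = (0, j - 1)
                else:                          # start a fresh path at frame j
                    acc[0, j] = R[0, j]
                    length[0, j] = 1
            elif j > 0:
                stay, advance = acc[i, j - 1], acc[i - 1, j - 1]
                prev_i = i if stay >= advance else i - 1
                if acc[prev_i, j - 1] > NEG:   # reachable only via a valid path
                    acc[i, j] = acc[prev_i, j - 1] + R[i, j]
                    length[i, j] = length[prev_i, j - 1] + 1
                    back[i, j] = (prev_i, j - 1)

    # only paths that finish on the last syllable q_n are admissible
    finals = [(acc[n - 1, j] / length[n - 1, j], j)
              for j in range(t) if acc[n - 1, j] > NEG]
    if not finals:
        return 0.0, []
    s, j = max(finals)                         # best average matching score
    path, i = [], n - 1
    while i >= 0 and j >= 0:                   # follow the back-pointers
        path.append((i, j))
        i, j = back[i, j]
    return s, path[::-1]

# toy matrix whose best path reproduces the values quoted for table 1
R = np.array([[0.8, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.8, 0.1],
              [0.1, 0.1, 0.1, 0.7]])
s, path = best_match_path(R)                   # s == 0.75 on this example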
Further, as shown in fig. 4, calculating the average matching score s of the best matching path according to the best matching path p specifically includes the following steps:
401. Obtain the number of frames |p| covered by the best matching path p;
402. Calculate the accumulated value R_p of the correlations r between each syllable q and each frame speech vector x on the best matching path p;
403. Divide the accumulated value R_p by the number of frames |p| to obtain the average matching score s of the best matching path p.
Given a speech-syllable correlation matrix R, the best matching path p can be searched for and the average matching score of the best matching path can be calculated as s = R_p / |p|.
For example, the gray cells in table 1 form the best path p in the correlation matrix; the number of frames covered by the best matching path p is |p| = 4, the accumulated value of the correlations r between each syllable q and each frame speech vector x is R_p = 0.8 + 0.7 + 0.8 + 0.7 = 3.0, and the average matching score is s = R_p / |p| = 0.75.
Further, calculating the matching probability of the target keyword Q and the voice segment X specifically includes: calculating the average matching score of the best matching path according to the best matching path;
and acquiring the matching probability of the target keyword Q and the voice fragment X according to the average matching score s.
Based on the score s, a decision can be made as to whether the keyword Q is present. During training, end-to-end training is adopted: the model outputs s, and after normalization by a Sigmoid function the matching probability of X and Q is obtained as p(1 | X, R) = Sigmoid(s). The goal of training is for p(1 | X, R) to be close to 1 on positive examples and close to 0 on negative examples, so the cross-entropy criterion is used for training.
During actual training, a large number of positive and negative examples are required. These examples can be randomly generated based on any one of the speech recognition databases: if the text corresponding to the speech segment is selected, a positive example is obtained, and if a random text is selected, a negative example is obtained. Therefore, the model can be trained based on the existing voice database, and data of specific keywords does not need to be collected additionally.
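A minimal sketch of this training criterion and of the random example generation (Python/NumPy); the 50/50 split between positive and negative examples is an assumption made for illustration:

import numpy as np

def matching_probability(s):
    # normalise the average matching score s with a Sigmoid
    return 1.0 / (1.0 + np.exp(-s))

def cross_entropy_loss(s, label):
    # push the probability towards 1 on positive examples and 0 on negative ones
    p = matching_probability(s)
    eps = 1e-7
    return -(label * np.log(p + eps) + (1 - label) * np.log(1.0 - p + eps))

def make_example(utterance, transcript, other_transcripts, rng):
    # positive example: the speech segment paired with its own transcript;
    # negative example: the same segment paired with a randomly chosen other text
    if rng.random() < 0.5:
        return utterance, transcript, 1
    return utterance, other_transcripts[rng.integers(len(other_transcripts))], 0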
The invention takes a neural network model as the basic structure. The neural network is used to extract the speech features of X, to compute the mask m(q) and the basic speech pattern b(q) of each syllable q, and to compute more complex correlations r. The invention does not restrict the structure of the neural network; a fully connected neural network or a convolutional neural network can both be applied.
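For illustration only, the learnable quantities m(q) and b(q) can be held in simple per-syllable tables as sketched below (Python/NumPy); in practice they may be produced by a fully connected or convolutional neural network, as stated above. The syllable inventory size and feature dimension are assumptions of this sketch.

import numpy as np

class SyllableCarver:
    # one learnable mask vector m(q) and one basic speech pattern b(q) per syllable
    def __init__(self, n_syllables, feat_dim, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        self.mask_logits = rng.normal(size=(n_syllables, feat_dim))
        self.base_patterns = rng.normal(size=(n_syllables, feat_dim))

    def mask(self, syllable_id):
        # a Sigmoid keeps every element of m(q) between 0 and 1
        return 1.0 / (1.0 + np.exp(-self.mask_logits[syllable_id]))

    def base_pattern(self, syllable_id):
        return self.base_patterns[syllable_id]

# usage: carve syllable features out of a speech segment X of shape (t, feat_dim)
carver = SyllableCarver(n_syllables=400, feat_dim=257)
X = np.random.randn(98, 257)
v_q = X * carver.mask(syllable_id=12)          # Eq. (1)
r = v_q @ carver.base_pattern(syllable_id=12)  # Eq. (2), one correlation per frame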
The technical scheme of the invention has the following advantages:
First: keywords can be replaced freely. The scheme provided by the invention is independent of any specific keyword. The model learns the pronunciation characteristics of each syllable, and any keyword can be decomposed into a syllable sequence. The scheme can therefore be applied quickly to any keyword without retraining.
Second: strong robustness to interference. The scheme provided by the invention learns a mask for each syllable and captures the corresponding spectral components from the speech through the mask. Therefore, even in the presence of noise interference or pronunciation aliasing, the target keyword can still be captured from the speech segment.
The following describes a device for detecting a voice keyword according to the present invention, and a device for detecting a voice keyword and a method for detecting a voice keyword described above can be referred to in a corresponding manner.
Fig. 5 is a schematic structural diagram of a device for detecting a voice keyword according to an embodiment of the present invention.
As shown in fig. 5, an embodiment of the present invention provides a device for detecting a speech keyword, including the following modules: an acquisition module 51, a first calculation module 52, a second calculation module 53, and a result determination module 54.
Specifically, the obtaining module 51 is configured to obtain a speech segment X to be detected and a target keyword Q, where the speech segment X is a sequence of t frame speech vectors x, i.e. X = [x_1, x_2, ..., x_t], and the target keyword Q is a sequence of n syllables q, i.e. Q = [q_1, q_2, ..., q_n], where t and n are both positive integers.
The first calculating module 52 is configured to extract a speech feature of each syllable Q based on each syllable Q and the speech segment X, and calculate a correlation between each syllable Q and the speech segment X according to the speech feature of each syllable Q and the basic speech pattern b (Q) of each syllable Q, so as to obtain a correlation matrix between the target keyword Q and the speech segment X.
The second calculating module 53 is configured to search for a best matching path between the target keyword Q and the voice segment X based on the correlation matrix, so as to calculate a matching probability between the target keyword Q and the voice segment X.
If the matching probability is greater than or equal to the preset threshold, the result determination module 54 determines that the target keyword Q is included in the voice segment X.
Fig. 6 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 6: a processor (processor)610, a communication Interface (Communications Interface)620, a memory (memory)630 and a communication bus 640, wherein the processor 610, the communication Interface 620 and the memory 630 communicate with each other via the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform a method of detecting a speech keyword, the method comprising the steps of:
acquiring a speech segment X to be detected and a target keyword Q, wherein the speech segment X is a sequence of t frame speech vectors x, i.e. X = [x_1, x_2, ..., x_t], and the target keyword Q is a sequence of n syllables q, i.e. Q = [q_1, q_2, ..., q_n], wherein t and n are both positive integers;
extracting the voice characteristic of each syllable Q based on each syllable Q and the voice fragment X, and calculating the correlation between each syllable Q and the voice fragment X according to the voice characteristic of each syllable Q and the basic voice mode b (Q) of each syllable Q to obtain a correlation matrix between the target keyword Q and the voice fragment X;
searching the optimal matching path between the target keyword Q and the voice fragment X based on the correlation matrix, thereby calculating the matching probability of the target keyword Q and the voice fragment X;
and if the matching probability is greater than or equal to a preset threshold value, judging that the voice fragment X comprises the target keyword Q.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention or a part thereof which substantially contributes to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, where the computer program product includes a computer program, the computer program can be stored on a non-transitory computer readable storage medium, and when the computer program is executed by a processor, a computer can execute a method for detecting a speech keyword provided by the above methods, where the method includes:
obtaining a speech segment X to be detected and a target keyword Q, wherein the speech segment X is a sequence of t frame speech vectors x, i.e. X = [x_1, x_2, ..., x_t], and the target keyword Q is a sequence of n syllables q, i.e. Q = [q_1, q_2, ..., q_n], wherein t and n are positive integers;
extracting the voice characteristic of each syllable Q based on each syllable Q and the voice fragment X, and calculating the correlation between each syllable Q and the voice fragment X according to the voice characteristic of each syllable Q and the basic voice mode b (Q) of each syllable Q to obtain a correlation matrix between the target keyword Q and the voice fragment X;
searching the optimal matching path between the target keyword Q and the voice fragment X based on the correlation matrix, thereby calculating the matching probability of the target keyword Q and the voice fragment X;
and if the matching probability is greater than or equal to a preset threshold value, judging that the voice fragment X comprises the target keyword Q.
In still another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to execute the method for detecting a speech keyword provided by the above methods, the method comprising the steps of:
acquiring a speech segment X to be detected and a target keyword Q, wherein the speech segment X is a sequence of t frame speech vectors x, i.e. X = [x_1, x_2, ..., x_t], and the target keyword Q is a sequence of n syllables q, i.e. Q = [q_1, q_2, ..., q_n], wherein t and n are positive integers;
extracting the voice feature of each syllable Q based on each syllable Q and the voice fragment X, and calculating the correlation between each syllable Q and the voice fragment X according to the voice feature of each syllable Q and the basic voice mode b (Q) of each syllable Q to obtain a correlation matrix between the target keyword Q and the voice fragment X;
searching the optimal matching path between the target keyword Q and the voice fragment X based on the correlation matrix, thereby calculating the matching probability of the target keyword Q and the voice fragment X;
and if the matching probability is greater than or equal to a preset threshold value, judging that the voice fragment X comprises the target keyword Q.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting a voice keyword is characterized by comprising the following steps:
acquiring a voice segment to be detected and a target keyword, wherein the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables;
extracting the voice characteristic of each syllable based on each syllable and the voice fragment, and calculating the correlation between each syllable and the voice fragment according to the voice characteristic of each syllable and the basic voice mode of each syllable to obtain a correlation matrix between the target keyword and the voice fragment;
searching the best matching path between the target keyword and the voice segment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice segment;
and if the matching probability is greater than or equal to a preset threshold value, judging that the target keyword is contained in the voice segment.
2. The method for detecting a keyword as claimed in claim 1, wherein the extracting the phonetic feature of each syllable based on each syllable and the phonetic segment specifically comprises:
obtaining a masking mode of each syllable;
masking the speech vector of each frame in the speech segment based on a masking pattern for each of the syllables;
extracting the voice characteristics corresponding to each syllable.
3. The method according to claim 2, wherein the calculating a correlation between each syllable and the speech segment according to the speech feature of each syllable and the basic speech pattern of each syllable to obtain a correlation matrix between the target keyword and the speech segment includes:
acquiring a basic voice mode of each syllable;
performing dot product operation between the basic voice mode of the single syllable and the voice characteristics of the single syllable and the voice fragment to obtain the correlation degree between the single syllable and the voice fragment;
and calculating the correlation degree between each syllable and the voice segment to obtain a correlation degree matrix between the target keyword and the voice segment.
4. The method for detecting a voice keyword according to claim 1, wherein the calculating of the matching probability between the target keyword and the voice segment specifically includes:
calculating the average matching score of the best matching path according to the best matching path;
and acquiring the matching probability of the target keyword and the voice fragment according to the average matching score.
5. The method for detecting a speech keyword according to claim 4, wherein the calculating an average matching score of the best matching path according to the best matching path specifically includes:
acquiring the frame number corresponding to the optimal matching path;
calculating the accumulated value of the correlation degree of each syllable in the optimal matching path and each frame of voice vector;
and dividing the accumulated value by the frame number to obtain the average matching score of the optimal matching path.
6. The method for detecting a speech keyword according to claim 1, further comprising:
and if the matching probability is smaller than the preset threshold value, judging that the target keyword is not included in the voice segment.
7. A detection apparatus for a speech keyword, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice segment to be detected and a target keyword, the voice segment is a sequence comprising a plurality of voice vectors, and the target keyword is a sequence comprising a plurality of syllables;
the first calculation module is used for extracting the voice characteristics of each syllable based on each syllable and the voice fragment, and calculating the correlation between each syllable and the voice fragment according to the voice characteristics of each syllable and the basic voice mode of each syllable to obtain a correlation matrix between the target keyword and the voice fragment;
the second calculation module is used for searching the best matching path between the target keyword and the voice segment based on the correlation matrix so as to calculate the matching probability of the target keyword and the voice segment;
and the result judging module is used for judging that the voice segment comprises the target keyword if the matching probability is greater than or equal to a preset threshold value.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method for detecting the speech keyword according to any one of claims 1 to 6 when executing the program.
9. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the method for detecting a speech keyword according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method for detecting a speech keyword according to any one of claims 1 to 6.
CN202210424846.2A 2022-04-21 2022-04-21 Voice keyword detection method and device, electronic equipment and readable storage medium Pending CN114927128A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210424846.2A CN114927128A (en) 2022-04-21 2022-04-21 Voice keyword detection method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210424846.2A CN114927128A (en) 2022-04-21 2022-04-21 Voice keyword detection method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114927128A true CN114927128A (en) 2022-08-19

Family

ID=82805935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210424846.2A Pending CN114927128A (en) 2022-04-21 2022-04-21 Voice keyword detection method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114927128A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357959A (en) * 2022-10-20 2022-11-18 广东时谛智能科技有限公司 Shoe body model design method and device based on voice instruction design

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125874A1 (en) * 2014-10-31 2016-05-05 Kabushiki Kaisha Toshiba Method and apparatus for optimizing a speech recognition result
CN111328416A (en) * 2017-11-15 2020-06-23 国际商业机器公司 Speech patterns for fuzzy matching in natural language processing
CN112201246A (en) * 2020-11-19 2021-01-08 深圳市欧瑞博科技股份有限公司 Intelligent control method and device based on voice, electronic equipment and storage medium
CN114255739A (en) * 2020-09-21 2022-03-29 ***通信集团设计院有限公司 Method and device for recognizing keywords in voice
CN114333790A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160125874A1 (en) * 2014-10-31 2016-05-05 Kabushiki Kaisha Toshiba Method and apparatus for optimizing a speech recognition result
CN111328416A (en) * 2017-11-15 2020-06-23 国际商业机器公司 Speech patterns for fuzzy matching in natural language processing
CN114255739A (en) * 2020-09-21 2022-03-29 ***通信集团设计院有限公司 Method and device for recognizing keywords in voice
CN112201246A (en) * 2020-11-19 2021-01-08 深圳市欧瑞博科技股份有限公司 Intelligent control method and device based on voice, electronic equipment and storage medium
CN114333790A (en) * 2021-12-03 2022-04-12 腾讯科技(深圳)有限公司 Data processing method, device, equipment, storage medium and program product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENG Shuqing; LI Wanwei; XU Jian: "Research on sentiment word recognition based on syntactic dependency rules and part-of-speech features" (基于句法依赖规则和词性特征的情感词识别研究), 情报理论与实践 (Information Studies: Theory & Application), no. 05, 15 November 2017 (2017-11-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115357959A (en) * 2022-10-20 2022-11-18 广东时谛智能科技有限公司 Shoe body model design method and device based on voice instruction design

Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
Jansen et al. Efficient spoken term discovery using randomized algorithms
CN111798840B (en) Voice keyword recognition method and device
JP5440177B2 (en) Word category estimation device, word category estimation method, speech recognition device, speech recognition method, program, and recording medium
CN110189749A (en) Voice keyword automatic identifying method
EP4018437B1 (en) Optimizing a keyword spotting system
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
KR100904049B1 (en) System and Method for Classifying Named Entities from Speech Recongnition
CN105551485B (en) Voice file retrieval method and system
CN114627863B (en) Speech recognition method and device based on artificial intelligence
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
CN112380319A (en) Model training method and related device
CN111161726B (en) Intelligent voice interaction method, device, medium and system
CN110717027B (en) Multi-round intelligent question-answering method, system, controller and medium
CN111128128A (en) Voice keyword detection method based on complementary model scoring fusion
CN113793591A (en) Speech synthesis method and related device, electronic equipment and storage medium
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN114927128A (en) Voice keyword detection method and device, electronic equipment and readable storage medium
CN106384587A (en) Voice recognition method and system thereof
CN111428487B (en) Model training method, lyric generation method, device, electronic equipment and medium
CN115104151A (en) Offline voice recognition method and device, electronic equipment and readable storage medium
CN110708619B (en) Word vector training method and device for intelligent equipment
CN111862963B (en) Voice wakeup method, device and equipment
CN115132170A (en) Language classification method and device and computer readable storage medium
CN117892735B (en) Deep learning-based natural language processing method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination