CN113628612A - Voice recognition method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN113628612A
Authority
CN
China
Prior art keywords
feature map
sub
voice
network
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110292651.2A
Other languages
Chinese (zh)
Inventor
杨晨
文学
姚丽晓
刘梦倩
张晨浩
宋黎明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Samsung Telecom R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to PCT/KR2021/005732 (WO2021225403A1)
Publication of CN113628612A
Legal status: Pending


Classifications

    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G06F16/65 Clustering; Classification (information retrieval of audio data)
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L2015/088 Word spotting
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The embodiments of the present application provide a voice recognition method, a voice recognition device, an electronic device, and a computer-readable storage medium, relating to the technical field of signal processing. The method comprises the following steps: acquiring a voice segment to be processed; and recognizing the voice segment based on a recognition network, i.e., extracting a feature map of the voice segment based on the recognition network and classifying the extracted feature map to obtain the probability that the voice segment contains a keyword. The voice recognition method provided by the embodiments of the present application can effectively reduce the amount of computation in the recognition process and improve recognition efficiency. The voice recognition method, voice recognition device, electronic device, and computer-readable storage medium provided in the embodiments of the present application may be implemented by an artificial intelligence (AI) model.

Description

Voice recognition method and device, electronic equipment and computer readable storage medium
Technical Field
The present application relates to the field of signal processing technologies, and in particular, to a speech recognition method, an apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of artificial intelligence technology, human-machine interaction modes have become more diversified. Voice interaction, as a new human-machine interaction mode, is popular with a large number of users in the market. It can be applied in various fields, for example voice wake-up: when a user speaks to a smart device, the device recognizes the voice, and if the voice includes a predefined wake-up word, the device is woken up so that the user can further issue other instructions. It can also be extended to other fields, such as voice command control of smartphones, smart watches and smart homes, and sensitive-vocabulary detection in broadcast and television programs.
Most existing speech recognition technology adopts deep learning, mainly using a deep residual neural network to determine the posterior probability that each frame of the speech belongs to each class, so as to obtain the probability that the whole audio segment contains a keyword; this approach still needs to be optimized.
Disclosure of Invention
The present application provides a voice recognition method, a voice recognition device, an electronic device, and a computer-readable storage medium, which are intended to solve the problem of how an intelligent voice device can respond more accurately when interacting with a user. The technical solution is as follows:
in a first aspect, a speech recognition method is provided, which includes:
acquiring a voice segment to be processed;
and extracting the feature maps of the voice segments, and classifying the extracted feature maps to obtain the probability that the voice segments contain the keywords.
In a second aspect, there is provided a speech recognition apparatus, the apparatus comprising:
the acquisition module is used for acquiring a voice segment to be processed;
and the recognition module is used for extracting the feature maps of the voice segments and classifying the extracted feature maps to obtain the probability that the voice segments contain the keywords.
In a third aspect, an electronic device is provided, which includes:
one or more processors;
a memory;
one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors to perform the operations corresponding to the voice recognition method according to the first aspect.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement the speech recognition method as set forth in the first aspect.
The beneficial effects brought by the technical solution provided by the present application are as follows:
compared with the prior art, the method and the device for recognizing the voice, the electronic equipment and the computer-readable storage medium have the advantages that the characteristic graphs of the voice fragments are extracted through the recognition network, the extracted characteristic graphs are classified, the calculation amount in the recognition process can be effectively reduced, and the recognition efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a voice wake-up scenario;
FIG. 2 is a diagram illustrating a deep residual neural network in the prior art;
FIG. 3 is a schematic diagram of a residual error unit in the prior art;
FIG. 4 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an identification network according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a feature map processing sub-network according to an example of the present application;
FIG. 8 is a schematic diagram of depth-wise convolution according to an embodiment of the present application;
FIG. 9a is a schematic diagram of a feature map processing subnetwork provided in an example of the present application;
FIG. 9b is a schematic diagram of a feature map processing sub-network according to another example of the present application;
FIG. 10 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a scheme for classifying a second feature map according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a structure of a multi-scale classifier provided in an example of the present application;
FIG. 13 is a schematic diagram of a multi-scale classifier provided in an example of the present application;
FIG. 14 is a schematic diagram of a structure of a multi-scale classifier provided in an example of the present application;
FIG. 15 is a schematic illustration of a scheme for speech recognition provided in an example of the present application;
FIG. 16 is a schematic diagram of a speech recognition scheme provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of a structure of a multi-scale classifier provided in an example of the present application;
FIG. 18 is a schematic diagram of a speech recognition model according to an example of the present application;
FIG. 19 is a schematic diagram of a speech recognition model according to an example of the present application;
FIG. 20 is a schematic diagram of a speech recognition scheme provided in an embodiment of the present application;
FIG. 21 is a schematic diagram of a speech recognition scheme provided in an embodiment of the present application;
FIG. 22 is a schematic diagram of a speech recognition scheme provided by an embodiment of the present application;
FIG. 23 is a schematic diagram of a speech recognition scheme provided in an embodiment of the present application;
FIG. 24 is a schematic illustration of a scheme for determining a probability of including a keyword in an example of the present application;
FIG. 25 is a schematic diagram of a voice slide detection scheme provided by an embodiment of the present application;
FIG. 26 is a schematic diagram of a voice slide detection scheme in an example of the present application;
FIG. 27 is a schematic illustration of a speech recognition scheme provided by an example of the present application;
FIG. 28 is a schematic structural diagram of a DRU provided in an example of the present application;
FIG. 29 is a schematic diagram of the principle of a multi-scale classifier under different speaking styles and noisy environments according to an example of the present application;
FIG. 30 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
FIG. 31 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any elements and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The speech recognition technique will be described below in the context of the application of voice wakeup.
With the rapid development of artificial intelligence technology, human-machine interaction modes have become more diversified. Voice interaction, as a new human-machine interaction mode, is popular with a large number of users in the market. Voice interaction is controlled by the voice wake-up module: when a user speaks a predefined wake-up word to a smart device, the voice wake-up module detects the occurrence of the wake-up word and wakes up the device so that the user can further issue other instructions. The goal of the voice wake-up algorithm is therefore to detect keywords in a continuous audio stream; it can be applied to voice wake-up and can also be extended to other fields, such as voice command control of smartphones, smart watches and smart homes, and sensitive-vocabulary detection in broadcast and television programs.
Most existing advanced voice wake-up technology adopts deep learning. The specific process is shown in FIG. 1 and mainly comprises three steps:
feature engineering, a frame classifier based on a deep neural network, and post-processing. The input of the voice wake-up module is a piece of continuous speech (which may be an audio segment with a set duration, such as 1 second). Feature engineering, which may also be called feature extraction, generates a low-level representation of the input continuous speech: it converts the input into frame-by-frame fixed-length vectors, i.e., speech feature vectors, which may also be called a feature matrix. For example, the input 1-second audio segment is converted into a Mel Frequency Cepstrum Coefficient (MFCC) feature matrix.
Each frame feature vector is sequentially input into the frame classifier based on a deep neural network, which may also be referred to as an acoustic model, to obtain the posterior probability that the frame belongs to each class, i.e., the probability of containing each keyword/sub-word. For example, the obtained MFCC feature matrix is used as input, and the probability that the input continuous speech contains each keyword/sub-word is calculated.
The post-processing module summarizes the posterior probabilities of all frames over each class to obtain the probability that the whole audio segment contains a keyword, compares this probability with a preset threshold, and decides whether to wake up.
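As a rough illustration of this three-step pipeline, a minimal Python sketch follows; the librosa/PyTorch calls, the 13-dimensional MFCC setting, the softmax output, and the max-over-frames aggregation rule are illustrative assumptions rather than the exact implementation described here.

    import librosa
    import torch

    def wakeup_probability(audio, sample_rate, acoustic_model, threshold=0.8):
        # 1) Feature engineering: 1-second audio -> MFCC feature matrix (frames x 13)
        mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=13).T

        # 2) Frame classifier (acoustic model): per-frame posteriors over
        #    {keyword sub-words ..., filler}; acoustic_model is a hypothetical torch module
        with torch.no_grad():
            logits = acoustic_model(torch.from_numpy(mfcc).float().unsqueeze(0))
            posteriors = torch.softmax(logits, dim=-1).squeeze(0)   # (frames, n_classes)

        # 3) Post-processing: summarize frame posteriors into one keyword score
        #    (here: max over frames of the keyword classes), then compare with threshold
        keyword_score = posteriors[:, :-1].max().item()             # assume last class is filler
        return keyword_score, keyword_score > threshold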
Because the voice wake-up module needs to reside in the background of the device and run continuously with real-time detection, the model cannot be too large, the computational complexity cannot be too high, and a balance between low power consumption and high recognition rate must be maintained. The main complexity and almost all parameters of the voice wake-up module come from the frame classifier based on the deep neural network, so the performance of the frame classifier directly affects the overall performance. At present, the most advanced frame classifiers adopt a Deep Residual Neural Network (ResNet) structure; a specific network structure is shown in fig. 2. A deep residual neural network is composed of an initial convolutional layer, residual units and a fully connected classifier. The initial convolutional layer performs size compression and channel expansion on the input feature frames. After the initial convolution, a plurality of residual units are stacked together, extracting local and global features of the feature frames from shallow to deep. A Residual Unit (RU) is composed of convolutional layers and activation + normalization layers, and its output is added to its input through a shortcut connection. Referring to fig. 3, an RU may include a plurality of convolutional layers, each followed by an activation + normalization layer; the convolutional layers may be 3 × 3 convolutions, the activation layer may be a Rectified Linear Unit (ReLU) activation layer, and the normalization layer may be a Batch Normalization (BN) layer. The shortcut connection in the residual unit makes the residual unit tend to learn an identity mapping, so the convolutional layers in the residual unit tend to learn a zero mapping, which is easier to train. Finally, a fully connected classifier classifies the features extracted by the residual units to obtain the posterior probability of each class for the feature frame. The deep residual neural network can be trained to a large depth, which guarantees the accuracy of the model, and by controlling the number of convolution kernels, the network size and complexity can also be limited to a specified range.
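For reference, a minimal PyTorch sketch of the prior-art residual unit just described (two 3 × 3 convolutions, each followed by ReLU activation and batch normalization, plus a shortcut connection); the layer ordering and channel counts are assumptions for illustration.

    import torch
    import torch.nn as nn

    class ResidualUnit(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(channels),
                nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            # shortcut connection: the unit only has to learn the residual mapping
            return x + self.body(x)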
Although voice wake-up technology based on the deep residual neural network can meet the requirements on model size and recognition rate to some extent, several defects remain:
(1) The voice wake-up scheme based on the deep residual neural network has poor noise immunity. Voice wake-up based on the deep residual neural network works well in a quiet environment, but real scenes are often more complex, and users strongly need the voice wake-up function in noisy environments such as restaurants, offices and streets. In such environments, human voice and environmental noise are mixed together, so the wake-up model's ability to discriminate the wake-up word decreases and the wake-up rate drops. ResNet has poor robustness under noisy environments and different speaking styles.
(2) The voice wake-up scheme based on the deep residual neural network has poor speaker adaptability. A thousand people have a thousand ways of speaking: when different speakers utter speech containing the wake-up word, the speed, pitch and strength of the speech may differ, and even the same speaker utters the wake-up word differently in different physical states at different times. The voice wake-up scheme based on the deep residual neural network can only handle the average cases well and has difficulty with boundary cases. For example, when the speaker speaks too fast, the input speech segment has a fixed length of 1 second while the wake-up word occupies only 400 milliseconds, and the remaining 600 milliseconds are silence or noise; during post-processing the model will then be more inclined to judge this speech segment as silence or noise rather than as the wake-up word.
(3) The voice wake-up scheme based on the deep residual neural network uses only features of a single scale for recognition (all features of the whole audio segment are used), which introduces a lot of non-keyword information, reduces the model's ability to recognize the wake-up word, and thus lowers the wake-up rate. For example, the deep-residual-based network takes a 1-second audio segment as input, but the actual keyword may occupy only a very short and unknown time period within that segment.
(4) There is still room for further compression of the deep residual neural network. The parameters and complexity of the deep residual neural network mainly come from the stacked residual units, and the residual units use traditional convolution: several different convolution kernels perform sliding convolution over the input feature maps, and channel combination is performed after each sliding convolution, so there is a large amount of redundancy in both parameter count and computational complexity. Although the model size can be limited to some extent by reducing the number of convolution kernels, the accompanying reduction in the number of channels weakens the representation capability of the model.
The application provides a low-power-consumption voice wake-up scheme based on multi-scale convolution, which greatly improves the noise immunity and speaker adaptability of the model and improves network robustness. Depth-wise operations are used to compress the parameters of the convolutional layers of the residual units in the deep residual neural network, which greatly reduces the model size and computational complexity (for example, a 7-fold reduction can be achieved).
Compared with the prior art, the scheme has the following advantages:
(1) The noise immunity of this scheme is better. In this scheme, a multi-scale classifier is added after several intermediate convolutional layers. Because different intermediate layers have different receptive fields, the added multi-scale classifiers can classify sub-segments of the input speech segment with several different starting points and lengths, so the position of the wake-up word can be found accurately, the interference of background noise is eliminated, and a better classification effect is achieved. For example, if a 1-second segment of noisy audio is input and the wake-up word occupies 200 to 800 milliseconds, classifying with the entire extracted feature lets the noise in the leading and trailing 200 milliseconds interfere with the classification result, whereas the multi-scale classifier can try to classify using only the 200-to-800-millisecond sub-segment and thus obtain higher classification accuracy.
(2) The speaker adaptability of this scheme is better. In this scheme, a multi-scale classifier is added after several intermediate convolutional layers, and the presence of the keyword can be detected at several scales, so the scheme can adapt to changes in the speaker's speed and pitch. For example, if a 1-second segment of audio is input and, because the speaker speaks too fast, the wake-up word occupies only the portion from 200 to 600 milliseconds, the multi-scale classifier can attempt to classify using the 200-to-600-millisecond sub-segment, improving classification accuracy.
(3) The model compression effect of this scheme is better. In this scheme, depth-wise operations are used to compress the parameters of the convolutional layers of the residual units in the deep residual network, reducing several convolution kernels to one convolution kernel and compressing the model size by a factor of 7. In addition, the channel combination after sliding convolution is removed, the number of channels is not reduced, and the original representation capability of the model is maintained.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 4, a speech recognition method is provided, which may include the following steps:
step S401, acquiring a voice segment to be processed; in the embodiment of the present application, the speech segment may also be referred to as an audio segment.
Step S402, extracting a feature map of the voice segment based on the recognition network, and classifying the extracted feature map to obtain the probability that the voice segment contains the keyword.
Specifically, the voice to be recognized is obtained, at least one voice segment with preset duration is extracted from the voice, and for each voice segment, recognition is carried out based on a recognition network.
Wherein the recognition network may comprise a feature extraction sub-network, at least one feature map processing sub-network and a classifier.
Among them, the feature map processing sub-network may also be called a depth-wise Residual Unit (DRU) for extracting rich features of the input speech segment, and the structure of the DRU will be described in detail below.
As shown in FIG. 5, in one embodiment, the recognition network may include a feature extraction subnetwork, a feature map processing subnetwork (which may include an initial convolution and at least one DRU), and a classifier; the feature extraction sub-network may be an MFCC (Mel Frequency Cepstrum Coefficient) extraction network, or may be a PLP (Perceptual Linear prediction) feature extraction network.
In the recognition network of the present application, the residual units in the deep residual neural network can be replaced by DRUs, and the recognition network can also be called a depth-wise Residual Neural Network (DRN).
Specifically, a feature map in the voice segment is extracted through a feature extraction sub-network, then channel compression and depth-wise operation are carried out on the extracted feature map through at least one DRU, then channel recovery is carried out, and the feature map processed by the DRU is classified so as to identify the voice segment.
In the embodiment, the voice segments are identified through the identification network comprising at least one feature map processing sub-network, so that the computation amount in the identification process can be effectively reduced, and the identification efficiency is improved.
As shown in fig. 6, a possible implementation manner of the embodiment of the present application, the extracting, based on the recognition network, the feature map of the voice segment in step S402, and classifying the extracted feature map to obtain a probability that the voice segment contains the keyword may include:
step S410, extracting a first feature map of the voice segment based on the feature extraction sub-network in the recognition network;
step S420, processing the first feature map based on at least one feature map processing sub-network in the identification network to obtain at least one second feature map;
step S430, classifying at least one second feature map based on at least one classifier in the recognition network to obtain the probability that the voice segment contains the keyword.
Specifically, as shown in fig. 5, the voice segment may be input to the feature extraction sub-network to obtain a first feature map; the first feature map is input to the initial convolution, which projects the features into a plurality of channels; the output features of the initial convolution are input to at least one DRU; each DRU performs channel compression on its input, then the depth-wise operation, and finally channel recovery; and the result is then classified to obtain the probability that the voice segment includes the keyword.
Specifically, the output of each DRU in the at least one DRU may be used as the second feature map, the output of the last DRU in the at least one DRU may be used as the second feature map, and the output of each group of DRUs in the at least one DRU may be used as the second feature map, and a group of DRUs may include at least one DRU, which will be described in further detail below with reference to the embodiments.
Specifically, the output of the last DRU in the at least one DRU may be used as a second feature map, and the second feature map is input into a full-connection classifier for classification; the feature maps can also be input into corresponding multi-scale classifiers for classification.
The structure of each DRU will be described in detail below.
In one possible implementation of the embodiment of the present application, for any one of the at least one feature map processing sub-network, the feature map processing sub-network includes a first number of channel compression convolutions, a depth separable convolution and a second number of channel restoration convolutions.
Among them, the depth separable convolution may also be referred to as depth-wise convolution or depth-wise partial convolution.
In a possible implementation manner of the embodiment of the present application, the number of channels corresponding to the channel compression convolution of the feature map processing subnetwork is equal to the number of channels corresponding to the input feature map of the feature map processing subnetwork; the numerical value of the first quantity is smaller than the numerical value of the number of channels corresponding to the input feature map of the feature map processing sub-network;
the numerical value of the number of channels corresponding to the channel recovery convolution of the feature map processing sub-network is equal to the numerical value of the first number; the second number is equal to the number of channels corresponding to the profile of the input to the profile processing subnetwork.
In a possible implementation manner of the embodiment of the present application, the channel compression convolution may be a 1 × 1 convolution; the channel recovery convolution may also be a 1 × 1 convolution.
In other embodiments, the channel compression convolution and the channel recovery convolution may also have other sizes, for example, 2 × 2 convolution may be used, and when the convolution is 1 × 1 convolution, the compression effect on the number of channels is better, the amount of operation in the identification process may be more effectively reduced, and the identification efficiency may be improved.
In one example, as shown in fig. 7, the size of the first feature map input to the DRU may be (w, h, c), the size of the channel compression convolution may be (1,1, c), and the number of channel compression convolutions is c ', where c' is less than c, to achieve channel compression of the first feature map.
The number of channels of the middle depth-wise convolution is c ', c channel recovery convolutions can be set, and the size of each channel recovery convolution can be (1,1, c'), so that channel recovery is performed on the first feature map after channel compression.
In the above example, the DRU replaces the convolution layer in the original residual unit with three convolutions: a 1 × 1 convolution at the head, a depth-wise convolution in the middle, and a 1 × 1 convolution at the tail. The 1 × 1 convolutions serve to compress and recover the number of channels: the convolution at the DRU head (i.e., the channel compression convolution) compresses the number of channels of the input feature map, reducing the parameter count and complexity of the middle depth-wise convolution; the convolution at the tail (i.e., the channel recovery convolution) restores the channel number of the depth-wise convolution output to the channel number of the input feature map, ensuring that the input and output channel numbers are consistent. The middle depth-wise convolution needs only 1 convolution kernel, and each channel of that kernel slides on the corresponding feature map channel.
As shown in fig. 8, each channel of the convolution kernel of the depth-wise convolution slides on the corresponding feature map channel, and the convolution results of the channels are not combined. Each kernel computes the convolution on one particular channel of the input features, and the result is set directly as one output feature channel. Assuming the input and output features each have n channels, n convolution kernels of shape k_h × k_w × 1 are needed, so the total number of parameters is k_w × k_h × n. It is to be understood that FIG. 7 shows only one example; in other embodiments, the size of the channel compression convolution may also be (2, 2, c), (3, 3, c), and so on, and the smaller the size, the more it reduces the parameter count and computation of the depth-wise convolution.
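The parameter saving can be seen in a small sketch using PyTorch's grouped convolution (groups equal to the number of channels) as an assumed implementation of the depth-wise convolution:

    import torch.nn as nn

    c = 16                                                    # example channel count
    standard = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)
    depthwise = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False)

    # standard convolution: 3 x 3 x c x c parameters; depth-wise: 3 x 3 x c,
    # since each of the c kernels slides only on its own channel and the
    # per-channel results are not combined across channels
    print(sum(p.numel() for p in standard.parameters()))      # 3*3*16*16 = 2304
    print(sum(p.numel() for p in depthwise.parameters()))     # 3*3*16    = 144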
In another example, as shown in fig. 9a, the DRU may also perform stacking after three convolutions as shown in fig. 9a, that is, replace the convolution layer in the original residual unit with three convolutions, first perform 1 × 1 convolution, perform 1 depth-wise convolution in the middle, then perform 1 × 1 convolution, and after performing three convolutions, repeat three convolutions, that is, continue to perform 1 × 1 convolution, depth-wise convolution and 1 × 1 convolution.
It will be appreciated that the three convolutions shown in fig. 7 are stacked only twice as in fig. 9a, and in other examples, the three convolutions shown in fig. 7 may also be stacked repeatedly a plurality of times.
The position of the normalization layer in FIG. 9a can also be adjusted, which can be activation + normalization between depth-wise convolution and 1 × 1 convolution in FIG. 9 a; it is also possible to perform activation + normalization between two 1x1 convolutions as shown in fig. 9 b.
Although fewer parameters are used in the DRU, it is much deeper than the residual unit in the existing ResNet (2 convolutions -> 6 convolutions). After each convolutional layer there is a ReLU activation, which is a non-linear transformation, so the DRU performs 6 non-linear transformations, compared with only 2 ReLU-activated non-linear transformations in the ResNet residual unit, giving the DRU stronger representation capability. The convolution results of one kernel have strong correlation between different channels; audio is less complex than images, and the information on each channel is sufficient to accurately identify the keywords, so cross-channel information fusion does not seem necessary and leads to many redundant parameters.
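A minimal PyTorch sketch of a DRU as described above (1 × 1 channel-compression convolution, 3 × 3 depth-wise convolution, 1 × 1 channel-recovery convolution, each followed by a ReLU activation, plus the shortcut connection); the halving compression ratio and the normalization placement are assumptions taken from the examples in this description.

    import torch
    import torch.nn as nn

    class DRU(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            compressed = channels // 2                  # e.g. 16 -> 8, 32 -> 16, 48 -> 24
            self.body = nn.Sequential(
                nn.Conv2d(channels, compressed, kernel_size=1, bias=False),   # channel compression
                nn.ReLU(inplace=True),
                nn.Conv2d(compressed, compressed, kernel_size=3, padding=1,
                          groups=compressed, bias=False),                     # depth-wise convolution
                nn.ReLU(inplace=True),
                nn.Conv2d(compressed, channels, kernel_size=1, bias=False),   # channel recovery
                nn.ReLU(inplace=True),
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            # shortcut connection, as in the original residual unit
            return x + self.body(x)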
The following will explain, in conjunction with the embodiments and the accompanying drawings, a specific process for determining a probability that a speech segment contains a keyword based on a second feature map when the second feature map is the output of the last DRU.
In a possible implementation manner of the embodiment of the present application, the processing sub-network based on at least one feature map in the recognition network in step S420 processes the first feature map to obtain at least one second feature map, where the processing sub-network may include:
(1) inputting a first feature map into a plurality of feature map processing subnetworks which are stacked in sequence;
wherein, the input of the first feature map processing sub-network is the first feature map; for any one of the plurality of feature map processing subnetworks other than the first feature map processing subnetwork, the input is the output of the previous feature map processing subnetwork.
(2) And taking the output of the last feature map processing sub-network as a second feature map.
Specifically, in this embodiment, the output of the last feature map processing sub-network is used as the second feature map, and the second feature map is classified to obtain the probability that the speech segment includes the keyword.
Specifically, in the DRN (i.e., the recognition network), the first layer is still an initial convolution, followed by stacked DRUs. As the number of layers increases, the number of channels of the DRUs increases, and DRUs with the same number of channels are called a group of DRUs. As shown in fig. 10, fig. 10 is an example of a DRN composed of 3 groups of DRUs with 16, 32 and 48 channels respectively, and 3 DRUs in each group. Since this DRN contains 9 DRUs plus one initial convolution, it can be denoted as DRN10.
Fig. 10 shows only one example of a DRN, the number of groups of DRUs in the DRN is not limited to 3, and may be any natural number, and the number of DRUs in each group of DRUs is also not limited to 3, and the size and complexity of the DRN may be changed by controlling the number of DRUs and the number of channels in the group to adapt to different application scenarios.
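Under the assumption that the DRU sketch above is reused, the DRN10 example of fig. 10 could be assembled roughly as follows; the initial-convolution settings and the 1 × 1 projections used to raise the channel count between groups are illustrative assumptions.

    import torch.nn as nn

    def make_drn10(in_channels: int = 1) -> nn.Sequential:
        layers = [nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1)]  # initial convolution
        prev = 16
        for group_channels in (16, 32, 48):                # 3 groups of DRUs
            if group_channels != prev:
                # assumed 1x1 projection to raise the channel count between groups
                layers.append(nn.Conv2d(prev, group_channels, kernel_size=1, bias=False))
                prev = group_channels
            layers.extend(DRU(group_channels) for _ in range(3))   # 3 DRUs per group
        return nn.Sequential(*layers)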
In a possible implementation manner of the embodiment of the present application, the step S430 of classifying the at least one second feature map to obtain a probability that the voice segment includes the keyword may include:
(1) averaging and pooling the second feature map output by the last DRU;
(2) and inputting the average pooled second feature map into a full-connection classifier to obtain the posterior probability of each classification so as to determine the probability of the keyword contained in the voice segment.
Each class may correspond to the second feature map containing a different keyword, or to it containing a different sounding sub-unit of a word.
As shown in fig. 11, assume that the size of the second feature map input to the classifier is (w, h, n), where w is the width of the feature map, h is the height of the feature map, and n is the number of channels of the feature map. And (3) carrying out average pooling (averaging) on each channel of the second characteristic diagram with the size of (w, h) to obtain a vector with the length of n, and inputting the vector into a full-connection classifier to obtain the posterior probability of each classification.
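A minimal sketch of this classification step, where each channel of the (w, h, n) second feature map is average-pooled to one value and the resulting length-n vector is fed to a fully connected softmax classifier (the channel and class counts are assumed values):

    import torch
    import torch.nn as nn

    n_channels, n_classes = 48, 3                    # e.g. "hi", "Bixby", "oov" (assumed)
    fc = nn.Linear(n_channels, n_classes)

    def classify(feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (batch, n_channels, h, w) -> average-pool each channel to one value
        pooled = feature_map.mean(dim=(2, 3))        # (batch, n_channels)
        return torch.softmax(fc(pooled), dim=-1)     # posterior probability of each class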
The following will explain, with reference to specific embodiments and accompanying drawings, a specific process for determining a probability that a speech segment contains a keyword based on a second feature map when the output of each set of DRUs is taken as a corresponding second feature map.
In a possible implementation manner of the embodiment of the present application, the processing sub-network based on at least one feature map in the recognition network in step S420 processes the first feature map to obtain at least one second feature map, where the processing sub-network may include:
(1) inputting the first feature map into a plurality of groups of feature map processing sub-networks which are stacked in sequence; wherein any group of feature map processing subnetworks comprises at least one feature map processing subnetwork;
wherein, the input of the first group of feature map processing sub-network is a first feature map; for any one group of feature map processing sub-networks except the first group of feature map processing sub-networks in the plurality of groups of feature map processing sub-networks, inputting the output of the previous group of feature map processing sub-networks;
(2) and taking the output of each group of feature map processing sub-network as a second feature map corresponding to the feature map processing sub-network.
The difference between this embodiment and the above-mentioned embodiment is that the second feature map is set differently, and in the above-mentioned embodiment, only the output of the last DRU is taken as the second feature map, but in this embodiment, the output of each group of DRUs is taken as a corresponding second feature map, each group of DRUs includes at least one DRU, and the plurality of second feature maps are classified to obtain the probability that the speech segment includes the keyword, so that the second feature maps output by the DRUs in the middle layer can be used to effectively capture the details, and the overall control can be performed, thereby improving the accuracy of speech recognition.
In one example, as shown in fig. 12, the output of each set of DRUs can be used as a second feature map corresponding to the set of DRUs, and a multi-scale classifier is arranged behind each set of DRUs, so that the overall control can be realized and the details can be captured.
In one example, the architecture of the multi-scale classifier is shown in FIG. 13. "Multi-scale" has two implications: classification can be done at different receptive fields, and classification can use sub-regions of different sizes. A multi-scale classifier is added at the end of different layers so that classification is performed at different receptive fields. Within each multi-scale classifier, windows of several sizes (several sub-regions) are selected using the VAD result; the windows of different sizes are moved over the input feature map, and the feature-map sub-region covered by each window is average-pooled per channel (the average pooling in the corresponding figure) into a feature vector whose length equals the number of channels. Since the number of channels is the same for every sub-region, the weights in the multi-scale classifier can be shared, so the multi-scale classifier adds very few extra parameters and very little complexity.
If all candidate sub-regions were used for classification in each multi-scale classifier instead of selecting sub-regions with the VAD result, the results from sub-regions whose time intervals barely overlap the keyword could interfere with the final result, leading to poor robustness, and classifying features from more sub-regions would make the process more complex.
In one example, as shown in fig. 14, one softmax classifier is retained at the last level relative to the multi-scale classifier shown in fig. 12, wherein the softmax classifier is one of the fully-connected classifiers, and using fewer classifiers results in fewer parameters and multiplications relative to the multi-scale classifier shown in fig. 12, but without the multi-scale classifier, features of different scales cannot be obtained, resulting in reduced accuracy and robustness.
As shown in fig. 15, speech is input into the different groups of DRUs to obtain feature map 1, feature map 2 and feature map 3 at different depths, corresponding to different noise environments or speech rates. For fast speech, the features on a shallow feature map are more local and distinct, while the features on a deep feature map are more dispersed and blurred; therefore, classifying such speech on the shallow feature map gives better results. Arranging a multi-scale classifier accordingly makes the obtained classification results more accurate.
In a possible implementation manner of the embodiment of the present application, the classifying the at least one second feature map in step S430 to obtain a probability that the voice segment includes the keyword may include:
(1) for each second feature map in at least one second feature map, determining a sub-region corresponding to the sliding position of each window in the second feature map based on a plurality of windows with preset sizes;
(2) and classifying each sub-region of each second feature map to determine the probability of the speech segment containing the keywords.
Specifically, as shown in fig. 16, for each second feature map (corresponding to the feature map output by the DRU group shown in the figure), a window with a preset size may be set, and the window slides synchronously on each channel with a preset step size, each window with a preset size can be mapped to a specific sub-segment (i.e. sub-segment or sub-region) of the input audio when sliding to a specific sliding position, and then a plurality of sub-segments of the audio segment are classified according to the obtained local feature map in the window.
For example, for a feature map of size (w, h, c), windows of size (w', h) with several preset widths w' ≤ w can be selected and slid synchronously on each channel with a step size s. After sliding t steps with step size s, a window of size (w', h) can be mapped to the sub-region [T·t·s/w, T·(t·s + w')/w] of the speech segment,
where T is the duration of the speech segment.
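A small helper illustrating this mapping, under the assumption that a feature map of width w covers the whole T-second segment uniformly:

    def window_to_time_interval(t: int, s: int, w_prime: int, w: int, T: float):
        """Map a window of width w_prime that has slid t steps of size s on a
        feature map of width w to the corresponding time interval of the
        T-second speech segment."""
        start = T * (t * s) / w
        end = T * (t * s + w_prime) / w
        return start, end

    # example with assumed numbers: w = 30 feature columns, window width 15,
    # step 5, after 2 steps on a 1-second segment -> roughly [0.33 s, 0.83 s]
    print(window_to_time_interval(t=2, s=5, w_prime=15, w=30, T=1.0))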
In the above embodiment, the preset sizes of the plurality of windows and the corresponding slide positions may be preset by the user, and in another embodiment, the preset size may be determined from a plurality of candidate sizes of the windows, and the corresponding slide position may be determined.
In one example, as shown in fig. 17, DRN8 indicates that this DRN is constructed based on ResNet8, which consists of 8 layers (1 initial convolution + 6 stacked 3x3 convolutions + 1 softmax).
Input size: 98 × 13 (98 frames). The initial convolution increases the number of channels to 16 and reduces the input size to 30 × 15. There are 3 stacked DRUs, with input channel numbers of 16, 32 and 48 and output channel numbers of 32, 48 and 48, respectively. Each DRU contains two 3x3 depth-separable convolutions, each preceded and followed by a 1x1 channel convolution: the 1x1 convolution before each 3x3 depth-separable convolution halves the number of channels (16->8, 32->16, 48->24), and the 1x1 convolution after it restores the number of channels (the last 1x1 convolution in each DRU restores the number of channels to the input channel number of the next DRU). A 1x1 convolution is also used to increase the number of channels of the shortcut (16->32, 32->48). At the end of each DRU a multi-scale classifier is added, in which 10 candidate sub-regions are designed as described previously, covering different starting points and lengths within the input interval [0, T].
As shown in fig. 18, fig. 18 is a schematic diagram of an overall structure of a speech recognition model of the present application in one example.
1) Acquiring a Voice segment to be processed (such as a 1-second Voice segment), and performing Voice endpoint Detection (VAD) on the input Voice segment, wherein a rough keyword segment [ s, e ] can be obtained through VAD;
2) if the voice fragment is valid voice, extracting the characteristics of the voice fragment to obtain a first characteristic diagram, such as a 13-dimensional MFCC characteristic matrix;
3) inputting a first feature map into a DRN, wherein the DRN comprises an initial volume and a plurality of feature map processing sub-networks which are stacked in sequence, namely a DRU;
4) the second feature maps output by each set of DRUs are input to a corresponding multi-scale classifier to determine the probability that each keyword (or keyword) is contained in the speech segment. The network of DRNs and multi-scale classifiers can be referred to as an acoustic model.
5) The post-processing module makes a final decision, i.e., whether a keyword is detected, based on the output of the multi-scale classifier.
The feature extraction and initial convolution structure shown in the figure is the structure existing in the existing model, and other VAD, DRU and multi-scale classifier, post-processing are the improved model structure of the present application.
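A high-level sketch of the flow in fig. 18; the vad, extract_mfcc, initial_conv, dru_groups, ms_classifiers and postprocess components are hypothetical placeholders for the modules described above.

    def recognize(speech_segment, vad, extract_mfcc, initial_conv, dru_groups,
                  ms_classifiers, postprocess, threshold=0.8):
        # 1) voice endpoint detection: coarse keyword segment [s, e]; skip non-speech input
        segment = vad(speech_segment)
        if segment is None:
            return False
        # 2) feature extraction: first feature map (e.g. 13-dimensional MFCC matrix)
        feature_map = initial_conv(extract_mfcc(speech_segment))
        # 3) stacked DRU groups; 4) each group's output (a second feature map) goes to
        #    its own multi-scale classifier
        probabilities = []
        for dru_group, classifier in zip(dru_groups, ms_classifiers):
            feature_map = dru_group(feature_map)
            probabilities.append(classifier(feature_map, segment))
        # 5) post-processing: final decision on whether a keyword was detected
        return postprocess(probabilities) > threshold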
As shown in fig. 19, the feature map processing sub-network of fig. 18 is further described in fig. 19.
The feature map processing sub-network comprises an initial convolution and stacked DRUs; the DRU replaces the original residual unit of ResNet. In detail, each 3x3 convolutional layer in the ResNet residual unit is replaced with 3 layers: a 1x1 convolution, a 3x3 depth-separable convolution, and another 1x1 convolution, each followed by a ReLU activation. Each DRU is a feature extractor that outputs a feature representation, i.e., a second feature map (shown as a feature map in the figure), at a particular receptive field; deeper DRUs contain more channels and extract more global, richer and more complex features.
In a possible implementation manner of the embodiment of the present application, the classifying the at least one second feature map in step S430 to obtain a probability that the voice segment includes the keyword may include:
(1) performing voice endpoint detection on the voice segments, and determining an effective voice area in the first feature map;
(2) determining a window matched with the effective voice area from a plurality of windows with candidate sizes, and determining a corresponding sliding position based on the effective voice area;
(3) for each second feature map in the at least one second feature map, determining a sub-region corresponding to the sliding position of each window in the second feature map according to the determined window and sliding position;
(4) and classifying each sub-region of each second feature map to determine the probability of the speech segment containing the keywords.
Specifically, the audio segment may be subjected to filter denoising and Voice endpoint Detection (VAD), and if the audio segment is valid Voice, an approximate position including a keyword in the Voice segment, that is, a valid Voice region, is determined; then, several windows having a high degree of overlap with the valid speech region are selected from among windows of various candidate sizes, and the optimum position of each selected window, i.e., the position having the highest degree of overlap with the valid speech region, is determined, and based on the determined size and position of the window, the preset size and the corresponding slide position are determined.
As shown in fig. 20, a keyword grabber may be used to perform filter denoising and voice endpoint detection on a voice frequency segment, determine an effective voice rough segment in the voice segment, that is, an effective voice region, match a plurality of candidate windows with the effective voice region, determine an optimal matching window and an optimal sliding position, that is, determine a preset size and a corresponding sliding position of the window, obtain a corresponding local feature map in the second feature map, that is, obtain a corresponding sub-region in the second feature map, and classify the obtained local feature maps.
The following will further describe a process of specifically classifying corresponding sub-regions in the second feature map with reference to the drawings and the embodiments.
In a possible implementation manner of the embodiment of the present application, classifying each sub-region of each second feature map may include:
a. performing average pooling on each sub-region in the second feature map;
b. and respectively inputting each sub-region after the average pooling into a corresponding full-connection classifier for classification to obtain the posterior probability of each sub-region belonging to each preset class.
The parameters of the corresponding fully connected classifiers may be different or the same. If the parameters corresponding to the fully connected classifier corresponding to each sub-region are the same (i.e., weight sharing), the number of the parameters of the classifier can be reduced.
Specifically, as shown in fig. 21, if the preset sizes and sliding positions of the windows are set directly, then when a window of each preset size slides to a specific position, the feature points inside the window are average-pooled per channel to obtain a vector of length c, and the vector is input into the weight-shared fully connected classifier to obtain the posterior probability that the corresponding sub-segment of the audio segment belongs to each class. In this way, windows of different sizes slide over different positions, and sub-segments of the audio segment with different starting positions and lengths are classified, achieving multi-scale classification. Using the multi-scale classifier on feature maps of different receptive fields can capture the features of each sub-segment from local to global, improving classification accuracy.
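A minimal PyTorch sketch of this multi-scale classification: windows of several preset widths slide along the time axis of the second feature map, each window is average-pooled per channel into a length-c vector, and a single weight-shared fully connected classifier scores every resulting sub-region (the window fractions and step are assumed values).

    import torch
    import torch.nn as nn

    class MultiScaleClassifier(nn.Module):
        def __init__(self, channels: int, n_classes: int,
                     window_fracs=(0.5, 0.75, 1.0), step: int = 2):
            super().__init__()
            self.window_fracs = window_fracs
            self.step = step
            self.fc = nn.Linear(channels, n_classes)     # weights shared across all windows

        def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
            # feature_map: (batch, channels, height, width); width is taken as the time axis
            b, c, h, w = feature_map.shape
            posteriors = []
            for frac in self.window_fracs:
                w_prime = max(1, int(round(w * frac)))
                for start in range(0, w - w_prime + 1, self.step):
                    sub = feature_map[:, :, :, start:start + w_prime]
                    pooled = sub.mean(dim=(2, 3))                     # (batch, channels)
                    posteriors.append(torch.softmax(self.fc(pooled), dim=-1))
            # one posterior vector per (window size, position) sub-region
            return torch.stack(posteriors, dim=1)                     # (batch, n_subregions, n_classes)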
As shown in fig. 22, in one example, 10 sub-regions are preset as multi-scale windows (the candidate regions in the corresponding figure), each sub-region representing a different receptive field on the input feature map. The 10 sub-regions [s', e'] are compared with the coarse keyword region [s, e] detected by VAD, and the sub-regions whose overlap degree with it is greater than a preset threshold (e.g. greater than 60%) are selected, i.e., the sub-regions [s', e'] satisfying overlap([s, e], [s', e']) > 0.6, where overlap([s, e], [s', e']) represents the degree of overlap between sub-region [s, e] and sub-region [s', e']. For example, two of the candidate sub-regions may be selected in this way (these may also be referred to as the matched intervals). The feature map is then cropped according to the selected sub-regions, and the cropped features are fed into softmax classifiers of the corresponding scales to predict the keywords, yielding keyword probabilities at different scales; since the weight parameters of the different classifiers are shared, the model parameters and the amount of computation can be reduced.
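A small sketch of this selection rule; the overlap measure used here (length of the intersection divided by the length of the candidate interval) is an assumption, since only overlap([s, e], [s', e']) is named above.

    def overlap(seg, cand):
        """Assumed overlap measure: length of intersection / length of the candidate."""
        (s, e), (s2, e2) = seg, cand
        inter = max(0.0, min(e, e2) - max(s, s2))
        return inter / (e2 - s2) if e2 > s2 else 0.0

    def select_subregions(vad_segment, candidates, threshold=0.6):
        # keep the candidate sub-regions [s', e'] whose overlap with the coarse
        # keyword region [s, e] detected by VAD exceeds the threshold
        return [c for c in candidates if overlap(vad_segment, c) > threshold]

    # example with assumed candidate intervals over a 1-second segment
    candidates = [(0.0, 0.5), (0.25, 0.75), (0.5, 1.0), (0.0, 1.0)]
    print(select_subregions((0.2, 0.8), candidates))    # -> [(0.25, 0.75)]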
As shown in fig. 23, a keyword grabber may also be used to perform filter denoising and voice endpoint detection on the audio segment and determine the rough valid-speech segment, i.e., the valid speech region. Several candidate windows are matched against the valid speech region to determine the best-matching window and the optimal sliding position, i.e., the preset size and corresponding sliding position of the window. The corresponding local feature maps, i.e., the corresponding sub-regions, are then obtained from the second feature map, each sub-region is average-pooled, and the pooled vectors are input into the fully connected classifier to obtain the posterior probability that the corresponding sub-segment of the audio segment belongs to each class.
In a possible implementation manner of the embodiment of the present application, in the fully-connected classifier corresponding to each sub-region, parameters of at least two fully-connected classifiers are the same.
Specifically, in the fully connected classifier corresponding to each sub-region, parameters of a part of the fully connected classifiers may be the same, or parameters of all the fully connected classifiers may be the same.
In fig. 21 and 23, when windows of different sizes slide on the same feature map, parameters of the fully-connected classifier at each position are shared, that is, internal weights of the multi-scale classifier are shared, so that the number of parameters of the classifier can be reduced.
The specific process of determining the probability that the speech segment contains the keyword when the number of the keywords is different will be further described below with reference to the embodiment.
In a possible implementation manner of the embodiment of the application, the number of the keywords is less than the preset classification number; step S430 of classifying each sub-region of each second feature map to determine a probability that the speech segment includes the keyword may include:
(1) for each second feature map, determining the posterior probability of each sound generating subunit of which each subregion contains a keyword in the second feature map;
(2) determining the confidence level of each sub-region containing a keyword based on the posterior probability of each sound production subunit of each sub-region containing the keyword;
(3) and taking the highest confidence coefficient in the confidence coefficients of the keywords contained in each subarea in each second feature map as the probability that the speech segment contains the keywords.
Specifically, if the number of the keywords is smaller than the preset number of classifications, for example, the preset number of classifications is 2, and if the number of the keywords is 1, when classifying each sub-region of each second feature map, the classification is set to include different sound subunits in the keyword.
For example, if the keyword is "hi Bixby", for a sub-region of a second feature map, the posterior probability of "hi" in the sub-region may be determined to be 0.4, and the posterior probability of "Bixby" may be determined to be 0.8, then the confidence of the whole keyword may be determined, for example, according to the following formula
Figure BDA0002982926580000181
Determining the confidence of the whole keyword; the highest confidence level that the speech segment contains the keyword in all the sub-regions of all the second feature maps can be used as the confidence level that the speech segment contains the keyword.
Taking fig. 24 as an example, if the keyword is "hi Bixby", three categories may be set: containing "hi", containing "Bixby", and "oov", that is, containing neither. Taking two sub-regions as an example, the probabilities of containing "hi", containing "Bixby", and "oov" are determined for each of the two sub-regions, then the confidence that each sub-region contains the whole keyword is determined, and finally the probability that the speech segment contains "hi Bixby" is determined.
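A minimal sketch of this confidence fusion is given below; the geometric mean is used only as one plausible fusion rule (the exact formula of the embodiment is not reproduced here), and the additional probability values are illustrative.

```python
def keyword_confidence(subunit_posteriors):
    """Fuse sounding-subunit posteriors into one keyword confidence.
    The geometric mean is an assumed fusion rule, not the formula of the original text."""
    prod = 1.0
    for p in subunit_posteriors:
        prod *= p
    return prod ** (1.0 / len(subunit_posteriors))

# One sub-region of one second feature map: p("hi") = 0.4, p("Bixby") = 0.8.
conf = keyword_confidence([0.4, 0.8])      # about 0.57

# The segment-level probability is the highest confidence over all sub-regions
# of all second feature maps (the other values below are hypothetical).
segment_probability = max([conf, 0.30, 0.45])
```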
In a possible implementation manner of the embodiment of the application, the number of the keywords is greater than or equal to the preset classification number; step S430 of classifying each sub-region of each second feature map to determine a probability that the speech segment includes the keyword may include:
(1) determining a first posterior probability that each subregion of each second feature map is of a positive class;
if any sub-region comprises any sounding subunit in the plurality of keywords, the sub-region is of a positive type;
(2) and determining the probability of containing the keywords in the voice segment based on the maximum first posterior probability.
Specifically, if the number of the keywords is set to be large, the confidence of a corresponding keyword does not need to be determined according to the posterior probability of the sound generation subunit, but the first posterior probability that each sub-region of each feature map is of the positive type is directly calculated, so that the probability that the voice segment contains the keywords can be determined, and the calculation amount can be effectively reduced.
Specifically, the maximum first posterior probability may be used as the probability that the speech segment contains the keyword.
In a possible implementation manner of the embodiment of the application, the number of the keywords is greater than or equal to the preset classification number; step S430 of classifying each sub-region of each second feature map to determine a probability that the speech segment includes the keyword may include:
(1) determining a first posterior probability that each subregion of each second feature map is of a positive class;
(2) determining a second posterior probability that each subregion of each second feature map is of an inverse class;
if any sub-region comprises any sounding subunit in the plurality of keywords, the sub-region is of a positive type; if any sub-region does not comprise any sounding subunit in the plurality of keywords, the sub-region is of the reverse type;
(3) and determining the probability of containing the keywords in the voice segment based on the maximum first posterior probability and the maximum second posterior probability.
Specifically, if the number of the keywords is set to be large, the confidence of a corresponding keyword is not determined according to the posterior probability of the sound generation subunit, but a first posterior probability that each subregion of each feature map is of a positive type is directly calculated, a second posterior probability that each subregion of each feature map is of a negative type is determined at the same time, and if the maximum first posterior probability is greater than the maximum second posterior probability, the maximum first posterior probability is used as the probability that the speech segment contains the keyword.
In the application scenario of voice wakeup, a specific construction process of the recognition model of the present application will be described below; that is, after the recognition model of the present application recognizes a voice segment, voice wakeup of an intelligent terminal can be performed based on the probability that the segment contains a keyword.
The identification model (hereinafter, also referred to as a model) may be built with reference to fig. 12, but the specific model has differences in different application scenarios. Two main considerations are:
(1) size of the model: the number of DRU groups in the DRN, the number of DRUs in each group and the number of channels;
(2) input and output: the size of the input time-frequency matrix and the number of output nodes.
(1) Design of model size
The difference of the model sizes mainly comes from different awakening devices, the model should be designed to be as small as possible for devices with limited computing resources, such as a smart phone, a watch and the like, and the model can be designed to be relatively large for devices with sufficient computing resources, such as a sound box, a vehicle-mounted system and the like.
Case 1: model building on low-power-consumption equipment such as smart phone and watch
Since the computing resources of the device are limited, and the wake-up model needs to reside in the background, the memory consumption and the computing complexity of the model are reduced as much as possible, and both the number of DRU groups and the number of channels are designed to be as small as possible. For example, a design may be adopted in which: there are 3 DRU groups, and the number of channels is 16, 32, and 48. Each group contains 3 DRUs, the channel compression is half of the input in depth-wise operation, the convolution scale is 3x3, and the total parameter amount of the DRU part is about 12K and is within the bearing range of the equipment.
Case 2: model building on equipment with sufficient computing resources, such as sound box and vehicle-mounted system
The computing resources of such devices are sufficient relative to those of a mobile phone or a watch, so the size of the model can be increased appropriately to improve the recognition accuracy. For example, a design may be adopted in which: there are 5 DRU groups with 16, 32, 48, 64 and 80 channels in sequence, each group contains 4 DRUs, the channels are compressed to half of the input for the depth-wise operation, the convolution scale is 3x3, and the total parameter amount of the DRU part is about 60.6K.
(2) Input and output design
The input to the model is a speech segment, so the size of the input tends to depend on the utterance length. The output size of the model, i.e. the classification number of the multi-scale classifier, depends on the number of the wake-up words or the number of the sound units.
Case 1: voice wakeup of a single wakeup word
The input length of a single wake-up word depends on the maximum possible utterance length, which is usually an integer value for ease of processing. For example, the wake-up word hi Bixby, the pronunciation length is usually in the range of 500ms to 800ms according to the pronunciation habit of the speaker, and in order to cover all situations, the input length should be greater than 800ms, such as 1 s.
For a single wake-up word, the sound unit is generally split into multiple subunits during classification. For example, hi Bixby can be split into the subunits hi and Bixby, which are both positive classes. In addition to the positive classes, one or two negative classes are added to represent everything other than hi and Bixby. If a single negative class oov is added, the output size is 3, representing the posterior probabilities that the input segment belongs to the three categories hi, Bixby and oov. Oov may be further broken down into a filler class, representing words other than hi and Bixby, and a silence class, representing background noise, in which case the output size is 4.
Case 2: voice wakeup of multiple wakeup words, voice control
Voice wakeup of multiple wakeup words and voice control are characterized in that the existence of a few wakeup words is detected. For example, for voice wake of multiple wake words, multiple wake words such as hello Bixby, hi Bixby, ok Bixby, etc. may be supported; for voice control, various commands such as playing music, opening photo album, locking screen, etc. can be supported.
Under such a scenario, the input size is larger than the pronunciation length of the longest wake up word or command, typically taking an integer. The output size is n +1 or n +2, wherein n is the number of the awakening words or commands, and 1 or 2 is the number of the inverse classes.
Case 3: sensitive word detection
The sensitive word detection is more complicated than voice awakening and voice control, so the sensitive word is usually a set containing a plurality of words, if the classification number is determined according to the specific number of the words, the output nodes of the classifier are too many, and the classification effect is greatly reduced. Therefore, the sensitive words can be clustered according to pronunciation similarity by using a classical clustering algorithm such as k-Means to obtain n positive classes, and the output size of the multi-scale classifier is n +1 or n +2, wherein 1 or 2 is the number of the negative classes.
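A sketch of this clustering step is shown below, assuming that each sensitive word has already been mapped to some pronunciation embedding (how the embeddings are produced is not specified here); the embedding size and cluster count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical pronunciation embeddings for a set of sensitive words, one row per word.
pron_embeddings = np.random.rand(500, 32)

n_positive = 8                                   # assumed number of positive classes n
kmeans = KMeans(n_clusters=n_positive, random_state=0).fit(pron_embeddings)

word_labels = kmeans.labels_                     # cluster index assigned to each sensitive word
output_size = n_positive + 1                     # n positive classes + 1 negative class (or + 2)
```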
In the following, a training process of the recognition model of the present application in a voice wake-up application scenario, in which the keyword may also be referred to as a wake-up word, will be described.
The model training comprises the steps of training data collection and labeling, feature extraction, training and the like.
(1) Collection, processing and labeling of training data
The aim of collecting the training data is to fully collect various pronunciation characteristics of the keywords, so that the model can better learn the pronunciation characteristics of the keywords. To increase the diversity of the training data, it is also necessary to normalize and process the collected data. Finally, the processed data is labeled, so that the model knows which keyword each audio segment contains.
Case 1: voice wakeup of a single wakeup word
When collecting data, 200 speakers, 100 male and 100 female, can be selected, and each person repeats the wake-up word 30 times in a quiet environment with an interval of more than 2 seconds between repetitions. The continuous recording of each person is then cut into 30 segments with a sound cutting tool, each segment 1 second long with the same amount of leading and trailing silence, which yields 6000 audios of 1 second length. The 6000 audios are noised with 5 different noises (cafe, office, tv, subway, music) and 4 different signal-to-noise ratios (5dB, 10dB, 15dB, 20dB) to obtain 120000 audios, which form the positive sample set. Each audio is then recognized with an Automatic Speech Recognition (ASR) model to obtain the start-stop position of the wake-up word in it. Taking hi Bixby as an example, the segment identified as hi is labeled 1, the segment identified as Bixby is labeled 2, and the remaining segments are labeled 0.
For the negative sample set, 5 different noises (same as above) can be collected by a recording device for 10 hours respectively, and 24000 segments with the length of 1 second are randomly and repeatedly cut out from each noise audio for a total of 120000, so as to be used as the negative sample set. All segments of the negative sample set are labeled 0.
Case 2: voice wakeup of multiple wakeup words, voice control
In this scenario, the system includes a plurality of wake-up words or commands, and each wake-up word or command is collected in the same manner as in case 1. Because of too many wake-up words or commands, 5 kinds of noise and 4 kinds of signal-to-noise ratios cannot be fully used during noise addition, otherwise the data set is increased by 20 times, and the training is too slow. The strategy can be adjusted such that a noise and a signal-to-noise ratio are randomly selected for each audio to add noise such that the size of the data set remains the same. When in marking, the ASR model is used for obtaining the initial position of the awakening word, then the awakening word is numbered from 1 to n, the segment identified as the awakening word is marked as the corresponding number of the awakening word, and the rest segments are marked as 0.
The scheme of negative sample set collection, processing and labeling is the same as that of case 1.
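A sketch of the noise-addition strategy for the multi-keyword case is given below: each audio receives one randomly chosen noise type and signal-to-noise ratio so that the data set size stays the same. The mixing function and the stand-in signals are assumptions; only the noise types and SNR values come from the text above.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech clip at the requested signal-to-noise ratio."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise

noise_types = ["cafe", "office", "tv", "subway", "music"]
snrs_db = [5, 10, 15, 20]

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                                  # stand-in 1 s utterance
noise_bank = {n: rng.standard_normal(16000) for n in noise_types}    # stand-in noise clips

chosen_noise = str(rng.choice(noise_types))
chosen_snr = float(rng.choice(snrs_db))
noisy = add_noise_at_snr(speech, noise_bank[chosen_noise], chosen_snr)
```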
Case 3: sensitive word detection
The repeated recording method of cases 1 and 2 cannot be adopted when collecting audio, because the number of sensitive words is too large. Typical sensitive word corpora are specific segments of hundreds of hours of continuous speech (e.g., a broadcast program). When collecting data, the audio sections containing the sensitive words are cut out at a fixed length. The resulting data set is not subjected to noise addition, because the broadcast program itself carries some background noise. The sensitive word data set is then clustered into n classes with the k-Means algorithm according to pronunciation characteristics, and the label of each audio is the serial number of its class.
(2) Feature extraction
The method of feature extraction is relatively fixed and does not differ between the schemes. Common speech signal features are Mel-Frequency Cepstral Coefficients (MFCC), F-Bank features, Perceptual Linear Prediction coefficients (PLP), Linear Prediction Cepstral Coefficients (LPCC), and so on.
In the process of feature extraction, a sliding window of 30ms is usually used to slide in 10ms step length, and when the sliding window slides to a position, windowing, Fourier transformation and other operations are performed on sampling points in the window to obtain a fixed-length vector, namely a feature frame. Taking the MFCC feature as an example, the static MFCC feature only takes 13 dimensions, and sometimes a first-order and a second-order difference are added to improve the training effect, and the MFCC becomes 39 dimensions. If the energy dimension is added, the dimension is 40. Generally speaking, the same task is trained by using different kinds of features, and the training effect is not very different.
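A sketch of a 40-dimensional front end (13 static MFCCs + first- and second-order differences + 1 energy dimension, 30 ms window, 10 ms step) is shown below using librosa; the sampling rate and the use of librosa are assumptions, not requirements of the text.

```python
import numpy as np
import librosa

def extract_features(wav_path):
    """40-dim frame features: 13 MFCCs + 13 deltas + 13 delta-deltas + 1 log-energy."""
    y, sr = librosa.load(wav_path, sr=16000)
    n_fft = int(0.030 * sr)          # 30 ms window
    hop = int(0.010 * sr)            # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)
    d1 = librosa.feature.delta(mfcc)
    d2 = librosa.feature.delta(mfcc, order=2)
    energy = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    log_energy = np.log(energy + 1e-10)
    return np.vstack([mfcc, d1, d2, log_energy])   # shape: (40, num_frames)
```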
(3) Training method and training strategy
The training of the DRN is supervised training, and the training mode is error back propagation. Assume that the label of a certain piece of input audio is denoted GT and the start and stop times of the wake-up word are s and e, respectively. Assume that the DRN has N groups in total, so the number of multi-scale classifiers is also N, and each multi-scale classifier contains the sliding positions of M windows of different scales, where the jth sliding position of the ith classifier can be mapped to the segment [s_ij, e_ij] of the input audio. Assuming the classification number of the DRN is C, the error function of the segment sums, over all classifiers i, sliding positions j and classes c, the terms −I_c(i, j)·log p_c(i, j), where p_c(i, j) denotes the posterior probability that the jth sliding position of the ith multi-scale classifier belongs to the c-th classification, and I_c(i, j) is an indicator function that equals 1 when the class assigned to the segment [s_ij, e_ij] (according to its overlap with the labeled keyword region [s, e]) is c, and 0 otherwise; overlap calculates the degree of overlap of the two regions.
The degree of overlap between the region [s, e] and the region [s_ij, e_ij] can be calculated as the intersection-over-union of the two intervals:

IoU([s, e], [s_ij, e_ij]) = (min(e, e_ij) − max(s, s_ij)) / (max(e, e_ij) − min(s, s_ij)),

taken as 0 when the two intervals do not intersect. Here IoU([s, e], [s_ij, e_ij]) represents the degree of overlap between the region [s, e] and the region [s_ij, e_ij], i.e., overlap([s, e], [s_ij, e_ij]); s and e respectively represent the start and end times of the keyword segment of the input audio; s_ij and e_ij indicate the start and end times of the jth sub-region in the ith classifier.
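Since the exact error function is only given as an image in the original, the sketch below is an assumed cross-entropy-style instance of it: each classifier/sliding-position pair gets a target class from the overlap of its segment with the labeled keyword region, and the negative log posterior of that class is accumulated. The threshold value and the background class index are assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two time intervals, as in the overlap definition above."""
    s, e = a
    s2, e2 = b
    inter = max(0.0, min(e, e2) - max(s, s2))
    union = max(e, e2) - min(s, s2)
    return inter / union if union > 0 else 0.0

def segment_loss(posteriors, windows, keyword_region, keyword_class, iou_thresh=0.5):
    """posteriors[i][j]: length-C posterior vector of classifier i at sliding position j.
    windows[i][j]: the segment [s_ij, e_ij] that position maps to on the input audio."""
    loss = 0.0
    for i, clf_windows in enumerate(windows):
        for j, w in enumerate(clf_windows):
            # I_c(i, j): keyword class if the window overlaps the label enough, else background (0).
            target = keyword_class if iou(keyword_region, w) >= iou_thresh else 0
            loss -= np.log(posteriors[i][j][target] + 1e-12)
    return loss
```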
The function L is an error function of the input audio, which is a complex function of the DRN parameters. The error back propagation is to find the gradient of the error function with respect to each parameter and update the parameter with the gradient, so that the model learns the characteristics of the training data. If for parameter w, the update policy is:
w ← w − lr · ∂L/∂w

where lr is the learning rate and ∂L/∂w is the gradient of the error function in the direction of the parameter w.
As training progresses, the learning rate is continuously reduced: a large learning rate in the early generations helps escape local optima, and a small learning rate in the later generations helps approach the global optimum. An initial learning rate is generally set between 0.001 and 0.1, and the learning process is then handed to an optimizer. Common optimizers are the Stochastic Gradient Descent optimizer (SGD), the AdaGrad optimizer, the Adam optimizer, and so on; their main difference lies in how the learning rate is decreased.
The parameters are not updated every time a single voice is input, but once each time a fixed number of voices have been input; such a group is called a batch, and its size is called the batch-size. The batch-size is generally an integer power of 2, such as 64, 128 or 256, and can be adjusted according to the length of the input speech to facilitate distributed training.
Generally, 90% of the training data participates in the actual training: the data is fed to the model batch by batch and the parameters are updated in sequence, and after all batches have been processed, one generation (epoch) of training is completed. After each generation, the model performance is tested on the other 10% of the data, which is called validation. A model usually needs to be trained for multiple generations to achieve the expected effect, and the training termination condition can be set in 2 ways: first, set a fixed number of generations, e.g., train for 30 generations and then stop; second, terminate training when the performance of the model on the validation data set no longer changes (or changes only very slightly).
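The training strategy above (90%/10% split, batch-wise updates, validation after every generation, two possible stopping rules) can be sketched as follows in PyTorch; the model, dataset, learning rate and patience value are placeholders.

```python
import torch
from torch.utils.data import DataLoader, random_split

def train(model, dataset, epochs=30, batch_size=128, lr=0.01, patience=3):
    """Sketch: train until the epoch budget is used up or validation loss stops improving."""
    n_train = int(0.9 * len(dataset))
    train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:              # one parameter update per batch
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # validation performance no longer improves
                break
```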
The following will explain the specific process of the speech recognition method of the present application in the application scenario of voice wakeup.
The awakening word recognition comprises the steps of awakening audio acquisition and processing, feature extraction, DRN recognition, awakening decision and the like.
(1) Wake-up audio acquisition and processing
In the process of identifying the awakening word, the awakening audio is a streaming audio and needs to be continuously acquired and processed. The sample collection can be done with a window size of 1 second, with the window sliding step of 100ms, i.e. every 100ms the earliest 100ms is removed from the window head and the latest 100ms is added at the window tail, as shown in fig. 25. The obtained Voice segments can be subjected to Voice endpoint Detection (VAD), if the VAD Detection result is invalid Voice, subsequent operations such as feature extraction and DRN identification are not needed, and computing resources are saved.
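The streaming collection described above can be sketched with a fixed-length ring buffer: every 100 ms the oldest 100 ms falls out and the newest 100 ms is appended, and non-speech windows are skipped. The sampling rate and the VAD interface are assumptions.

```python
from collections import deque
import numpy as np

SR = 16000
WIN = SR              # 1 s analysis window
HOP = SR // 10        # 100 ms slide

buffer = deque(maxlen=WIN)

def on_new_audio(chunk, is_speech):
    """Called every 100 ms with the newest samples; `is_speech` stands in for the VAD result."""
    buffer.extend(chunk)                  # the oldest samples are dropped automatically
    if len(buffer) < WIN:
        return None                       # not enough audio collected yet
    if not is_speech:
        return None                       # skip feature extraction and DRN on invalid speech
    return np.array(buffer, dtype=np.float32)   # 1 s segment handed to feature extraction
```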
In another mode, if the VAD detection result is valid speech, rough segments of valid keywords in the speech segment can be further roughly captured, and several windows with higher overlap ratio with the rough segments of the keywords are selected from candidate windows with different scales; and finding the best position of each selected window, namely the position with the highest coincidence degree with the rough keyword segment, and extracting the local feature map contained in the window.
(2) Feature extraction
Every time audio of 1 second length is acquired, features, such as MFCC features or PLP features, are extracted for the segment of audio. It should be noted that the voice feature types in the process of identifying the awakening words should be consistent with the features of the model training.
(3) DRN identification
The features extracted from each section of audio frequency need to be input into a trained DRN model, the DRN model is propagated in the forward direction, and finally all multi-scale classifiers can obtain the posterior probability of each classification of windows with different scales at each sliding position.
Alternatively, if several windows with high overlap ratio with the rough keyword segment have been selected from candidate windows with different scales in step (1), only the posterior probability of each classification of the preset sliding position of the selected window size may be determined.
(4) Wake-up decision
The method of wake-up decision is different in the single wake-up word and multiple wake-up words scenarios.
Case 1: voice wakeup of a single wakeup word
For each multi-scale classifier, the confidence level of the appearance of the wake word is calculated in all sliding positions of all windows. The confidence of the awakening word is formed by fusing the probability of each sound-producing subunit. For example, if the posterior probability of hi is 0.4 and the posterior probability of Bixby is 0.8, the posterior probability of the whole awakening word is
Figure BDA0002982926580000261
In all the positions of all the windows in each classifier, the highest confidence level of the awakening word is the confidence level of the classifier for recognizing the input voice as the keyword in the current receptive field. The highest value of the confidence of all classifiers is the confidence of the whole model. When the confidence is higher than a preset threshold, a wake-up decision can be made.
Case 2: voice wakeup of multiple wakeup words
There are two ways to wake up based on the probability of the keyword:
the first method is as follows:
For each multi-scale classifier, over all sliding positions of all windows, the positive class with the highest posterior probability and the negative class with the highest posterior probability are selected, and the two corresponding sub-segments are obtained by mapping; these are the segments in which, at the current scale, the multi-scale classifier considers the positive class and the negative class most likely to appear.
All the sliding positions of all the windows may be the preset sliding positions of directly preset windows of several sizes, or the preset sliding positions of window sizes obtained by selecting, from candidate windows of various sizes, several windows with a high degree of overlap with the rough keyword segment.
Let the positive class identified by the ith multi-scale classifier be c_i, the corresponding segment be [s_pi, e_pi], and the segment identified as the inverse class be [s_ni, e_ni]. The final decision is then:

decision = argmax(p_i(c_i), p_i(neg)), i = 1, 2, …, N (5)

i.e., the classification with the highest probability among the positive and negative identification results of all the multi-scale classifiers, where p_i(c_i) is the probability with which the ith multi-scale classifier identifies the positive class c_i, and p_i(neg) is the probability with which the ith multi-scale classifier identifies the result as an inverse class.
That is, in the subintervals corresponding to the sliding positions of the windows, the maximum first posterior probability of the positive class and the maximum second posterior probability of the negative class are determined, and if the maximum first posterior probability is larger than the maximum second posterior probability, voice awakening is performed.
The second method comprises the following steps:
The classification probability of the inverse class may also be ignored; only the classification probabilities of the positive classes are considered, and a threshold is used to make the wake-up decision:

decision = argmax_i p_i(c_i) if max_i p_i(c_i) > thresh, otherwise neg (6)

where thresh represents a preset threshold and neg means that no wake-up is performed.
That is, in the subintervals corresponding to the sliding positions of the windows, the maximum first posterior probability of the positive class is determined, and if the maximum first posterior probability of the positive class is greater than a preset threshold, the awakening is performed.
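Both decision rules can be sketched with one small function: with no threshold it follows the first method (compare the best positive class against the best negative class), and with a threshold it follows the second method (ignore the negative class). The example probabilities and wake-word names are hypothetical.

```python
def decide(pos_probs, pos_classes, neg_probs, thresh=None):
    """pos_probs[i]  : highest positive-class posterior found by the i-th classifier
       pos_classes[i]: which wake-up word that posterior belongs to
       neg_probs[i]  : highest negative-class posterior of the i-th classifier"""
    best_i = max(range(len(pos_probs)), key=lambda i: pos_probs[i])
    if thresh is None:                                      # method one
        return pos_classes[best_i] if pos_probs[best_i] > max(neg_probs) else "neg"
    return pos_classes[best_i] if pos_probs[best_i] > thresh else "neg"   # method two

probs = [0.7, 0.9, 0.5]
names = ["hi Bixby", "ok Bixby", "hello Bixby"]
negs = [0.6, 0.4, 0.8]
print(decide(probs, names, negs))              # method one -> "ok Bixby"
print(decide(probs, names, negs, thresh=0.8))  # method two -> "ok Bixby"
```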
The voice wake-up procedure may include the steps of:
1) the output feature map of a particular DRU is input into a multi-scale classifier as an input;
2) the VAD result is input into each multi-scale classifier to select a classification scale;
3) obtaining a classification result of each scale in each multi-scale classifier, namely obtaining the probability of containing each keyword for each scale;
4) based on the classification results of the different multi-scale classifiers, it is determined whether a keyword is present and a wake-up decision is made.
In an example, as shown in fig. 26, the maximum first posterior probability of the positive class is determined in the subintervals corresponding to the sliding positions of the windows of each classifier. Each multi-scale classifier provides a top1 keyword estimate, that is, the keyword with the maximum probability over all scales; in fig. 26, the maximum-probability keyword determined by multi-scale classifier 1 is keyword 2. The multi-scale classifiers of the different levels then vote together to obtain the final recognition result. The multi-scale classifiers of different layers are assigned different weights: deeper layers capture more global and complex features and are therefore assigned higher weights, and the top1 probability of each classifier is multiplied by its assigned weight to obtain a weighted score. The keyword with the highest weighted score is selected as the final recognition result; although a shallow layer is assigned a lower weight, its keyword may still be the final decisive one if its weighted score is higher. The maximum weighted score is compared with a threshold: if it is greater, the corresponding word is detected, otherwise the detection fails (non-keyword).
As shown in fig. 26, the posterior probability for the keyword 2 is 0.9, and multiplying 0.9 by a weight of 0.7 yields a first posterior probability for the keyword 2 of 0.63; similarly, the first posterior probability for the keyword 2 based on the multi-scale classifier 2 is 0.56, and the first posterior probability for the keyword 1 based on the multi-scale classifier 3 is 0.54; voting is carried out on the first posterior probabilities obtained by the classifiers with different scales to obtain the maximum first posterior probability 0.63 for the keyword 2; if 0.63 is larger than the threshold value, the keyword 2 is detected and awakening is carried out; if 0.63 is less than or equal to the threshold, no keywords are detected.
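The weighted voting of fig. 26 can be sketched as below. Only the 0.9 probability and 0.7 weight of the first classifier are stated in the example; the raw probabilities and weights of the other two classifiers are assumed so that their weighted scores come out as 0.56 and 0.54.

```python
def weighted_vote(top1_results, threshold=0.6):
    """top1_results: (keyword, probability, layer_weight) from each multi-scale classifier.
    Deeper classifiers receive larger weights; the keyword with the highest
    probability x weight score wins, and wakes the device only if it beats the threshold."""
    best_kw, best_score = max(((kw, p * w) for kw, p, w in top1_results), key=lambda t: t[1])
    return best_kw if best_score > threshold else None     # None = non-keyword

results = [("keyword 2", 0.9, 0.7),   # 0.63
           ("keyword 2", 0.8, 0.7),   # 0.56 (assumed split of the weighted score)
           ("keyword 1", 0.9, 0.6)]   # 0.54 (assumed split of the weighted score)
print(weighted_vote(results))         # "keyword 2", since 0.63 > 0.6
```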
According to the voice recognition method, the feature graphs of the voice fragments are extracted through the recognition network, and the extracted feature graphs are classified, so that the recognition efficiency is improved.
Furthermore, the recognition network comprises at least one feature map processing sub-network, the extracted feature map is subjected to channel compression and depth-wise operation through the at least one feature map processing sub-network, then channel recovery is carried out, the feature map processed by the feature map processing sub-network is classified, so that the voice segments are recognized, and the operation amount in the recognition process can be effectively reduced.
Furthermore, the output of each group of DRUs is used as a corresponding second feature map, each group of DRUs comprises at least one DRU, the second feature maps are classified to obtain the probability that the voice fragment contains the keyword, the second feature maps output by the DRUs in the middle layer can be used for effectively capturing detailed features, and overall control can be achieved, so that the accuracy of voice recognition is improved.
In order to more clearly illustrate the speech recognition method of the present application, the recognition network of the present application will be described below with reference to examples.
In one example, speech segments are sampled from continuous speech with a window of 1 s and a step size of 0.5 s. For each speech segment, a 20 Hz–4 kHz bandpass filter is first applied to reduce noise, followed by a voice endpoint detection module to detect the presence of human speech. Thereafter, 40-dimensional feature vectors (13-dimensional MFCC + 13-dimensional first-order difference + 13-dimensional second-order difference + 1-dimensional energy) are constructed with a window of 30 ms and a step size of 10 ms, and the resulting feature vectors are stacked into a 2D picture and input into the DRN.
The whole speech recognition process is shown in fig. 27: the feature vectors are spliced into a two-dimensional time-frequency matrix by the preprocessing module and sent to the DRN as input. The DRN is in effect a word classifier that computes the posterior probabilities of all the keywords/words for the input feature matrix. The post-processing module integrates these posteriors and makes the final decision.
The DRN is designed on the basis of the residual neural network (ResNet), whose shortcut connections help train deeper models. However, the latest ResNet solution [5] still requires a larger model size (about 250K parameters). Therefore, 1x1 convolutions and depth-wise convolutions are applied in this example to optimize the residual units.
In order to more clearly illustrate the speech recognition method of the present application, a feature diagram processing sub-network (DRN) of the present application will be described below with reference to an example.
In one example, as shown in fig. 28, a 1x1 convolution may be added at the head and tail of each existing residual unit and the existing 3x3 convolution replaced with a 3x3 depth-wise convolution, resulting in an optimized Depth Residual Unit (DRU). The head 1x1 convolution (which may be referred to as a 1x1 cross-channel convolution) helps to reduce the number of channels: to reduce the number of parameters and the amount of computation of the intermediate depth-wise layer, if the number of channels of the input feature map of the DRU is n, the number of channels of the feature map obtained by the head 1x1 convolution is n/2. The tail 1x1 convolution (also a 1x1 cross-channel convolution) helps recover the channel number so that the output shape stays the same as the input shape. The middle 3x3 depth-wise convolutional layer performs per-channel convolution, i.e., the convolution operations are performed separately on the different channels without merging them together, so the middle 3x3 depth-wise convolution requires only 1 kernel.
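A sketch of the DRU in PyTorch is given below: a head 1x1 convolution halving the channel count, a 3x3 depth-wise convolution, a tail 1x1 convolution restoring the channel count, and the shortcut connection. The activation placement and the absence of bias/normalization terms are assumptions made so that the parameter count matches the figure quoted in the text.

```python
import torch
import torch.nn as nn

class DRU(nn.Module):
    """Depth Residual Unit sketch: 1x1 compression, 3x3 depth-wise convolution, 1x1 recovery."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 2                                   # compress to half the channels
        self.compress = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(mid, mid, kernel_size=3, padding=1,
                                   groups=mid, bias=False)    # per-channel 3x3 convolution
        self.recover = nn.Conv2d(mid, channels, kernel_size=1, bias=False)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.compress(x))
        out = self.act(self.depthwise(out))
        out = self.recover(out)
        return self.act(out + x)                              # shortcut connection

x = torch.randn(1, 16, 100, 40)          # (batch, channels, time, frequency)
print(DRU(16)(x).shape)                  # torch.Size([1, 16, 100, 40])
print(sum(p.numel() for p in DRU(16).parameters()))   # 328, matching the count below
```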
Taking the first DRU in DRN10 as an example, the input feature map contains 16 channels and the kernel size of the depth-wise layer is 3x3. Therefore, the number of parameters of the DRU is 1×1×16×8×2 + 3×3×8 = 328. For the original residual unit with the same number of input channels, the number of parameters is 3×3×16×16 = 2304. Our method reduces the number of parameters by more than 7 times.
In order to more clearly illustrate the speech recognition method of the present application, the multi-scale classifier of the present application will be described below with reference to an example.
In one example, fig. 29 is a schematic diagram of the principle of the multi-scale classifier under different speaking styles and in noisy environments. Classification in different receptive fields can take both local and global features into account, and classifying sub-regions of different sizes identifies the input audio from different angles; this is why the multi-scale classifier helps to improve robustness. Generally, among all the windows of different sizes there is one window of optimal size, and windows of different proportions can be moved across the feature map to find the optimal sliding position. To cope with different speaking styles, the multi-scale classifier can find the window of optimal sliding position and optimal size that covers the keyword segment, filtering out white noise and improving the recognition precision. For noisy environments, the window of optimal sliding position and optimal size avoids interference caused by extra noise segments, making the result more reasonable.
In order to more clearly illustrate the speech recognition method of the present application, the speech recognition method of the present application will be described below with reference to examples and a speech wake-up scenario.
In one example, when the speech recognition method of the present application is used for performing voice wakeup, the method may include the following steps:
1) acquiring a voice segment to be processed, and performing filter noise reduction processing and voice endpoint detection on the audio segment;
2) if the voice segment is valid voice, extracting a first feature map of the voice segment;
3) inputting the first feature map into a plurality of feature map processing sub-networks, namely DRUs, which are stacked in sequence, and taking the output of the last feature map processing sub-network as a second feature map;
4) performing average pooling on the second feature map;
5) inputting the average pooled second feature map into a full-connection classifier to obtain the posterior probability of each classification so as to determine the probability of the speech segment containing the keywords;
6) and if the probability of the contained keywords is greater than the preset probability, performing voice awakening.
In another example, the number of the wake-up words is 1, and when the speech recognition method of the present application is used for performing speech wake-up, the method may include the following steps:
1) acquiring a voice segment to be processed, and performing filter noise reduction processing and voice endpoint detection on the audio segment;
2) if the voice segment is valid voice, extracting a first feature map of the voice segment;
3) inputting the first feature map into a plurality of groups of DRUs stacked in sequence, wherein each group of DRUs comprises at least one DRU;
4) taking the output of each group of DRUs as a second feature map corresponding to the group of DRUs;
5) aiming at a second feature map corresponding to each group of feature map processing sub-network, determining a sub-region corresponding to a preset sliding position of each window in the second feature map based on a plurality of windows with preset sizes;
6) for each second feature map, determining the posterior probability of each sound generating subunit of which each subregion contains a keyword in the second feature map;
7) determining the confidence level of each sub-region containing a keyword based on the posterior probability of each sound production subunit of each sub-region containing the keyword;
8) taking the highest confidence coefficient in the confidence coefficients of the keywords contained in each subarea in each second feature map as the probability that the speech segment contains the keywords;
9) and if the probability of the keyword is greater than a preset threshold value, performing voice awakening.
In another example, the number of the wake-up words is multiple, and when the speech recognition method of the present application is used for performing speech wake-up, the method may include the following steps:
1) acquiring a voice segment to be processed, and performing filter noise reduction processing and voice endpoint detection on the audio segment;
2) if the voice segment is an effective voice, determining that the voice segment comprises the approximate position of the keyword, namely an effective voice area;
3) selecting a plurality of windows with higher coincidence degree with the effective voice area from various windows with different candidate sizes, and determining the preset size and the preset sliding position of the window;
4) extracting a first feature map of the voice segment;
5) inputting the first feature map into a plurality of groups of DRUs stacked in sequence, wherein each group of DRUs comprises at least one DRU;
6) taking the output of each group of DRUs as a second feature map corresponding to the group of DRUs;
7) for a second feature map corresponding to each group of DRUs, determining a sub-region corresponding to a preset sliding position of each window in the second feature map based on the determined window with a preset size;
8) determining a first posterior probability that each subregion of each second feature map is of a positive class;
9) determining the probability of containing the keywords in the voice fragment based on the maximum first posterior probability;
10) and if the maximum first posterior probability is greater than a preset threshold value, performing voice awakening.
The voice recognition method has the following technical effects:
a) the size and complexity of the model can be greatly reduced. As with DRN10, the size and number of multiplication operations of the existing deep residual neural network can be compressed by a factor of about 7.
b) The robustness of the voice awakening model is improved; by adopting the multi-scale classifier, the noise immunity and the speaker adaptability of the voice awakening model are improved.
The effect of the speech recognition method of the present application will be explained below based on experimental data.
Assuming that the input/output size of the DRU in this application is consistent with the original residual unit and that the head 1x1 convolution reduces the number of channels of the feature map from c to c′, then the size of the head 1x1 convolution is (1, 1, c) and its number is c′, the size of the intermediate depth-wise convolution is (k_w, k_h, c′) and its number is 1, and the size of the tail 1x1 convolution is (1, 1, c′) and its number is c. Denote the number of DRU parameters as #para_DRU and the number of multiplications as #multi_DRU; then:

#para_DRU = 1×1×c×c′ + k_w×k_h×c′ + 1×1×c′×c (7)

#multi_DRU = w×h×c′×1×1×c + w×h×c′×k_w×k_h + w×h×c×1×1×c′ (8)

When w = 100, h = 40, c = 16, c′ = 8, k_w = 3 and k_h = 3, we have #para_org = 2304, #para_DRU = 328, #multi_org = 9216K and #multi_DRU = 1312K. It can be seen that the present application compresses the number of parameters and the number of multiplication operations by a factor of about 7.
The parameters corresponding to the fully-connected classifiers corresponding to each sub-region are the same, so that the number of the parameters of the classifiers can be reduced.
Suppose that the DRN contains k groups of DRUs in total and the numbers of channels of the groups are c_1, c_2, …, c_k. Because the fully-connected classifiers are weight-shared across all windows and sliding positions, the parameter increment brought by the multi-scale classifiers depends only on these channel numbers and on C, the classification number of the classifier. For example, let k = 3, c_1 = 16, c_2 = 32 and c_3 = 48; the resulting increment is only a small fraction of the total DRN parameter amount.
Therefore, the multi-scale classifier only causes little parameter increase, but greatly improves the classification precision.
The models of the present application and the prior art are evaluated from 2 dimensions (precision and footprint) using 3 indices (false rejection rate FRR, parameter quantity param and multiplier mul):
wherein FRR represents the proportion of inputs that should be recognized as keywords but fail to be, and a smaller value is better; param represents the size of the model, the smaller the better; and mul represents the computational complexity and delay of the algorithm, the smaller the better.
The comparison result shows that the performance of the DRN8 can be compared with that of the existing res8-narrow, but the used parameters are reduced by 1.6 times, and the multiplication is reduced by 3.4 times; the performance of DRN15 is better than that of res15, parameters are reduced by 7 times, and multiplication is reduced by 194 times; these show that the DRN of the present application maintains good performance with fewer parameters without using a multi-scale classifier.
Further testing was conducted by making artificial noise data sets and fast speaking data sets.
Build a noisy device by adding 3 types of noise (siren, car, and office) to the original clean test device; the rapidly spoken scene is constructed by 1.2 times time stretch, and the following comparison results can be obtained:
for DRN8, the multi-scale classifier improved FRR by 1.5%, 2.3%, and 2.7%, with 0.8K of additional param and 0.01M of mul, respectively, on clean, noisy, and fast speaking datasets;
for DRN15, the multi-scale classifier improved FRR by 0.4%, 0.9%, and 2.6%, respectively, over clean, noisy, and fast speaking datasets with 1.4K additional param, and almost no additional mul.
The above embodiment introduces the speech recognition method through an angle of a method flow, and the following is introduced through an angle of a virtual module, which is specifically as follows:
an embodiment of the present application provides a speech recognition apparatus 3000, as shown in fig. 30, the apparatus 3000 may include an obtaining module 3001 and a recognition module 3002, where:
an obtaining module 3001, configured to obtain a to-be-processed voice segment;
the recognition module 3002 is configured to extract a feature map of the speech segment based on a recognition network, and classify the extracted feature map to obtain a probability that the speech segment contains a keyword.
In a possible implementation manner of the embodiment of the present application, the recognition module 3002 is specifically configured to, when extracting the feature map of the speech segment based on the recognition network and classifying the extracted feature map to obtain the probability that the speech segment contains the keyword:
extracting a first feature map of the voice segment based on a feature extraction sub-network in the recognition network;
processing the first feature map based on at least one feature map processing sub-network in the identification network to obtain at least one second feature map;
and classifying the at least one second feature map based on at least one classifier in the recognition network to obtain the probability that the voice fragment contains the keywords.
In one possible implementation manner of the embodiment of the present application, for any one of at least one feature map processing sub-network, the feature map processing sub-network includes a first number of channel compression convolutions, a depth separable convolution and a second number of channel restoration convolutions;
in a possible implementation manner of the embodiment of the present application, for any one of the at least one feature map processing sub-network, the number of channels corresponding to the channel compression convolution of the feature map processing sub-network is equal to the number of channels corresponding to the input feature map of the feature map processing sub-network; the numerical value of the first quantity is smaller than the numerical value of the number of channels corresponding to the input feature map of the feature map processing sub-network;
the numerical value of the number of channels corresponding to the channel recovery convolution of the feature map processing sub-network is equal to the numerical value of the first number; the second number is equal to the number of channels corresponding to the profile of the input to the profile processing subnetwork.
In a possible implementation manner of the embodiment of the present application, the channel compression convolution is a 1 × 1 convolution; the channel recovery convolution is a 1x1 convolution.
In a possible implementation manner of the embodiment of the present application, when the identifying module 3002 processes the first feature map based on at least one feature map processing sub-network in the identifying network to obtain at least one second feature map, specifically configured to:
inputting a first feature map into a plurality of feature map processing subnetworks which are stacked in sequence;
wherein, the input of the first feature map processing sub-network is the first feature map; for any one of the feature map processing subnetworks except the first one, the input is the output of the last feature map processing subnetwork;
and taking the output of the last feature map processing sub-network as a second feature map.
In a possible implementation manner of this embodiment of the application, the identifying module 3002 processes the first feature map based on at least one feature map processing sub-network in the identifying network to obtain at least one second feature map, which is specifically configured to:
inputting the first feature map into a plurality of groups of feature map processing sub-networks which are stacked in sequence; wherein any group of feature map processing subnetworks comprises at least one feature map processing subnetwork;
and taking the output of each group of feature map processing sub-network as a second feature map corresponding to the group of feature map processing sub-network.
In a possible implementation manner of the embodiment of the present application, when classifying the at least one second feature map and obtaining a probability that the voice segment includes the keyword, the identifying module 3002 is specifically configured to:
for each second feature map in at least one second feature map, determining a sub-region corresponding to a preset sliding position of each window in the second feature map based on a plurality of windows with preset sizes;
and classifying each sub-region of each second feature map to determine the probability of the speech segment containing the keywords.
In a possible implementation manner of the embodiment of the present application, when classifying the at least one second feature map and obtaining a probability that the voice segment includes the keyword, the identifying module 3002 is specifically configured to:
performing voice endpoint detection on the voice segments, and determining an effective voice area in the first feature map;
from the plurality of candidate sizes, a preset size of a window matching the valid speech region is determined, and a corresponding preset slide position is determined.
In a possible implementation manner of the embodiment of the present application, when classifying each sub-region of each second feature map, the identifying module 3002 is specifically configured to:
performing average pooling on each sub-region in the second feature map;
and respectively inputting each sub-region after the average pooling into a corresponding full-connection classifier for classification to obtain the posterior probability of each sub-region belonging to each preset class.
In a possible implementation manner of the embodiment of the present application, in the fully-connected classifier corresponding to each sub-region, parameters of at least two fully-connected classifiers are the same.
In a possible implementation manner of the embodiment of the application, the number of the keywords is less than the preset classification number; when classifying each sub-region of each second feature map to determine the probability that the speech segment includes the keyword, the recognition module 3002 is specifically configured to:
for each second feature map, determining the posterior probability of each sound generating subunit of which each subregion contains a keyword in the second feature map;
determining the confidence level of each sub-region containing a keyword based on the posterior probability of each sound production subunit of each sub-region containing the keyword;
and taking the highest confidence coefficient in the confidence coefficients of the keywords contained in each subarea in each second feature map as the probability that the speech segment contains the keywords.
In a possible implementation manner of the embodiment of the application, the number of the keywords is greater than or equal to the preset classification number; when classifying each sub-region of each second feature map to determine the probability that the speech segment includes the keyword, the recognition module 3002 is specifically configured to:
determining a first posterior probability that each subregion of each second feature map is of a positive class;
if any sub-region comprises any sounding subunit in the plurality of keywords, the sub-region is of a positive type;
and determining the probability of containing the keywords in the voice segment based on the maximum first posterior probability.
In a possible implementation manner of the embodiment of the application, the number of the keywords is greater than or equal to the preset classification number; when classifying each sub-region of each second feature map to determine the probability that the speech segment includes the keyword, the recognition module 3002 is specifically configured to:
determining a first posterior probability that each subregion of each second feature map is of a positive class;
determining a second posterior probability that each subregion of each second feature map is of an inverse class;
if any sub-region comprises any sounding subunit in the plurality of keywords, the sub-region is of a positive type; if any sub-region does not comprise any sounding subunit in the plurality of keywords, the sub-region is of the reverse type;
and determining the probability of containing the keywords in the voice segment based on the maximum first posterior probability and the maximum second posterior probability.
The voice recognition device extracts the feature maps of the voice fragments through the recognition network and classifies the extracted feature maps, so that the recognition efficiency is improved.
Furthermore, the recognition network comprises at least one feature map processing sub-network, the extracted feature map is subjected to channel compression and depth-wise operation through the at least one feature map processing sub-network, then channel recovery is carried out, the feature map processed by the feature map processing sub-network is classified, so that the voice segments are recognized, and the operation amount in the recognition process can be effectively reduced.
Furthermore, the output of each group of DRUs is used as a corresponding second feature map, each group of DRUs comprises at least one DRU, the second feature maps are classified to obtain the probability that the voice fragment contains the keyword, the second feature maps output by the DRUs in the middle layer can be used for effectively capturing detailed features, and overall control can be achieved, so that the accuracy of voice recognition is improved.
The speech recognition apparatus according to the embodiments of the present disclosure may execute the speech recognition method provided by the embodiments of the present disclosure, and the implementation principle is similar; the actions performed by each module in the speech recognition apparatus according to the embodiments of the present disclosure correspond to the steps in the speech recognition method according to the embodiments of the present disclosure. For a detailed functional description of each module of the speech recognition apparatus, reference may be made to the description in the corresponding speech recognition method shown above, which is not repeated here.
The speech recognition apparatus provided in the embodiment of the present application is described above from the perspective of function modularization, and then the electronic device provided in the embodiment of the present application is described from the perspective of hardware implementation, and a computing system of the electronic device is also described.
Based on the same principle as the method shown in the embodiments of the present disclosure, embodiments of the present disclosure also provide an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing computer operating instructions; and the processor is used for executing the voice recognition method shown in the embodiment by calling the computer operation instruction. Compared with the prior art, the voice recognition method is easier to capture voice features from multiple scales, avoids the interference of non-keyword segments on recognition results, and enables the recognition results to be more accurate.
In an alternative embodiment, there is provided an electronic device, as shown in fig. 31, the electronic device 3100 shown in fig. 31 including: a processor 3101 and a memory 3103. Among other things, processor 3101 is coupled to memory 3103, e.g., via bus 3102. Optionally, the electronic device 3100 may also include a transceiver 3104. In addition, the transceiver 3104 is not limited to one in practical applications, and the structure of the electronic device 3100 is not limited to the embodiment of the present application.
The Processor 3101 may be a CPU (Central Processing Unit), a general purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 3101 may also be a combination of computing functions, e.g., comprising one or more microprocessors, a combination of DSPs and microprocessors, and the like.
Bus 3102 may include a path that transfers information between the above components. The bus 3102 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 3102 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 31, but this does not mean only one bus or one type of bus.
The Memory 3103 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
The memory 3103 is used for storing application program code for implementing the present scheme, and its execution is controlled by the processor 3101. The processor 3101 is configured to execute the application program code stored in the memory 3103 so as to implement the content shown in the foregoing method embodiments.
The electronic device includes, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in fig. 31 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium on which a computer program is stored; when the computer program runs on a computer, it enables the computer to execute the corresponding content of the foregoing method embodiments. Compared with the prior art, the speech recognition method more easily captures speech features at multiple scales and avoids interference from non-keyword segments on the recognition results, so that the recognition results are more accurate.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the acquisition module may also be described as "a module for acquiring a speech segment to be processed".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.

Claims (17)

1. A speech recognition method, comprising:
acquiring a speech segment to be processed;
and extracting a feature map of the speech segment based on a recognition network, and classifying the extracted feature map to obtain a probability that the speech segment contains a keyword.
2. The method of claim 1, wherein the extracting a feature map of the speech segment based on a recognition network and classifying the extracted feature map to obtain a probability that the speech segment contains a keyword comprises:
extracting a first feature map of the speech segment based on a feature extraction sub-network in the recognition network;
processing the first feature map based on at least one feature map processing sub-network in the recognition network to obtain at least one second feature map;
and classifying the at least one second feature map based on at least one classifier in the recognition network to obtain the probability that the speech segment contains a keyword.
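For illustration only, the three-stage structure recited in claim 2 (a feature extraction sub-network, at least one feature map processing sub-network, and at least one classifier) might be sketched as follows. The layer types, channel widths, pooling choice and the use of PyTorch are assumptions made for the sketch and are not details disclosed in this application.

```python
# Illustrative sketch only: layer sizes, module names and the use of PyTorch
# are assumptions; the application does not disclose a concrete implementation.
import torch
import torch.nn as nn

class RecognitionNetwork(nn.Module):
    def __init__(self, in_channels=1, feat_channels=32, num_classes=2):
        super().__init__()
        # Feature extraction sub-network: produces the "first feature map".
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_channels),
            nn.ReLU(),
        )
        # Feature map processing sub-network: produces the "second feature map".
        self.processor = nn.Sequential(
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Classifier: maps a pooled second feature map to class posteriors.
        self.classifier = nn.Linear(feat_channels, num_classes)

    def forward(self, spectrogram):           # (batch, 1, freq, time)
        first_map = self.feature_extractor(spectrogram)
        second_map = self.processor(first_map)
        pooled = second_map.mean(dim=(2, 3))  # global average pooling
        return torch.softmax(self.classifier(pooled), dim=-1)

# Usage: probability that a speech segment (here a random spectrogram) contains the keyword.
probs = RecognitionNetwork()(torch.randn(1, 1, 40, 100))
```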
3. The method of claim 2, wherein, for any one of the at least one feature map processing sub-network, the feature map processing sub-network comprises a first number of channel compression convolutions, a depthwise separable convolution, and a second number of channel recovery convolutions.
4. The method of claim 3, wherein, for any one of the at least one feature map processing sub-network, the number of channels corresponding to the channel compression convolutions of the feature map processing sub-network is equal to the number of channels of the feature map input to the feature map processing sub-network, and the first number is smaller than the number of channels of the feature map input to the feature map processing sub-network;
the number of channels corresponding to the channel recovery convolutions of the feature map processing sub-network is equal to the first number, and the second number is equal to the number of channels of the feature map input to the feature map processing sub-network.
5. The method of claim 3 or 4, wherein the channel compression convolutions are 1x1 convolutions and the channel recovery convolutions are 1x1 convolutions.
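For illustration only, one plausible reading of the bottleneck structure recited in claims 3-5 (1x1 channel compression convolutions, a depthwise separable convolution, and 1x1 channel recovery convolutions) is sketched below. The bottleneck width, the activation functions and the kernel size of the depthwise convolution are assumptions.

```python
# Sketch of one feature map processing sub-network as read from claims 3-5.
# Bottleneck width, activations and kernel size are assumptions.
import torch
import torch.nn as nn

class FeatureMapProcessingSubNetwork(nn.Module):
    def __init__(self, in_channels: int, bottleneck_channels: int):
        super().__init__()
        assert bottleneck_channels < in_channels  # "first number" < input channels (claim 4)
        # 1x1 channel compression convolutions: in_channels -> bottleneck_channels.
        self.compress = nn.Conv2d(in_channels, bottleneck_channels, kernel_size=1)
        # Depthwise separable convolution on the compressed feature map.
        self.depthwise = nn.Conv2d(bottleneck_channels, bottleneck_channels,
                                   kernel_size=3, padding=1,
                                   groups=bottleneck_channels)
        self.pointwise = nn.Conv2d(bottleneck_channels, bottleneck_channels,
                                   kernel_size=1)
        # 1x1 channel recovery convolutions: bottleneck_channels -> in_channels.
        self.recover = nn.Conv2d(bottleneck_channels, in_channels, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.act(self.compress(x))
        y = self.act(self.pointwise(self.depthwise(y)))
        return self.recover(y)                    # same channel count as the input

# Usage: the output keeps the input shape, e.g. (1, 32, 40, 100).
out = FeatureMapProcessingSubNetwork(32, 8)(torch.randn(1, 32, 40, 100))
```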
6. The method according to any of claims 2 to 5, wherein the processing the first feature map based on at least one feature map processing sub-network in the recognition network to obtain at least one second feature map comprises:
inputting the first feature map into a plurality of feature map processing sub-networks which are stacked in sequence;
wherein the input of the first feature map processing sub-network is the first feature map, and, for any feature map processing sub-network other than the first feature map processing sub-network, the input is the output of the preceding feature map processing sub-network;
and taking the output of the last feature map processing sub-network as the second feature map.
7. The method according to any of claims 2 to 5, wherein the processing the first feature map based on at least one feature map processing sub-network in the recognition network to obtain at least one second feature map comprises:
inputting the first feature map into a plurality of groups of feature map processing sub-networks which are stacked in sequence, wherein each group of feature map processing sub-networks comprises at least one feature map processing sub-network;
and taking the output of each group of feature map processing sub-networks as the second feature map corresponding to that group of feature map processing sub-networks.
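For illustration only, the grouped stacking of claims 6 and 7 might look like the following sketch, in which the groups are stacked in sequence and every group's output is collected as a second feature map (claim 6 corresponds to keeping only the last output). The number of groups, the blocks per group and the channel widths are assumptions.

```python
# Sketch of the grouped stacking in claim 7; group sizes and widths are assumptions.
import torch
import torch.nn as nn

def bottleneck_block(channels, hidden):
    # Stand-in for the feature map processing sub-network sketched above.
    return nn.Sequential(
        nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU(),
        nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden), nn.ReLU(),
        nn.Conv2d(hidden, channels, kernel_size=1),
    )

class StackedProcessor(nn.Module):
    def __init__(self, channels=32, blocks_per_group=(2, 2, 2), hidden=8):
        super().__init__()
        self.groups = nn.ModuleList([
            nn.Sequential(*[bottleneck_block(channels, hidden) for _ in range(n)])
            for n in blocks_per_group
        ])

    def forward(self, first_feature_map):
        second_feature_maps = []
        x = first_feature_map
        for group in self.groups:          # groups are stacked in sequence
            x = group(x)
            second_feature_maps.append(x)  # one second feature map per group
        return second_feature_maps

maps = StackedProcessor()(torch.randn(1, 32, 40, 100))  # three second feature maps
```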
8. The method according to any one of claims 2 to 7, wherein the classifying the at least one second feature map to obtain the probability that the speech segment contains the keyword comprises:
for each second feature map in the at least one second feature map, determining, based on a plurality of windows of preset sizes, the sub-regions corresponding to the sliding positions of each window in the second feature map;
and classifying each sub-region of each second feature map to determine the probability that the speech segment contains a keyword.
9. The method according to any one of claims 2 to 7, wherein the classifying the at least one second feature map to obtain the probability that the speech segment contains the keyword comprises:
performing voice endpoint detection on the speech segment, and determining an effective voice area in the first feature map;
determining, from a plurality of windows of candidate sizes, a window matching the effective voice area, and determining a corresponding sliding position based on the effective voice area;
for each second feature map in the at least one second feature map, determining, according to the determined window and sliding position, the sub-region corresponding to the sliding position of the window in the second feature map;
and classifying each sub-region of each second feature map to determine the probability that the speech segment contains a keyword.
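For illustration only, the window selection of claim 9 might be sketched as below. The energy-threshold endpoint detection and the rule of choosing the smallest candidate window that still covers the effective voice area are assumptions; the claim only requires selecting a window that matches the effective voice area and deriving a sliding position from it.

```python
# Sketch of the window selection in claim 9; the VAD rule and the matching rule
# are assumptions, not details disclosed in this application.
import numpy as np

def select_window(frame_energies, candidate_sizes=(30, 50, 70), threshold=0.1):
    voiced = frame_energies > threshold * frame_energies.max()
    idx = np.flatnonzero(voiced)
    if idx.size == 0:                        # no effective voice area found
        return max(candidate_sizes), 0
    start, end = idx[0], idx[-1] + 1         # effective voice area [start, end)
    length = end - start
    # Pick the smallest candidate window that still covers the voiced region;
    # fall back to the largest candidate if none is big enough.
    fitting = [w for w in candidate_sizes if w >= length]
    window = min(fitting) if fitting else max(candidate_sizes)
    slide_start = max(0, start - (window - length) // 2)  # roughly centre the window
    return window, slide_start

window, position = select_window(np.random.rand(100))
```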
10. The method according to claim 8 or 9, wherein the classifying each sub-region of each second feature map comprises:
performing average pooling on each sub-region in the second feature map;
and inputting each average-pooled sub-region into a corresponding fully connected classifier for classification, to obtain the posterior probability that each sub-region belongs to each preset category.
11. The method of claim 10, wherein the parameters of at least two of the fully connected classifiers corresponding to the sub-regions are the same.
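For illustration only, the sliding-window classification of claims 8, 10 and 11 might be sketched as below: windows of several preset sizes slide over a second feature map, each sub-region is average-pooled, and a fully connected classifier whose parameters are shared across sub-regions (claim 11) outputs the posterior probabilities. The window sizes, the stride and sliding only along the time axis are assumptions.

```python
# Sketch of claims 8, 10 and 11; window sizes, stride and time-axis-only sliding
# are assumptions made for the sketch.
import torch
import torch.nn as nn

class SlidingWindowClassifier(nn.Module):
    def __init__(self, channels=32, num_classes=2, window_sizes=(30, 50, 70)):
        super().__init__()
        self.window_sizes = window_sizes
        self.fc = nn.Linear(channels, num_classes)   # shared parameters (claim 11)

    def forward(self, second_map, stride=10):        # (batch, C, freq, time)
        posteriors = []
        for w in self.window_sizes:
            for t in range(0, second_map.size(-1) - w + 1, stride):
                sub_region = second_map[..., t:t + w]          # sub-region (claim 8)
                pooled = sub_region.mean(dim=(2, 3))           # average pooling (claim 10)
                posteriors.append(torch.softmax(self.fc(pooled), dim=-1))
        return torch.stack(posteriors, dim=1)  # (batch, num_sub_regions, num_classes)

probs = SlidingWindowClassifier()(torch.randn(1, 32, 40, 100))
```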
12. The method according to claim 8 or 9, wherein the number of keywords is less than a preset number of categories; the classifying each sub-region of each second feature map to determine the probability that the speech segment contains the keyword includes:
for each second feature map, determining the posterior probability that each sub-region in the second feature map contains each sounding subunit of the keyword;
determining the confidence that each sub-region contains the keyword based on the posterior probabilities that the sub-region contains the sounding subunits of the keyword;
and taking the highest of the confidences that the sub-regions of the second feature maps contain the keyword as the probability that the speech segment contains the keyword.
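For illustration only, the confidence computation of claim 12 might be sketched as below. The claim does not specify how the posterior probabilities of the sounding subunits are combined into a confidence; the geometric mean used here is an assumption.

```python
# Sketch of claim 12. subunit_posteriors[r][u] is the posterior that sub-region r
# contains the u-th sounding subunit (e.g. syllable) of the keyword. Combining the
# subunit posteriors by geometric mean is an assumption.
import numpy as np

def keyword_probability(subunit_posteriors: np.ndarray) -> float:
    # subunit_posteriors: shape (num_sub_regions, num_subunits), values in [0, 1]
    confidences = np.exp(np.mean(np.log(subunit_posteriors + 1e-12), axis=1))
    return float(confidences.max())   # highest confidence over all sub-regions

print(keyword_probability(np.array([[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]])))
```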
13. The method according to claim 8 or 9, wherein the number of keywords is greater than or equal to the preset number of categories; the classifying each sub-region of each second feature map to determine the probability that the speech segment contains the keyword includes:
determining a first posterior probability that each sub-region of each second feature map is of a positive class;
wherein a sub-region is of the positive class if it comprises any sounding subunit of any of the plurality of keywords;
and determining the probability that the speech segment contains a keyword based on the maximum first posterior probability.
14. The method according to claim 8 or 9, wherein the number of keywords is greater than or equal to the preset number of categories; the classifying each sub-region of each second feature map to determine the probability that the speech segment contains the keyword includes:
determining a first posterior probability that each sub-region of each second feature map is of a positive class;
determining a second posterior probability that each sub-region of each second feature map is of a negative class;
wherein a sub-region is of the positive class if it comprises any sounding subunit of any of the plurality of keywords, and a sub-region is of the negative class if it does not comprise any sounding subunit of any of the plurality of keywords;
and determining the probability that the speech segment contains a keyword based on the maximum first posterior probability and the maximum second posterior probability.
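For illustration only, the positive/negative-class decision of claims 13 and 14 might be sketched as below. How the maximum first and second posterior probabilities are combined in claim 14 is not specified; the normalization used here is an assumption.

```python
# Sketch of claims 13 and 14. pos[r] / neg[r] are the posteriors that sub-region r
# is of the positive / negative class; the combination rule in claim 14 is assumed.
import numpy as np

def keyword_probability_claim13(pos: np.ndarray) -> float:
    return float(pos.max())                        # maximum first posterior probability

def keyword_probability_claim14(pos: np.ndarray, neg: np.ndarray) -> float:
    p, n = pos.max(), neg.max()
    return float(p / (p + n + 1e-12))              # assumed combination of the two maxima

pos = np.array([0.7, 0.95, 0.4]); neg = np.array([0.6, 0.2, 0.8])
print(keyword_probability_claim13(pos), keyword_probability_claim14(pos, neg))
```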
15. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring a speech segment to be processed;
and the recognition module is used for extracting a feature map of the speech segment and classifying the extracted feature map to obtain the probability that the speech segment contains a keyword.
16. An electronic device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the speech recognition method according to any one of claims 1-14.
17. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, which is loaded and executed by a processor to implement the speech recognition method according to any one of claims 1-14.
CN202110292651.2A 2020-05-07 2021-03-18 Voice recognition method and device, electronic equipment and computer readable storage medium Pending CN113628612A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2021/005732 WO2021225403A1 (en) 2020-05-07 2021-05-07 Electronic device for speech recognition and method of speech recognition using thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010379067 2020-05-07
CN2020103790676 2020-05-07

Publications (1)

Publication Number Publication Date
CN113628612A true CN113628612A (en) 2021-11-09

Family

ID=78377834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110292651.2A Pending CN113628612A (en) 2020-05-07 2021-03-18 Voice recognition method and device, electronic equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN113628612A (en)
WO (1) WO2021225403A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333850B (en) * 2022-03-15 2022-08-19 清华大学 Voice voiceprint visualization method and device
CN115662423B (en) * 2022-10-19 2023-11-03 博泰车联网(南京)有限公司 Voice control method, device, equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756061B2 (en) * 2011-04-01 2014-06-17 Sony Computer Entertainment Inc. Speech syllable/vowel/phone boundary detection using auditory attention cues
US9965685B2 (en) * 2015-06-12 2018-05-08 Google Llc Method and system for detecting an audio event for smart home devices
US11373672B2 (en) * 2016-06-14 2022-06-28 The Trustees Of Columbia University In The City Of New York Systems and methods for speech separation and neural decoding of attentional selection in multi-speaker environments
CN106409289B (en) * 2016-09-23 2019-06-28 合肥美的智能科技有限公司 Environment self-adaption method, speech recognition equipment and the household electrical appliance of speech recognition
CN110610707B (en) * 2019-09-20 2022-04-22 科大讯飞股份有限公司 Voice keyword recognition method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114399996A (en) * 2022-03-16 2022-04-26 阿里巴巴达摩院(杭州)科技有限公司 Method, apparatus, storage medium, and system for processing voice signal
CN115129923A (en) * 2022-05-17 2022-09-30 荣耀终端有限公司 Voice search method, device and storage medium
CN115129923B (en) * 2022-05-17 2023-10-20 荣耀终端有限公司 Voice searching method, device and storage medium

Also Published As

Publication number Publication date
WO2021225403A1 (en) 2021-11-11

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
López-Espejo et al. Deep spoken keyword spotting: An overview
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US11776530B2 (en) Speech model personalization via ambient context harvesting
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
US9818431B2 (en) Multi-speaker speech separation
CN108428447B (en) Voice intention recognition method and device
JP2021516369A (en) Mixed speech recognition method, device and computer readable storage medium
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US20060053009A1 (en) Distributed speech recognition system and method
CN111640456B (en) Method, device and equipment for detecting overlapping sound
CN112599127B (en) Voice instruction processing method, device, equipment and storage medium
CN112259101B (en) Voice keyword recognition method and device, computer equipment and storage medium
Vivek et al. Acoustic scene classification in hearing aid using deep learning
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
Rituerto-González et al. End-to-end recurrent denoising autoencoder embeddings for speaker identification
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN113889091A (en) Voice recognition method and device, computer readable storage medium and electronic equipment
Ng et al. Small footprint multi-channel convmixer for keyword spotting with centroid based awareness
Xu et al. Improve Data Utilization with Two-stage Learning in CNN-LSTM-based Voice Activity Detection
CN115132198B (en) Data processing method, device, electronic equipment, program product and medium
CN110807370A (en) Multimode-based conference speaker identity noninductive confirmation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination