CN113782014A - Voice recognition method and device

Info

Publication number: CN113782014A (granted as CN113782014B)
Application number: CN202111128230.2A
Authority: CN (China)
Prior art keywords: speech, recognized, voice, feature, data
Legal status: Granted; currently active
Other languages: Chinese (zh)
Inventor: 谢鲁源
Assignee (current and original): Lenovo Beijing Ltd
Events: application filed by Lenovo Beijing Ltd; publication of CN113782014A; application granted; publication of CN113782014B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/06 Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application provides a speech recognition method and apparatus. In the training stage of a wake-up word recognition model, model parameters suited to wake-up word recognition of speech data of different speech-rate categories are trained. When a user needs to wake an object to be woken (such as a terminal, or any application installed on it) by voice, the corresponding speech data to be recognized is acquired and first subjected to speech-rate recognition to obtain a speech-rate recognition result. After target model parameters matching this result are selected, the speech data to be recognized is input into the wake-up word recognition model configured with the target model parameters for processing. Compared with performing wake-up word recognition on speech data of various speech rates with fixed model parameters, this improves the accuracy of the wake-up word recognition result and thereby raises the wake-up rate of the object to be woken.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of speech processing, and more particularly, to a speech recognition method and apparatus.
Background
With the development of artificial intelligence, voice wake-up has become an important branch of speech recognition. It is widely applied in voice interaction systems such as mobile phones, smart homes, vehicle navigation and medical equipment, allowing a user to wake a device with a voice instruction (i.e., a wake-up word) that triggers the device to enter a specific working state and thus meets the user's needs.
In wake-up word detection, a sliding window of preset length can be used to extract features from the speech signal to be recognized, and whether the signal contains a preset wake-up word is determined from the extracted feature information. However, the speech rate of the speech produced by the user varies, and this approach cannot guarantee detection accuracy across different speech rates, which lowers the device wake-up rate.
Disclosure of Invention
In view of the above, the present application provides a speech recognition method, including:
acquiring speech data to be recognized;
performing speech-rate recognition on the speech data to be recognized to obtain a speech-rate recognition result;
acquiring target model parameters in a wake-up word recognition model that match the speech-rate recognition result, the wake-up word recognition model having model parameters for speech data of different speech rates; and
inputting the speech data to be recognized into the wake-up word recognition model configured with the target model parameters, and outputting a wake-up word recognition result for the speech data to be recognized.
Optionally, acquiring the target model parameters that match the speech-rate recognition result includes:
if the speech-rate recognition result indicates that the speech data to be recognized belongs to a first speech rate, selecting a target feature layer matching the first speech rate from a plurality of feature layers contained in the wake-up word recognition model, where the target feature layers corresponding to different speech rates differ, and/or the feature-mapping regions of different target feature layers differ.
Optionally, inputting the speech data to be recognized into the wake-up word recognition model configured with the target model parameters and outputting the wake-up word recognition result includes:
inputting the speech data to be recognized into the wake-up word recognition model, where the target feature layer performs feature extraction on it to obtain a target speech feature vector; and
performing wake-up word recognition on the target speech feature vector to obtain the wake-up word recognition result of the speech data to be recognized.
Optionally, acquiring the target model parameters, inputting the speech data to be recognized into the wake-up word recognition model configured with them, and outputting the wake-up word recognition result includes:
if the speech-rate recognition result indicates that the speech data to be recognized has a first probability of belonging to a first speech rate and a second probability of belonging to a second speech rate, determining a first feature extraction network matching the first speech rate and a second feature extraction network matching the second speech rate, where different feature extraction networks differ in the number of feature layers and/or in their feature-mapping regions;
inputting the speech data to be recognized into the first and second feature extraction networks respectively, and outputting corresponding first and second speech feature vectors;
acquiring a first weight vector for the first speech feature vector and a second weight vector for the second speech feature vector; and
processing the first and second speech feature vectors according to the first and second weight vectors to obtain the wake-up word recognition result of the speech data to be recognized.
Optionally, acquiring the first weight vector of the first speech feature vector and the second weight vector of the second speech feature vector includes:
fusing the first and second speech feature vectors to obtain a fused speech feature vector; and
performing speech-rate classification on the fused speech feature vector to obtain the first weight vector for the first speech feature vector and the second weight vector for the second speech feature vector.
Optionally, processing the first and second speech feature vectors according to the first and second weight vectors to obtain the wake-up word recognition result includes:
performing weighted fusion on the first speech feature vector, the first weight vector, the second speech feature vector and the second weight vector to obtain a target speech feature vector of the speech data to be recognized; and
performing wake-up word recognition on the target speech feature vector to obtain the wake-up word recognition result of the speech data to be recognized. A sketch of this weighted fusion follows.
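A minimal sketch of the weighted fusion step, assuming tensors of shape (batch, dim) for the features and (batch, 1) for the weights; the function and argument names are illustrative, not from the patent:

```python
import torch

def fuse_by_speech_rate(feat_a: torch.Tensor, feat_b: torch.Tensor,
                        w_a: torch.Tensor, w_b: torch.Tensor) -> torch.Tensor:
    # Normalize so the two weights sum to 1 per sample, then mix the
    # two speech-rate-specific feature vectors into one target vector.
    total = w_a + w_b
    return (w_a / total) * feat_a + (w_b / total) * feat_b
```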
Optionally, the faster the speech rate of the speech data, the fewer feature layers it corresponds to, and/or the smaller the feature-mapping region of each feature layer.
Optionally, acquiring the speech data to be recognized includes:
acquiring a speech signal to be recognized;
performing frame-wise feature extraction on the speech signal to be recognized to obtain corresponding speech-frame feature vectors; and
forming the speech data to be recognized from a plurality of these speech-frame feature vectors.
Optionally, performing speech-rate recognition on the speech data to be recognized includes:
inputting the speech data to be recognized into a speech-rate classification model and outputting the speech-rate recognition result, where the result comprises the predicted probabilities that the speech data to be recognized belongs to different speech-rate categories. The sketch after this paragraph illustrates the overall method.
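As a minimal end-to-end sketch of the method summarized above, assuming PyTorch-style modules; the module names, the parameter-bank layout and the 0.5 decision threshold are assumptions for illustration, not details from the patent:

```python
import torch

def recognize_wake_word(waveform, encoder, rate_classifier, kw_model, param_bank):
    feats = encoder(waveform)                 # frame features, i.e. speech data to be recognized
    rate_probs = rate_classifier(feats)       # e.g. P(slow), P(normal), P(fast)
    rate_idx = int(rate_probs.argmax())
    kw_model.load_state_dict(param_bank[rate_idx])  # target model parameters
    score = kw_model(feats)                   # P(preset wake-up word present)
    return bool(score > 0.5)
```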
In another aspect, the present application further provides a speech recognition apparatus, including:
a speech data acquisition module, configured to acquire speech data to be recognized;
a speech-rate recognition module, configured to perform speech-rate recognition on the speech data to be recognized to obtain a speech-rate recognition result;
a target model parameter acquisition module, configured to acquire target model parameters in a wake-up word recognition model that match the speech-rate recognition result; and
a wake-up word recognition module, configured to input the speech data to be recognized into the wake-up word recognition model configured with the target model parameters and output a wake-up word recognition result for the speech data to be recognized.
In summary, the present application provides a speech recognition method and apparatus. In the training stage of the wake-up word recognition model, model parameters suited to wake-up word recognition of speech data of different speech-rate categories are trained. When a user needs to wake an object to be woken (such as a terminal, or any application installed on it) by voice, the corresponding speech data to be recognized is acquired and first subjected to speech-rate recognition; after target model parameters matching the result are selected, the speech data is input into the wake-up word recognition model configured with those parameters. Compared with performing wake-up word recognition on speech data of various speech rates with fixed model parameters, this improves the accuracy of the wake-up word recognition result and thereby raises the wake-up rate of the object to be woken.
Drawings
To explain the embodiments of the present application or the prior-art solutions more clearly, the drawings needed in their description are briefly introduced below. The drawings described below are merely embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an optional application-environment system for the speech recognition method proposed in the present application;
FIG. 2 is a schematic diagram of the hardware structure of an optional example of a computer device suitable for the speech recognition method proposed in the present application;
FIG. 3 is a schematic diagram of the hardware structure of yet another optional example of such a computer device;
FIGS. 4 to 9 are schematic flow charts of successive optional examples of the speech recognition method proposed in the present application;
FIGS. 10 to 12 are schematic structural diagrams of optional examples of the speech recognition apparatus proposed in the present application.
Detailed Description
To address the technical problems described in the Background, one way to recognize speech data of different speech rates more accurately is to simulate training-sample speech data of different speech rates through data augmentation in the model training stage, increasing speech-rate generalization so that the wake-up word recognition model trained on such samples is more reliable and accurate, i.e., making the speech recognition engine more robust.
However, this training method requires a large amount of balanced training-sample speech data across speech rates, which is difficult to obtain in practice, and the matching non-wake-word speech data also constrains model performance. This affects the reliability and accuracy of wake-up word recognition when a pre-trained model is applied to collected speech data, i.e., it reduces the recognition accuracy of the speech recognition engine.
To improve on this and raise wake-up word recognition accuracy, the present application trains adapted model parameters in advance for speech data of different speech-rate categories. In practice, the wake-up word recognition model configured with the parameters corresponding to the speech-rate category of the collected speech data performs the recognition, yielding a more accurate result and solving the problem that a model with fixed pre-trained parameters cannot guarantee accuracy across speech rates.
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present application; all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present application.
Referring to FIG. 1, a schematic diagram of an optional application-environment system suitable for the speech recognition method proposed in the present application is shown. The system may include a terminal 11 and a server 12, where:
the terminal 11 may be an electronic device equipped with a speech recognition engine (e.g., a voice assistant). A user may pre-configure, according to personal preference or habit, a wake-up word for waking the terminal or an application on it. When the user needs to use an application of the terminal 11, the user speaks speech data containing the wake-up word; the speech recognition engine of the terminal 11 performs wake-up word recognition on this speech data, determines that it contains the pre-configured wake-up word, and wakes the terminal 11.
When the speech recognition engine performs wake-up word recognition on the collected speech data, a pre-trained wake-up word recognition model can be used. In line with the technical concept of the present application, during training of this model, corresponding model parameters are trained for training-sample speech data of different speech-rate categories. Accordingly, the engine can first perform speech-rate recognition on the collected speech data, determine matching target model parameters, and then input the speech data into the wake-up word recognition model configured with those parameters, obtaining a highly accurate wake-up word recognition result. For the model-training process, reference may be made to the corresponding parts of the method embodiments below.
In some embodiments, the wake-up word recognition may be implemented by the terminal 11: after collecting the speech data produced by the user, the terminal calls the pre-trained wake-up word recognition model. In other embodiments, the terminal 11 may send the collected speech data over a wired or wireless communication network to the server 12, which executes the speech recognition method provided by the present application, obtains the wake-up word recognition result, and feeds it back to the terminal 11 to wake it. The executing subject of the speech recognition method is not limited in this application and is hereinafter referred to collectively as a computer device.
In practical applications, the terminal 11 may include, but is not limited to, a smartphone, a tablet computer, a wearable device (such as a smart watch or a smart bracelet), an augmented reality (AR) device, a virtual reality (VR) device, a vehicle-mounted device, a smart speaker, a robot, a smart home device, a smart transportation device, a smart medical device, and the like. The product type of the terminal 11 is not limited in this application and may be determined by the requirements of the application scenario.
The server 12 may be a service device supporting the speech recognition service of the terminal's speech recognition engine. It may be an independent physical server, a cluster of physical servers, or a cloud server providing cloud computing services, and it can exchange data with each terminal 11 over the Internet; the specific interaction depends on the speech recognition application scenario and is not detailed in this embodiment.
As described above, the wake-up word recognition model used in the proposed speech recognition method may be pre-trained on the server 12 for each terminal 11 to call, or the server 12 may call the pre-trained model directly when it executes the method itself, recognizing wake-up words in the speech data to be recognized sent by a terminal; the implementation is not detailed in this application.
It should be understood that the system shown in FIG. 1 does not limit the system architecture of the application environment of the proposed speech recognition method; in practice the system may include more devices, such as databases and other application servers, which are not listed here.
Referring to FIG. 2, a schematic diagram of the hardware structure of an optional example of a computer device suitable for the proposed speech recognition method is shown. The computer device may be the above-mentioned terminal or server and may include at least one memory 21 and at least one processor 22, where:
the memory 21 may store a program implementing the speech recognition method described above, and the processor 22 may load and execute that program to implement each step of the speech recognition method described in any of the method embodiments; for details, refer to the corresponding parts of those embodiments.
In this embodiment, the memory 21 may include high-speed random access memory and may further include non-volatile memory, such as at least one magnetic disk storage device or another non-volatile solid-state storage device. The processor 22 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
It should be understood that the structure shown in FIG. 2 does not limit the computer device of the embodiments of the present application; in practice the device may include more components than shown, or combine some components, such as various communication interfaces. Where the computer device is a terminal, as shown in FIG. 3, it may further include at least one input device such as a camera or a sound pickup; at least one output device such as a display or a speaker; a sensor module composed of various sensors; a power management module; antennas; and the like, which are not listed here.
Thus, in an application where the user wakes the terminal, the terminal's sound pickup can collect the user's speech data for wake-up word recognition, or the speech data can be sent to a server for recognition. In some embodiments, the speech data may also be collected by an independent voice-collection device and sent to the terminal or the server to perform wake-up word recognition for the terminal.
Referring to FIG. 4, a schematic flow chart of an optional example of the speech recognition method provided in the present application is shown. The method may be executed by a computer device, i.e., by a server or a terminal, or by the two cooperating. As shown in FIG. 4, the method may include:
Step S11, acquiring speech data to be recognized.
In a scenario where a terminal, or an application on it, needs to be woken by voice, the user can directly speak the preset wake-up word; a voice-collection device collects the corresponding speech data, and the speech data to be checked for the preset wake-up word is recorded as the speech data to be recognized.
In line with the above analysis, the voice-collection device may be built into the terminal or be an independent device. If integrated in the terminal, the terminal's processor can execute the subsequent steps on the collected speech data to be recognized, or the terminal can connect to a network through its communication module and send the data to a server, which executes the subsequent wake-up word recognition steps.
Similarly, speech data collected by an independent voice-collection device can be sent to the server, through the server's or the terminal's communication module, for the subsequent steps. The manner of acquiring the speech data to be recognized and of transmitting it to the computer device is not limited and can be determined by the requirements of the speech recognition scenario.
Step S12, performing speech-rate recognition on the speech data to be recognized to obtain a speech-rate recognition result.
As described above for the technical concept of the present application, in the training stage of the wake-up word recognition model, model parameters corresponding to training-sample speech data of different speech-rate categories are obtained. Therefore, before wake-up word recognition is performed on the speech data to be recognized, its speech rate must be detected.
A user's speech rate is generally expressed as the number of characters spoken per unit time, and different industries and scenarios place different demands on a speaker's rate. In practice, therefore, the number of characters spoken per unit time can be counted and the speech-rate range it falls into determined, thereby determining the speech-rate category to which the speech data belongs, as in the sketch below.
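A naive version of this counting approach; the character (or syllable) count per second and the range boundaries are assumptions, not values from the patent:

```python
def speech_rate_category(char_count: int, duration_s: float,
                         slow_max: float = 3.0, fast_min: float = 5.0) -> str:
    """Bucket characters-per-second into speech-rate categories."""
    rate = char_count / duration_s
    if rate <= slow_max:
        return "slow"
    return "fast" if rate >= fast_min else "normal"
```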
However, a speech-rate result obtained by such simple arithmetic often fails to fit the speaking habits of users in different industries and scenarios, reducing speech-rate recognition accuracy. To improve it, the present application can obtain the audio-frame feature sequence of the speech data to be recognized, input it into a pre-trained speech-rate recognition model to obtain the real-time speech rate, and determine the speech-rate category from the speech-rate ranges corresponding to the different categories. The application is, however, not limited to this speech-rate recognition method.
It follows that if the actual speech rate of the speech data to be recognized approaches a boundary value between two speech-rate ranges, it may be assigned to both categories associated with that boundary; that is, the speech-rate recognition result may be: the speech data to be recognized has a first probability of a first speech-rate category and a second probability of a second speech-rate category. Conversely, if the actual speech rate lies well inside one range, the data can be assigned to that category with certainty.
Accordingly, the speech-rate recognition model can directly output the probability that the input speech data belongs to each speech-rate category. If one probability reaches a threshold (for example, a 95% probability of normal rate), the input is determined to be that category; if the prediction is, say, 60% normal rate and 70% fast rate, the recognition result covers both categories. The training method of the speech-rate recognition model and the details of speech-rate recognition are not limited in this application.
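A sketch of that decision logic, assuming per-category probabilities (which, with independent sigmoid outputs, need not sum to 1) and an assumed threshold:

```python
def rate_decision(probs: dict, threshold: float = 0.85) -> list:
    """Keep one category if it clears the threshold, otherwise keep the
    two most likely categories (the boundary case described above)."""
    best = max(probs, key=probs.get)
    if probs[best] >= threshold:
        return [best]
    return sorted(probs, key=probs.get, reverse=True)[:2]

# e.g. rate_decision({"slow": 0.1, "normal": 0.6, "fast": 0.7}) -> ["fast", "normal"]
```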
Step S13, acquiring target model parameters in the wake-up word recognition model that match the speech-rate recognition result.
In line with the technical concept of the present application, during training of the wake-up word recognition model, model parameters corresponding to different speech-rate categories are obtained, so the model holds several trained sets of parameters, one per speech-rate category. In actual wake-up word recognition, the speech-rate recognition result of the speech data to be recognized determines which set of parameters the model should adopt to improve recognition accuracy.
Therefore, after the speech-rate recognition result is obtained: if the speech data to be recognized is determined to belong to a first speech rate (i.e., any one category), the model parameters mapped to that rate are taken as the target model parameters, according to the mapping between speech-rate categories and parameter sets; if the data may belong to either a first or a second speech rate, where the two are adjacent speech-rate categories, the first model parameters mapped to the first rate and the second model parameters mapped to the second rate together constitute the target model parameters. The content of the target model parameters is not limited by this application and is determined as the case may be.
In practice, the wake-up word recognition model determines whether words contained in the speech data to be recognized are wake-up words based on the speech feature information extracted from that data, so the accuracy and granularity of the extracted features directly affect the judgment. To obtain more comprehensive and finer-grained feature information while preserving recognition efficiency, a shorter sliding window can be used for feature extraction on fast-rate speech data, a longer window for slow-rate data, and a window of moderate length for normal-rate data.
Accordingly, the model parameters may include the network parameters that determine the sliding-window length in the model. For example, when feature extraction is implemented with a convolutional neural network, the model parameters may include the number of convolutional layers used for speech feature extraction, the kernel size of each layer, the convolution stride, and so on; an illustrative layout is sketched below.
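One possible shape for such a per-category parameter bank; the concrete numbers are assumptions for illustration only:

```python
# Shorter effective window (fewer layers, smaller kernels) for fast speech,
# longer window for slow speech; all values are hypothetical.
PARAM_BANK = {
    "fast":   {"num_conv_layers": 2, "kernel_size": 3, "stride": 1},
    "normal": {"num_conv_layers": 4, "kernel_size": 5, "stride": 1},
    "slow":   {"num_conv_layers": 6, "kernel_size": 7, "stride": 1},
}
```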
Step S14, inputting the speech data to be recognized into the wake-up word recognition model configured with the target model parameters, and outputting the wake-up word recognition result.
As described above, once the speech-rate category (one or more) of the speech data collected in the current scenario and the corresponding target model parameters of the trained model are determined, the speech data to be recognized is input into the wake-up word recognition model, which processes it using the target model parameters to obtain the recognition result, for example, the predicted probability that the data contains a preset wake-up word (one or more wake-up words pre-configured for the object to be woken). From this it is determined whether the speech data contains the preset wake-up word, i.e., whether it can wake the object. With the model parameters thus fixed, the wake-up word recognition process itself is not detailed here.
In summary, in this embodiment, model parameters suited to wake-up word recognition of each speech-rate category are trained during the model training stage, so that the trained wake-up word recognition model can adopt different parameters for speech data of different speech-rate categories. When a user who needs to use an object to be woken (such as a terminal or an application installed on it) speaks the preset wake-up word according to personal preference or habit, the captured speech data to be recognized first undergoes speech-rate recognition on the computer device; after the matching target model parameters are determined, the data is input into the model configured with them. Compared with a traditional model that applies fixed parameters to speech of all rates, this greatly improves the accuracy of the wake-up word recognition result, raising the wake-up rate of the object to be woken and improving the user experience.
Referring to FIG. 5, a schematic flow chart of a further optional example of the proposed speech recognition method is shown. This embodiment is an optional refined implementation of the method described in the above embodiment, though the method is not limited to this refinement; it can still be executed by a computer device. As shown in FIG. 5, the method may include:
Step S21, acquiring a speech signal to be recognized.
Step S22, performing frame-wise feature extraction on the speech signal to be recognized to obtain corresponding speech-frame feature vectors.
Step S23, forming the speech data to be recognized from a plurality of the speech-frame feature vectors.
To obtain a reliable speech-rate recognition result, frame-wise feature extraction may be performed on the speech signal directly collected by the voice-collection device; that is, the feature information of each speech frame is extracted. For example, a feature extraction network built from a deep neural network (also called an encoder, or coding network) maps the speech signal from a low-dimensional to a high-dimensional space, extracting the low-level text features, high-level semantic features, and so on that it contains, thereby obtaining speech feature information of different dimensions; this is what the present application records as the speech data to be recognized.
The implementation of feature extraction on the collected speech signal is not limited here and includes, but is not limited to, the deep-neural-network approach listed above; a minimal encoder sketch follows.
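A minimal encoder sketch in PyTorch, assuming 40-dimensional frame features (e.g., log-mel filterbanks) and illustrative layer sizes:

```python
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Maps per-frame features to higher-dimensional abstract features."""
    def __init__(self, in_dim: int = 40, hid_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, out_dim), nn.ReLU(),
        )

    def forward(self, frames):      # frames: (batch, time, in_dim)
        return self.net(frames)     # (batch, time, out_dim)
```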
Step S24, inputting the speech data to be recognized into the speech-rate classification model and outputting the speech-rate recognition result.
As described in the corresponding part of the above embodiment, inputting the speech data to be recognized, which contains speech features of different dimensions, into the trained speech-rate classification model yields the predicted probabilities that the data belongs to the different speech-rate categories. Comparing these probabilities with a probability threshold then determines the one or more speech-rate categories to which the data belongs. The threshold value is not limited; it can be determined as the case may be and flexibly configured to the needs of the scenario.
Step S25, determining that the speech-rate recognition result indicates that the speech data to be recognized belongs to a first speech rate, and selecting a target feature layer matching the first speech rate from the plurality of feature layers contained in the wake-up word recognition model.
In this embodiment, the speech-rate recognition result obtained as described above indicates that the speech data to be recognized belongs to a first speech rate, which may be any of the categories listed above, such as fast, normal, or slow. For example, if the predicted probability that the data belongs to the fast category (e.g., 95%) exceeds the probability threshold (e.g., 85%), the data is considered fast-rate, and one or more feature layers matching the fast rate are selected as the target feature layer from the feature layers of the trained wake-up word recognition model.
In this way, the application recognizes the speech rate of the data to be recognized and determines which feature layers (recorded as the target feature layers) among those of the wake-up word recognition model should extract its features, ensuring that the speech feature information it contains is extracted reliably and comprehensively, so that whether it contains a preset wake-up word can subsequently be recognized reliably.
The configuration parameters of the feature layers matched to speech data of different rates are not limited here. The feature layers of the wake-up word recognition model extract features layer by layer from the input speech data, and because configuration parameters such as the feature-mapping region size and stride may differ between layers, the speech feature information each layer extracts often differs. The configuration parameters of the feature layers matched to each speech rate can be determined in the training stage of the model; the training process need not be detailed here.
It can be understood that the number of feature layers matching speech data of different rate categories may differ, while configuration parameters such as the feature-mapping regions of different layers may be the same or different. That is, the target feature layers corresponding to different speech rates differ, and/or their feature-mapping regions differ. In general, the faster the speech data, the fewer the corresponding feature layers and/or the smaller the feature-mapping region of each layer; the mapping between speech rates and target-feature-layer configurations is not limited in this application and may be determined as the case may be.
In some embodiments, the feature extraction network of the wake-up word recognition model may be a convolutional neural network, in which case a feature layer is a convolutional layer. The computer device can determine, from the speech-rate recognition result, the convolutional layers matching the first speech rate as the target convolutional layers; configuration parameters such as the number of convolutional layers they represent, their kernel sizes and strides can be determined in the training stage of the model and are not limited here.
Step S26, inputting the speech data to be recognized into the wake-up word recognition model, where the target feature layer performs feature extraction on it to obtain the target speech feature vector of the speech data to be recognized.
In this embodiment, the wake-up word recognition model may include one feature extraction network suitable for speech data of multiple rate categories; but, as described above, speech data of different rates is matched to different feature layers within that network. That is, after the speech data determined to be of the first speech rate is input into the feature extraction network, the speech feature vector output by the target feature layer is taken as the target speech feature vector, which is evidently not necessarily the vector output by the last feature layer of the network.
In some embodiments, as in the scene diagram of FIG. 6, the feature layers matching each speech rate can be determined in the training stage of the wake-up word recognition model. As shown in FIG. 6, fast-rate speech data uses a reduced number of feature layers, and the resulting feature vector (e.g., the output of an early feature layer of the network) can accurately support wake-up word recognition; slow-rate speech data usually needs more feature layers so that the resulting vector (e.g., the output of a later layer) supports accurate recognition; and normal-rate data, in the middle category, needs the vector output by a layer of moderate depth (e.g., a middle feature layer of the network).
Thus, where the first speech rate is the fast category, the target feature layer may be several layers at the front of the feature extraction network, and the speech feature vector output by the last of these target layers is directly taken as the target speech feature vector of the speech data to be recognized, as output by the feature extraction sub-network labeled ① in FIG. 6.
Similarly, for speech data of the other rate categories, the vectors output by other feature layers serve as the target speech feature vectors. As shown in FIG. 6, normal-rate speech data can be processed by the feature extraction sub-network labeled ②, outputting its target speech feature vector, and slow-rate data by the sub-network labeled ③; the relationship between the sub-networks for the different rates is not limited to that shown in FIG. 6. A sketch of this depth-selection idea follows.
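A sketch of the depth-selection idea, assuming one shared stack of 1-D convolutional feature layers with early exits; the depths (2/4/6) and channel size are assumptions:

```python
import torch.nn as nn

class MultiExitExtractor(nn.Module):
    """Exit after fewer layers for fast speech, more layers for slow speech."""
    EXITS = {"fast": 2, "normal": 4, "slow": 6}

    def __init__(self, dim: int = 256):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU())
            for _ in range(6)
        )

    def forward(self, x, rate: str):            # x: (batch, dim, time)
        for layer in self.layers[: self.EXITS[rate]]:
            x = layer(x)
        return x                                # target speech feature vector(s)
```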
Illustratively, the wake-up word recognition model may have a temporal-convolution selective-kernel network (TC-SKNet) structure. Such a network applies an attention mechanism over convolution kernels of different sizes so that the network selects a suitable kernel by itself; its working principle is not detailed in this application. In this embodiment, the network can select a suitable number of kernels and/or convolutional layers for speech data of different rate categories, so as to obtain speech feature information that supports accurate wake-up word recognition, as in the selective-kernel sketch below.
Combining the above analysis with model training: when speech data to be recognized of the first speech rate is input into the trained TC-SKNet, fast-rate data is convolved with a smaller kernel (recorded as the first size) and the output of a shallow stack of convolutional layers (recorded as the first layers, corresponding to sub-network ① of FIG. 6) is taken as its speech feature vector; normal-rate data can be convolved with a medium kernel (the second size, larger than the first), taking the output of a middle convolutional layer as its speech feature vector; and slow-rate data with a larger kernel (the third size, larger than the second), taking the output of a deeper convolutional layer.
The numerical values of the first, second and third kernel sizes are not limited and may be determined as the case may be; the output paths for the different rate categories differ, but the depth of the layer outputting each speech feature vector (i.e., the last convolutional layer of the corresponding feature extraction sub-network) is not limited. Moreover, the convolution strides for the different rate categories may be the same, e.g., 1, or may differ, e.g., the faster the speech rate, the smaller the stride.
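A simplified 1-D selective-kernel block in the spirit of SKNet; the patent does not publish TC-SKNet code, so kernel sizes (3/5/7), the reduction factor, and the pooling choice here are assumptions:

```python
import torch
import torch.nn as nn

class SKConv1d(nn.Module):
    """Parallel convolutions with different kernel sizes, mixed by a
    per-sample attention weighting over the branches."""
    def __init__(self, channels: int = 256, kernels=(3, 5, 7), reduce: int = 4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernels
        )
        self.attn = nn.Sequential(
            nn.Linear(channels, channels // reduce), nn.ReLU(),
            nn.Linear(channels // reduce, len(kernels)), nn.Softmax(dim=-1),
        )

    def forward(self, x):                       # x: (batch, channels, time)
        feats = torch.stack([b(x) for b in self.branches], dim=1)   # (B, K, C, T)
        pooled = feats.sum(dim=1).mean(dim=-1)                      # (B, C)
        w = self.attn(pooled)                                       # (B, K)
        return (w[:, :, None, None] * feats).sum(dim=1)             # (B, C, T)
```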
In still other embodiments, when selecting the output path for the speech data to be recognized, a guidance weight for each output path (i.e., each feature extraction sub-network described above) can be determined from the speech-rate recognition result. According to the guidance weight, the sub-network used for this recognition, i.e., the one corresponding to the first speech rate, is chosen from the pre-trained sub-networks for the fast, normal and slow rates, and its output is taken as the target speech feature vector of the speech data to be recognized.
Illustratively, if the speech-rate recognition result for the speech data to be recognized is the first speech rate and the guidance weight obtained is 100, the output of the sub-network labeled ① is used; if the weight is 010, the output of sub-network ②; and if 001, the output of sub-network ③. The method of selecting the output path is not limited to this way of obtaining guidance weights, and the weights need not be represented as binary characters.
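A tiny routing sketch for the one-hot guidance weights; the string encoding mirrors the 100/010/001 notation above and is purely illustrative:

```python
def route(outputs: list, guide: str):
    """Select one sub-network output with a one-hot guidance weight,
    e.g. route([out_fast, out_normal, out_slow], "010") -> out_normal."""
    return outputs[guide.index("1")]
```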
Step S27, performing wake-up word recognition on the target speech feature vector to obtain the wake-up word recognition result of the speech data to be recognized.
Continuing the above description: the target speech feature vector output by the feature extraction sub-network labeled ①, ② or ③ in FIG. 6, i.e., the multi-dimensional speech feature information obtained by mapping the data into a hidden feature space through the convolutional layers, is mapped by linear transformation from the feature space to the sample-label space. That is, the target speech feature information is input into a fully connected layer (FC layer) acting as a classifier, here a binary classifier distinguishing the wake-word class from the non-wake-word class, to obtain a class-label prediction for the speech signal to be recognized, e.g., the predicted probability that an object contained in the signal is a preset wake-up word. From this it is determined whether that object is a preset wake-up word for the object to be woken, i.e., whether the speech signal to be recognized contains the preset wake-up word.
The implementation of step S27 is not limited to the FC-layer processing described; the FC output can also be normalized by an activation function in an activation layer, mapping it to a prediction in (0, 1), and the process is not detailed further. A sketch of such a head follows.
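A minimal wake-word classifier head under those assumptions (the feature dimension and the sigmoid choice are illustrative):

```python
import torch
import torch.nn as nn

class WakeWordHead(nn.Module):
    """FC layer over the target speech feature vector, sigmoid into (0, 1)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, feat):                    # feat: (batch, feat_dim)
        return torch.sigmoid(self.fc(feat))     # P(preset wake-up word present)
```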
In summary, in this embodiment, in a scenario where a user wakes any object to be woken by voice, frame-wise feature extraction is performed on the collected speech signal to obtain the speech data to be recognized, which is input into the model; the speech-rate recognition result selects the target feature layer suited to that rate category within the wake-up word recognition model. Compared with extracting features from speech of all rates with one full feature extraction network (e.g., the network corresponding to label ③), selecting a pre-trained, targeted feature extraction sub-network yields a target speech feature vector that represents the speech features of that rate more accurately and relatively completely; accordingly, the wake-up word recognition result for the speech signal of that rate is obtained with high accuracy, improving the wake-up rate of the object to be woken.
Referring to fig. 7, which is a flowchart illustrating a further optional example of the speech recognition method proposed in the present application, an embodiment of the present application may be a further optional detailed implementation method of the speech recognition method described in the foregoing embodiment, which may be executed by a computer device, as shown in fig. 7, and the method may include:
step S31, acquiring a voice signal to be recognized;
step S32, inputting the speech signal to be recognized into the encoder for frame-wise feature extraction to obtain the speech data to be recognized;
step S33, inputting the voice data to be recognized into a speech rate classifier to obtain a speech rate recognition result;
Regarding the implementation of step S31 to step S33, reference may be made to the description of the corresponding parts in the above embodiments, which is not repeated in this embodiment. The encoder may be a feature extraction network such as a deep neural network, configured to extract features from a continuous multi-frame speech signal (e.g., the speech signal to be recognized), abstracting the feature codes of the frames contained in the speech signal to be recognized into high-dimensional abstract speech features that are convenient for subsequent models to analyze further.
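A minimal encoder sketch under these assumptions might look as follows; the 40-dimensional frame features, layer count and widths are all hypothetical choices, not specified by the patent.

```python
import torch
import torch.nn as nn

# Encoder sketch: a small stack of 1-D convolutions over per-frame acoustic
# features (e.g. 40-dim filterbanks), producing the higher-dimensional
# abstract features referred to above as "speech data to be recognized".
encoder = nn.Sequential(
    nn.Conv1d(40, 128, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.Conv1d(128, 128, kernel_size=5, padding=2),
    nn.ReLU(),
)

frames = torch.randn(1, 40, 100)   # (batch, frame_feature_dim, num_frames)
speech_data = encoder(frames)      # (1, 128, 100) high-dimensional abstract features
```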
In some embodiments, in combination with the above analysis, the speech rate recognition model may be a speech rate classifier, such as one obtained by neural network training. In line with the above description of the wake-up word classifier, the speech rate classifier may consist of a fully connected layer and an activation layer: the fully connected layer performs classification prediction on the feature vector output by the encoding network (i.e., the speech data to be recognized) to obtain speech rate class label predictions for the speech signal to be recognized, and an activation function such as a sigmoid maps these to prediction probabilities in (0,1), yielding the predicted speech rate category of the speech signal to be recognized. It is understood that the network parameters of the encoder and the speech rate classifier may be trained together iteratively; the training process is not described in detail in this application.
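A hedged sketch of such a classifier head is given below, assuming mean pooling over frames and per-class sigmoid outputs; the dimensions and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class SpeechRateClassifier(nn.Module):
    """FC layer + activation over the pooled encoder output. Per-class
    sigmoids keep each probability in (0, 1) independently, which is why two
    rate classes can both score high, as in the 68% / 76% example below."""

    def __init__(self, feat_dim=128, num_rates=3):   # fast / normal / slow
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_rates)

    def forward(self, speech_data):                  # (batch, feat_dim, frames)
        pooled = speech_data.mean(dim=-1)            # utterance-level feature vector
        return torch.sigmoid(self.fc(pooled))        # per-class probabilities

rate_probs = SpeechRateClassifier()(torch.randn(1, 128, 100))
```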
Step S34, when the speech rate recognition result indicates that the speech data to be recognized has a first probability of belonging to a first speech rate and a second probability of belonging to a second speech rate, obtaining a first feature extraction network matched with the first speech rate and a second feature extraction network matched with the second speech rate;
in the embodiment of the present application, unlike the above embodiment in which feature extraction for speech data of different speech rates is realized by a single TC-SKnet with switchable model parameters, a plurality of feature extraction networks are pre-trained, each realizing feature extraction for speech signals of a different speech rate.
Referring to the flowchart shown in fig. 8, taking as an example that the plurality of feature extraction networks are a plurality of Temporal Convolutional Networks (TCNs), in order to suit feature extraction for the three speech rate categories of fast, normal and slow speech, the corresponding TCNs may be configured by thresholds at design and training time. For example, a TCN whose temporal convolution receptive field is less than 0.6 s may be configured as the fast-rate TCN; a TCN whose receptive field lies within 0.6 s to 1.2 s as the normal-rate TCN; and a TCN whose receptive field is greater than 1.2 s as the slow-rate TCN, the threshold sizes not being limited to these values.

Thus, the faster the speech rate of the speech signal, the smaller the temporal convolution receptive field of the corresponding TCN, and the size of this receptive field can be determined by network parameters such as the number of convolution layers of the TCN and the convolution kernel size.
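The relationship between layer count, kernel size and receptive field can be made concrete with a small calculation, assuming a 10 ms frame shift and dilations that double per layer; the concrete layer counts below are illustrative, since the patent fixes only the per-class field thresholds.

```python
# Back-of-the-envelope receptive field calculation for a dilated TCN with a
# 10 ms frame shift and dilations doubling per layer (the usual TCN scheme).
def receptive_field_seconds(num_layers, kernel_size, frame_shift_s=0.01):
    frames = 1 + sum((kernel_size - 1) * 2 ** i for i in range(num_layers))
    return frames * frame_shift_s

print(receptive_field_seconds(4, 3))  # 0.31 s -> would fall in the fast class
print(receptive_field_seconds(5, 3))  # 0.63 s -> just past the 0.6 s boundary
print(receptive_field_seconds(6, 3))  # 1.27 s -> would fall in the slow class
```

Note that under these assumptions a five-layer TCN with kernel size 3 yields exactly the 0.63 s field used in the boundary example discussed next.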
In combination with the above description of the speech rate recognition process, in some application scenarios the prediction probabilities obtained for the different speech rate categories of a speech signal may contain two similar values. For example, after speech rate recognition of a speech signal to be recognized whose duration corresponds to a temporal convolution receptive field of 0.63 s, right at the fast/normal boundary, the speech rate recognition result may give a first probability of 68% that the speech rate is fast (denoted the first speech rate), a second probability of 76% that it is normal (denoted the second speech rate), and a third probability of 0% that it is slow (denoted the third speech rate); because each class probability is produced by its own sigmoid, the probabilities need not sum to one. It should be noted that the values of the first probability, the second probability and the third probability may differ in different scenarios, which the present application does not limit.
In the above exemplary scenario, if the speech recognition method of fig. 6 were adopted, the output of the feature extraction sub-network corresponding to either the fast or the normal speech rate alone would be selected as the target speech feature vector, and the accuracy of the resulting wake-up word recognition would be poor. To improve on this, with the method provided in this embodiment it is preliminarily determined, based on the speech rate recognition result exemplified above, that the speech signal to be recognized may belong to either the first speech rate or the second speech rate. To obtain a more accurate speech feature vector of the speech signal to be recognized, the application may process the speech data to be recognized with the feature extraction networks corresponding to both speech rates, and then determine the speech feature vectors output by the two networks together with their weights within the target speech feature vector of the speech data to be recognized, that is, the influence of the features predicted for the multiple speech rates on wake-up word recognition; the implementation is not limited by the present application.
For convenience of description, the feature extraction network pre-trained to match the first speech rate is recorded as the first feature extraction network, such as the fast-rate TCN shown in fig. 8, and the feature extraction network matching the second speech rate is recorded as the second feature extraction network, such as the normal-rate TCN shown in fig. 8. The feature extraction networks corresponding to different speech rates are several relatively independent TCNs; as analyzed above for fig. 8, the number of convolution layers and/or the convolution kernel sizes of the three TCNs may differ, and the network parameters of each TCN may be determined from training results, which the present application does not limit.
Step S35, respectively inputting the voice data to be recognized into a first feature extraction network and a second feature extraction network, and outputting corresponding first voice feature vectors and second voice feature vectors;
step S36, fusing the first voice feature vector and the second voice feature vector to obtain a fused voice feature vector;
step S37, carrying out speech speed classification processing on the fused speech feature vector to obtain a first weight vector of the first speech feature vector and a second weight vector of the second speech feature vector;
as described above, when it cannot be accurately determined which single speech rate class the speech signal to be recognized belongs to, for each predicted speech rate class the corresponding pre-trained feature extraction network is used to extract features from the speech signal to be recognized, yielding a speech feature vector of the speech signal under each predicted speech rate class. As shown in fig. 8, TCNs composed of different numbers of convolution layers with different convolution kernels can be used to extract features from the input speech data to be recognized; the implementation process is not described in detail in this application.
For the different speech feature vectors output by the TCNs of different speech rates for the same speech data to be recognized, the present application may adopt concatenation-based fusion, merging the different speech feature vectors into one feature vector recorded as the fused speech feature vector. It can be understood that, with respect to the speech data to be recognized, the fused speech feature vector better highlights the acoustic features corresponding to both the first speech rate and the second speech rate; therefore the present application continues with speech rate classification processing on the fused speech feature vector to obtain the first weight vector of the first speech feature vector and the second weight vector of the second speech feature vector. The implementation of obtaining the first weight vector and the second weight vector is not limited in the present application and includes, but is not limited to, the manner described in this embodiment.
As can be seen from the above analysis, the influence of the first-speech-rate features represented by the first speech feature vector on the speech rate prediction result of the speech data to be recognized can be analyzed from the fused speech feature vector, which in turn characterizes the influence of those features on the subsequent wake-up word recognition result; likewise, the influence of the second-speech-rate features represented by the second speech feature vector can be analyzed and characterized in the same way.
In a possible implementation, the speech feature vectors output by the feature extraction networks corresponding to the predicted speech rate categories may be input into a convolution layer with a 1 × 1 × 1 kernel, which compresses and then re-expands the input speech feature vectors to obtain a weight vector for each of them; the process is not described in detail in this application. For convenience of subsequent processing, each feature dimension may be normalized, for example with a softmax activation function, mapping the feature weight values of the corresponding dimension into (0,1) to obtain the weight vector of the corresponding speech feature vector.
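A sketch of this squeeze-and-expand weight computation for two branches follows; the class name, the reduction width and the feature dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BranchWeights(nn.Module):
    """Squeeze the fused feature vector with a 1x1 convolution, re-expand it,
    and apply a softmax across the branches so each feature dimension gets
    branch weights in (0, 1) that sum to 1."""

    def __init__(self, feat_dim=128, reduced=32, num_branches=2):
        super().__init__()
        self.squeeze = nn.Conv1d(num_branches * feat_dim, reduced, kernel_size=1)
        self.expand = nn.Conv1d(reduced, num_branches * feat_dim, kernel_size=1)
        self.num_branches, self.feat_dim = num_branches, feat_dim

    def forward(self, fused):   # fused: (batch, num_branches*feat_dim, frames)
        scores = self.expand(torch.relu(self.squeeze(fused)))
        scores = scores.view(-1, self.num_branches, self.feat_dim, fused.shape[-1])
        return torch.softmax(scores, dim=1)   # weights over the branches

weights = BranchWeights()(torch.randn(1, 256, 100))
w_fast, w_normal = weights[:, 0], weights[:, 1]   # first and second weight vectors
```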
Step S38, carrying out weighted fusion processing on the first voice feature vector, the first weight vector, the second voice feature vector and the second weight vector to obtain a target voice feature vector of the voice data to be recognized;
and step S39, performing wake-up word recognition on the target speech feature vector to obtain a wake-up word recognition result of the speech data to be recognized.
After the weight vectors of the speech feature vectors corresponding to the different predicted speech rate categories are obtained as above, each weight vector can be multiplied with its corresponding speech feature vector, and the resulting same-dimension feature vectors are then fused to obtain the target speech feature vector of the speech data to be recognized. It can be understood that the target speech feature vector highlights both the first-speech-rate features and the second-speech-rate features of the speech data to be recognized, i.e., the feature information of the speech rate categories it is predicted to belong to; performing wake-up word recognition directly on this target speech feature vector subsequently allows a more reliable and accurate determination of whether the speech signal to be recognized contains the preset wake-up word, improving the wake-up rate.
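The weighted fusion itself reduces to an element-wise multiply-and-sum, sketched below with placeholder tensors standing in for the branch outputs and weights.

```python
import torch

# Placeholders standing in for the outputs of the fast-rate and normal-rate
# TCNs and for the weight vectors from the previous step.
v_fast = torch.randn(1, 128, 100)     # first speech feature vector
v_normal = torch.randn(1, 128, 100)   # second speech feature vector
w = torch.softmax(torch.randn(1, 2, 128, 100), dim=1)

# Product of each weight vector with its feature vector, then element-wise
# fusion of the same-dimension results into the target speech feature vector.
target = w[:, 0] * v_fast + w[:, 1] * v_normal    # (1, 128, 100)
```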
For example, in combination with the flow diagram shown in fig. 8: if speech rate recognition of a speech signal output by a user shows that it may be of fast or normal speech rate, the guidance weight obtained from the speech rate recognition result may be 110. According to this guidance weight, the computer device may input the speech data to be recognized into the fast-rate TCN and the normal-rate TCN respectively to obtain the fast-rate feature vector and the normal-rate feature vector of the speech data to be recognized, and then obtain the respective weight vectors of the fast-rate and normal-rate features, namely W_fast and W_normal, by the fusion processing described above. After weighted fusion, the target speech feature vector of the speech data to be recognized is obtained and input into the wake-up word classifier, which determines whether each object contained in the speech data to be recognized belongs to the preset wake-up word together with the corresponding prediction probability, from which the wake-up word recognition result of the speech data to be recognized is obtained.
Therefore, compared with performing wake-up word recognition directly on the speech feature vector output by a single TCN, this embodiment performs weighted fusion of the predicted speech feature vectors output by the TCNs of multiple speech rate categories, highlighting the features of those categories in the target speech feature vector, so that the weighted-fused target speech feature vector represents the speech features of the speech data to be recognized more accurately and the wake-up word recognition accuracy is improved.
It should be understood that, with the network structure of the wake-up word recognition model shown in fig. 8, when the speech rate classifier outputs that the speech data to be recognized belongs to a single speech rate (which may be any one of the pre-classified speech rates), so that the obtained guidance weight for the wake-up word recognition model is 100, 010 or 001, the pre-trained TCN corresponding to the position of the "1" may be selected to extract features from the speech data to be recognized. In this case, the obtained speech feature vector at that speech rate is used as the target speech feature vector and input directly into the wake-up word classifier for prediction to obtain the wake-up word recognition result.
Therefore, when the speech data to be recognized belongs to a single definite speech rate, its wake-up word recognition can be realized by, but is not limited to, a wake-up word recognition model with the structure of fig. 6 or fig. 8. When the speech rate recognition result indicates that the speech data to be recognized may belong to multiple speech rates, usually two adjacent ones, the wake-up word recognition result may be obtained by, but is not limited to, the wake-up word recognition model shown in fig. 8.
In combination with the speech recognition method described in the foregoing embodiments, referring to the flowchart shown in fig. 9, in a scenario where a user wakes up a terminal by voice: after acquiring a speech signal output by the user that may contain a wake-up word, the computer device may input it into the encoder to extract high-dimensional abstract features and obtain the speech data to be recognized, input the speech data to be recognized into the speech rate classifier, obtain the guidance weight for the wake-up word recognition model from the resulting speech rate recognition result, and determine the target model parameters matching that result, for example by selecting pre-trained model parameters such as convolution kernel size and number of convolution layers suited to the recognized speech rate. Wake-up word recognition is then performed on the speech data to be recognized with the wake-up word recognition model using the target model parameters, improving the accuracy of the wake-up word recognition result and thereby the device wake-up rate.
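Tying the illustrative pieces together, an end-to-end inference sketch might read as follows; the module names reuse the hypothetical classes sketched earlier, the 0.5 threshold on class probabilities is an assumption, and at least one speech rate class is assumed to exceed it.

```python
import torch

def recognize_wake_word(signal, encoder, rate_clf, tcns, weight_net, head,
                        threshold=0.5):
    """End-to-end inference sketch for the fig. 9 style pipeline."""
    speech_data = encoder(signal)              # frame-wise abstract features
    rate_probs = rate_clf(speech_data)[0]      # e.g. tensor([0.68, 0.76, 0.00])
    active = (rate_probs > threshold).nonzero().flatten()  # candidate rates
    feats = [tcns[int(i)](speech_data) for i in active]
    if len(feats) == 1:                        # guidance weight 100 / 010 / 001
        target = feats[0]
    else:                                      # guidance weight such as 110
        fused = torch.cat(feats, dim=1)
        w = weight_net(fused)                  # branch weight vectors
        target = sum(w[:, k] * feats[k] for k in range(len(feats)))
    return head(target.mean(dim=-1))           # wake-word probability
```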
For the models involved in the speech recognition methods of the above embodiments, this application does not detail the model training process. For example, for the wake-up word recognition model shown in fig. 6, training speech data of different speech rates can be obtained and input into the initial TC-SKnet network; in combination with the working principle of the network, model parameters such as convolution kernel sizes and numbers of convolution layers suited to speech data of the different rates are obtained by training, and a mapping between the speech rate categories and the trained model parameters is constructed accordingly. In practical application, after the speech rate recognition result of the speech data to be recognized is obtained, this mapping determines which convolution kernel sizes and which convolution layers constitute the feature extraction network when the wake-up word recognition model processes the speech data to be recognized; for the processing procedure, reference may be made to the corresponding parts of the above embodiments, which are not repeated here.
Similarly, for the wake-up word recognition model shown in fig. 8, training speech data of different speech rates are input into the initial wake-up word recognition network for iterative training until a training termination condition is met, such as reaching a preset number of iterations, the wake-up word recognition loss becoming stable, or the loss value falling below a loss threshold; the finally trained model is recorded as the wake-up word recognition model. It can be understood that during training of the wake-up word recognition model, the network parameters in the TCNs can be continuously adjusted, feature fusion processing is realized, and the various model parameters are obtained, such as the parameters of the weight-vector processing network and the network parameters of the wake-up word classifier. As needed, the parameters of models such as the speech rate classifier and the encoder can be trained jointly, so that the accuracy of speech rate recognition and the reliability and accuracy of wake-up word recognition for speech data of different speech rates are improved together.
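A minimal single-pass training-loop sketch showing the two termination conditions named above follows; the optimizer, learning rate, loss function and data interface are all assumptions, and a full recipe would wrap this in an outer epoch loop.

```python
import torch

def train_wake_word_model(model, loader, max_iters=10000, loss_threshold=0.01):
    """Iterate until a preset iteration count is reached or the loss falls
    below a threshold, the two termination conditions mentioned above."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.BCELoss()               # wake word vs. non wake word
    step = 0
    for speech, label in loader:               # label: one-hot float targets
        probs = model(speech)                  # probabilities in (0, 1)
        loss = loss_fn(probs, label)
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
        if step >= max_iters or loss.item() < loss_threshold:
            break
    return model
```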
Referring to fig. 10, a schematic structural diagram of an alternative example of the speech recognition apparatus proposed in the present application may include:
a voice data obtaining module 31, configured to obtain voice data to be recognized;
optionally, the voice data obtaining module 31 may include:
the voice signal acquisition unit is used for acquiring a voice signal to be recognized;
the framing feature extraction unit is used for performing framing feature extraction on the voice signal to be recognized to obtain a corresponding voice frame feature vector;
and the voice data composing unit, configured to form the voice data to be recognized from a plurality of the voice frame feature vectors.
A speech rate recognition module 32, configured to perform speech rate recognition on the speech data to be recognized, so as to obtain a speech rate recognition result;
a target model parameter obtaining module 33, configured to obtain a target model parameter in the wake word recognition model, where the target model parameter is matched with the speech rate recognition result;
and the awakening word recognition module 34 is configured to input the voice data to be recognized into an awakening word recognition model using the target model parameter, and output an awakening word recognition result of the voice signal to be recognized.
In some embodiments, as shown in fig. 11, the target model parameter obtaining module 33 may include:
and the target feature layer selection unit 331 is configured to select a target feature layer matched with the first speech rate from a plurality of feature layers included in the awakening word recognition model when the speech rate recognition result indicates that the speech data to be recognized belongs to the first speech rate.
The target feature layers corresponding to different speech rates are different, and/or the feature mapping areas of different target feature layers are different.
Based on this, the above-mentioned wake word recognition module 34 may include:
a first feature extraction unit 341, configured to input the speech data to be recognized into the awakening word recognition model, and perform feature extraction on the speech data to be recognized by the target feature layer to obtain a target speech feature vector of the speech data to be recognized;
the first awakening word recognition unit 342 is configured to perform awakening word recognition on the target voice feature vector to obtain an awakening word recognition result of the to-be-recognized voice data.
In still other embodiments, as shown in fig. 12, the target model parameter obtaining module 33 may include:
a feature extraction network determining unit 332, configured to determine, when a speech rate recognition result indicates that the speech data to be recognized has a first probability of belonging to a first speech rate and a second probability of belonging to a second speech rate, a first feature extraction network matching the first speech rate and a second feature extraction network matching the second speech rate; the number of feature layers and/or the feature mapping areas contained in different feature extraction networks are different;
accordingly, the wake word recognition module 34 may include:
a second feature extraction unit 343, configured to input the to-be-recognized speech data into the first feature extraction network and the second feature extraction network, respectively, and output corresponding first speech feature vectors and second speech feature vectors;
a weight vector obtaining unit 344, configured to obtain a first weight vector of the first speech feature vector and a second weight vector of the second speech feature vector;
and a second awakening word recognition unit 345, configured to process the first voice feature vector and the second voice feature vector according to the first weight vector and the second weight vector, so as to obtain an awakening word recognition result of the voice data to be recognized.
Optionally, the weight vector obtaining unit 344 may include:
the feature fusion unit is used for fusing the first voice feature vector and the second voice feature vector to obtain a fused voice feature vector;
and the weight obtaining unit is used for carrying out speech speed classification processing on the fusion speech feature vector to obtain a first weight vector of the first speech feature vector and a second weight vector of the second speech feature vector.
Optionally, the second wake word recognition unit 345 may include:
the weighted fusion unit is used for carrying out weighted fusion processing on the first voice feature vector, the first weight vector, the second voice feature vector and the second weight vector to obtain a target voice feature vector of the voice data to be recognized;
and the awakening word classification unit is used for performing awakening word recognition on the target voice feature vector to obtain an awakening word recognition result of the voice data to be recognized.
For the solutions described in the foregoing embodiments, speech data with a faster speech rate corresponds to a smaller number of feature layers (e.g., convolution layers of a convolutional neural network) and/or a smaller feature mapping region (e.g., convolution kernel of a convolutional neural network) in each feature layer; that is, the faster the speech rate, the smaller the corresponding feature receptive field. The mapping between speech rate category, number of feature layers and feature mapping region size may be trained with training speech data of different speech rates, which is not limited in this application.
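As a purely hypothetical illustration of such a mapping (the concrete values would be produced by training, per the text above):

```python
# Hypothetical mapping from speech rate category to feature layer count and
# feature mapping region (kernel) size; faster speech gets the smaller
# receptive field. All numbers are illustrative assumptions.
RATE_TO_PARAMS = {
    "fast":   {"num_layers": 4, "kernel_size": 3},
    "normal": {"num_layers": 5, "kernel_size": 3},
    "slow":   {"num_layers": 6, "kernel_size": 5},
}
```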
It should be noted that, various modules, units, and the like in the embodiments of the foregoing apparatuses may be stored in the memory as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions, and for the functions implemented by the program modules and their combinations and the achieved technical effects, reference may be made to the description of corresponding parts in the embodiments of the foregoing methods, which is not described in detail in this embodiment.
The present application also provides a computer-readable storage medium on which a computer program may be stored, which may be called and loaded by a processor to implement the steps of the speech recognition method described in the above embodiments.
Finally, it should be noted that, with respect to the above embodiments, unless the context clearly dictates otherwise, the words "a", "an" and/or "the" do not denote a singular number, but may include a plurality. In general, the terms "comprises" and "comprising" merely indicate that steps and elements are included which are explicitly identified, that the steps and elements do not form an exclusive list, and that a method or apparatus may include other steps or elements. An element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In the description of the embodiments herein, "/" means "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more than two.
Reference herein to terms such as "first," "second," or the like, is used for descriptive purposes only and to distinguish one operation, element, or module from another operation, element, or module without necessarily requiring or implying any actual such relationship or order between such elements, operations, or modules. And are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated, whereby a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features.
The embodiments in the present description are described in a progressive or parallel manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device, the computer device and the system disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring voice data to be recognized;
carrying out speech speed recognition on the speech data to be recognized to obtain a speech speed recognition result;
acquiring target model parameters matched with the speech speed recognition result in the awakening word recognition model; the awakening word recognition model has model parameters aiming at different speech speed voice data;
and inputting the voice data to be recognized into a wake-up word recognition model adopting the target model parameters, and outputting a wake-up word recognition result of the voice signal to be recognized.
2. The method according to claim 1, wherein the obtaining target model parameters in the wake word recognition model that match the speech rate recognition result comprises:
if the speech speed recognition result shows that the speech data to be recognized belongs to a first speech speed, selecting a target characteristic layer matched with the first speech speed from a plurality of characteristic layers contained in an awakening word recognition model; the target feature layers corresponding to different speech rates are different, and/or the feature mapping areas of different target feature layers are different.
3. The method according to claim 1, wherein the inputting the voice data to be recognized into a wake-up word recognition model using the target model parameters and outputting a wake-up word recognition result of the voice signal to be recognized comprises:
inputting the voice data to be recognized into the awakening word recognition model, and performing feature extraction on the voice data to be recognized by the target feature layer to obtain a target voice feature vector of the voice data to be recognized;
and performing awakening word recognition on the target voice characteristic vector to obtain an awakening word recognition result of the voice data to be recognized.
4. The method according to claim 1, wherein the obtaining target model parameters in the awakening word recognition model matching with the speech rate recognition result, inputting the voice data to be recognized into the awakening word recognition model using the target model parameters, and outputting the awakening word recognition result of the voice signal to be recognized comprises:
if the speech speed recognition result shows that the speech data to be recognized has a first probability of belonging to a first speech speed and a second probability of belonging to a second speech speed, determining a first feature extraction network matched with the first speech speed and a second feature extraction network matched with the second speech speed; the number of feature layers and/or the feature mapping areas contained in different feature extraction networks are different;
inputting the voice data to be recognized into the first feature extraction network and the second feature extraction network respectively, and outputting corresponding first voice feature vectors and second voice feature vectors;
acquiring a first weight vector of the first voice feature vector and a second weight vector of the second voice feature vector;
and processing the first voice characteristic vector and the second voice characteristic vector according to the first weight vector and the second weight vector to obtain a wake-up word recognition result of the voice data to be recognized.
5. The method of claim 4, the obtaining a first weight vector of the first speech feature vector and a second weight vector of the second speech feature vector, comprising:
fusing the first voice feature vector and the second voice feature vector to obtain a fused voice feature vector;
and carrying out speech speed classification processing on the fusion speech feature vector to obtain a first weight vector of the first speech feature vector and a second weight vector of the second speech feature vector.
6. The method according to claim 4, wherein the processing the first speech feature vector and the second speech feature vector according to the first weight vector and the second weight vector to obtain the awakening word recognition result of the speech data to be recognized comprises:
carrying out weighted fusion processing on the first voice feature vector, the first weight vector, the second voice feature vector and the second weight vector to obtain a target voice feature vector of the voice data to be recognized;
and performing awakening word recognition on the target voice feature vector to obtain an awakening word recognition result of the voice data to be recognized.
7. The method according to any one of claims 2 to 6, wherein the faster the speech speed, the fewer the number of feature layers corresponding to the speech data, and/or the smaller the feature mapping area size of the feature layer.
8. The method according to any one of claims 1 to 6, wherein the acquiring voice data to be recognized comprises:
acquiring a voice signal to be recognized;
performing frame feature extraction on the voice signal to be recognized to obtain a corresponding voice frame feature vector;
and forming the voice data to be recognized by a plurality of the voice frame feature vectors.
9. The method according to any one of claims 1 to 6, wherein the performing speech rate recognition on the speech data to be recognized to obtain a speech rate recognition result includes:
inputting the voice data to be recognized into a speech rate classification model, and outputting a speech rate recognition result;
and the speech rate recognition result comprises the prediction probability that the speech data to be recognized belong to different speech rate categories.
10. A speech recognition apparatus, the apparatus comprising:
the voice data acquisition module is used for acquiring voice data to be recognized;
the speech speed recognition module is used for carrying out speech speed recognition on the speech data to be recognized to obtain a speech speed recognition result;
the target model parameter acquisition module is used for acquiring target model parameters matched with the speech rate recognition result in the awakening word recognition model;
and the awakening word recognition module is used for inputting the voice data to be recognized into the awakening word recognition model adopting the target model parameters and outputting the awakening word recognition result of the voice signal to be recognized.
CN202111128230.2A 2021-09-26 2021-09-26 Speech recognition method and device Active CN113782014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111128230.2A CN113782014B (en) 2021-09-26 2021-09-26 Speech recognition method and device

Publications (2)

Publication Number Publication Date
CN113782014A true CN113782014A (en) 2021-12-10
CN113782014B CN113782014B (en) 2024-03-26

Family

ID=78853522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111128230.2A Active CN113782014B (en) 2021-09-26 2021-09-26 Speech recognition method and device

Country Status (1)

Country Link
CN (1) CN113782014B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010088054A (en) * 2000-03-10 2001-09-26 윤종용 Sound recognition apparatus and method applying weight by state
CN102013253A (en) * 2009-09-07 2011-04-13 株式会社东芝 Speech recognition method based on speed difference of voice unit and system thereof
JP2015082087A (en) * 2013-10-24 2015-04-27 富士通株式会社 Information processing device, program, and method
CN109961787A (en) * 2019-02-20 2019-07-02 北京小米移动软件有限公司 Determine the method and device of acquisition end time
CN110211574A (en) * 2019-06-03 2019-09-06 哈尔滨工业大学 Speech recognition modeling method for building up based on bottleneck characteristic and multiple dimensioned bull attention mechanism
CN110503944A (en) * 2019-08-29 2019-11-26 苏州思必驰信息科技有限公司 The training of voice wake-up model and application method and device
CN112037768A (en) * 2019-05-14 2020-12-04 北京三星通信技术研究有限公司 Voice translation method and device, electronic equipment and computer readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115223553A (en) * 2022-03-11 2022-10-21 广州汽车集团股份有限公司 Voice recognition method and driving assistance system
CN115223553B (en) * 2022-03-11 2023-11-17 广州汽车集团股份有限公司 Speech recognition method and driving assistance system

Also Published As

Publication number Publication date
CN113782014B (en) 2024-03-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant