CN109903750B - Voice recognition method and device


Info

Publication number
CN109903750B
Authority
CN
China
Prior art keywords
representation
voice
target
result
memory
Prior art date
Legal status
Active
Application number
CN201910130555.0A
Other languages
Chinese (zh)
Other versions
CN109903750A (en)
Inventor
潘嘉
魏思
王智国
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201910130555.0A
Publication of CN109903750A
Application granted
Publication of CN109903750B


Landscapes

  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a voice recognition method and a voice recognition device. In the method, after a target voice to be recognized is obtained, representation information matched with the target voice is obtained from a pre-constructed memory in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored, and the target voice is then recognized according to the representation information obtained from the memory. Because the memory stores a large number of sample speaker representation results and/or sample speaking environment representation results, representation information matched with the speaker and/or the speaking environment of the target voice can be obtained from the memory to enrich the recognition basis for the target voice, so that the effect and efficiency of voice recognition can be improved when online personalized voice recognition is performed on the target voice.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus.
Background
With continuous breakthroughs in artificial intelligence technology and the increasing popularity of intelligent terminal devices, human-computer interaction occurs more and more frequently in people's daily work and life. As one of the most convenient and fast interaction modes, speech recognition is an important link in human-computer interaction. As the number of voice users grows, the differences in pronunciation habits among users become increasingly obvious; under these circumstances, the traditional approach of performing speech recognition with a single unified speech recognition model cannot achieve good recognition accuracy for all users.
Therefore, how to construct a personalized speech recognition model for each user according to that user's pronunciation habits has become an important research direction in the field of speech recognition. Most existing personalized speech recognition methods construct a personalized speech recognition model for a user based on a large amount of the user's historical speech data; this approach is called offline personalization. For new users, offline personalization cannot be realized due to the lack of historical data; for existing users, the difference between the user's current session and the user's historical data often causes the recognition performance of the personalized model to degrade rather than improve.
Another personalization method is to perform personalized recognition in real time using the user's current session data, which is called online personalization. However, because the available data are limited to the user's current session, the amount of user data is small and it is difficult to construct a personalized recognition model for the user in real time; how to ensure the effect and efficiency of online personalized recognition is therefore a technical problem to be solved.
Disclosure of Invention
The embodiment of the present application mainly aims to provide a voice recognition method and device, which can improve the effect and efficiency of voice recognition when performing online personalized voice recognition.
The embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring target voice to be recognized;
acquiring the representation information matched with the target voice from a pre-constructed memory, wherein a large number of sample speaker representation results and/or sample speaking environment representation results are stored in the memory;
and recognizing the target voice according to the representation information.
Optionally, the obtaining, from a pre-built memory, the representation information matched with the target speech includes:
splitting the target voice to obtain each unit voice;
and acquiring the representation information matched with the unit voice from the memory according to the acoustic characteristics of the unit voice.
Optionally, the obtaining, from the memory, the representation information matched with the unit voice according to the acoustic feature of the unit voice includes:
taking the acoustic characteristics of the unit voice as the input of a voice recognition model, and enabling each network layer of a recognition network of the voice recognition model to sequentially output the initial representation result of the unit voice;
and acquiring the representation information matched with the initial representation result from the memory.
Optionally, the enabling each network layer of the recognition network of the speech recognition model to sequentially output the initial representation result of the unit speech includes:
using each network layer of the recognition network of the speech recognition model as a current layer in sequence, adjusting an initial representation result of the current layer by using a control parameter to obtain a target representation result of the unit speech corresponding to the current layer, wherein the control parameter is used for enabling the target representation result to approach an actual representation result of the unit speech;
and taking the target representation result as the input of the next layer of the current layer to obtain the initial representation result output by the next layer.
Optionally, the control parameter is further used for suppressing peripheral noise of the unit voice.
Optionally, the control parameter is generated according to the representation information obtained from the memory and matched with the initial representation result output by the current layer.
Optionally, the obtaining, from the memory, the representation information matched with the initial representation result includes:
generating a target speaker representation result according to the correlation degree between the initial representation result and each sample speaker representation result in the memory;
and/or generating a target speaking environment representation result according to the correlation degree between the initial representation result and each sample speaking environment representation result in the memory.
Optionally, the recognizing the target voice according to the representation information includes:
acquiring target representation results of each unit voice corresponding to the last layer in the recognition network;
and recognizing the target voice according to the acquired target representation results of the unit voices.
An embodiment of the present application further provides a speech recognition apparatus, including:
the target voice acquiring unit is used for acquiring target voice to be recognized;
a representation information acquiring unit, configured to acquire representation information matched with the target speech from a pre-constructed memory in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored;
and the target voice recognition unit is used for recognizing the target voice according to the representation information.
Optionally, the representation information obtaining unit includes:
the unit voice obtaining subunit is used for splitting the target voice to obtain each unit voice;
and the representing information acquiring subunit is used for acquiring representing information matched with the unit voice from the memory according to the acoustic characteristics of the unit voice.
Optionally, the representation information obtaining subunit includes:
a first initial result obtaining subunit, configured to use the acoustic features of the unit voice as input of a voice recognition model, so that each network layer of a recognition network of the voice recognition model sequentially outputs an initial representation result of the unit voice;
and the first representation information acquisition subunit is used for acquiring the representation information matched with the initial representation result from the memory.
Optionally, the first initial result obtaining subunit includes:
a first target result obtaining subunit, configured to enable each network layer of the recognition network of the speech recognition model to serve as a current layer in sequence, adjust an initial representation result of the current layer by using a control parameter, and obtain a target representation result of the unit speech corresponding to the current layer, where the control parameter is used to enable the target representation result to approach an actual representation result of the unit speech;
and the second initial result obtaining subunit is configured to use the target representation result as an input of a next layer of the current layer to obtain an initial representation result output by the next layer.
Optionally, the control parameter is further used for suppressing peripheral noise of the unit voice.
Optionally, the control parameter is generated according to the representation information obtained from the memory and matched with the initial representation result output by the current layer.
Optionally, the first representing information acquiring subunit is specifically configured to:
generating a target speaker representation result according to the correlation degree between the initial representation result and each sample speaker representation result in the memory;
and/or generating a target speaking environment representation result according to the correlation degree between the initial representation result and each sample speaking environment representation result in the memory.
Optionally, the target speech recognition unit includes:
a second target result obtaining subunit, configured to obtain a target representation result of each unit voice corresponding to the last layer in the recognition network;
and the target voice recognition subunit is used for recognizing the target voice according to the acquired target representation results of the unit voices.
An embodiment of the present application further provides a speech recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform any one implementation of the above-described speech recognition method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation manner of the voice recognition method.
The embodiment of the present application further provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the above speech recognition method.
According to the voice recognition method and device provided by the embodiments of the application, after the target voice to be recognized is obtained, the representation information matched with the target voice is obtained from the pre-constructed memory, in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored, and the target voice can then be recognized according to the representation information obtained from the memory. Because the memory stores a large number of sample speaker representation results and/or sample speaking environment representation results, representation information matched with the speaker and/or the speaking environment of the target voice can be obtained from the memory to enrich the recognition basis for the target voice, so that the effect and efficiency of voice recognition can be improved when online personalized voice recognition is performed on the target voice.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a process of obtaining representation information matched with a target voice from a pre-constructed memory according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of a process of obtaining, from the memory and according to the acoustic features of a unit voice, representation information matched with the unit voice according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a speech recognition model provided in an embodiment of the present application;
fig. 5 is a schematic flowchart illustrating that each network layer of the recognition network of the speech recognition model sequentially outputs an initial representation result of a unit speech according to the embodiment of the present application;
FIG. 6 is a flowchart illustrating a process of recognizing a target speech according to representation information according to an embodiment of the present application;
fig. 7 is a schematic composition diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
Existing personalized speech recognition methods can generally be divided into two types: offline personalized recognition and online personalized recognition. In offline personalized recognition, a personalized speech recognition model for a user is constructed based on a large amount of the user's historical speech data, and this model is then used to perform personalized recognition of the user's speech. However, for a new user the offline method cannot be applied, because the historical speech data needed to construct a personalized speech recognition model for that user is lacking; and even for an existing user, there may be a certain difference between the user's current speech and historical speech, so if a personalized speech recognition model constructed from the historical speech data is still used to recognize the current speech, the recognition effect may deteriorate.
The online personalized recognition method refers to the real-time personalized voice recognition of the current conversation of the user by utilizing the voice data in the conversation. In the identification process, firstly, receiving voice data in the current conversation of a user, and extracting acoustic features of the voice data; then, extracting a speaker representation result corresponding to each frame of voice data; then, calculating the neural network output corresponding to each frame of voice data; further, a recognition result can be obtained, and speech recognition can be completed.
Specifically, when extracting the acoustic features of the voice data in the current session of the user, firstly, the voice data needs to be framed to obtain a corresponding voice frame sequence, and then, the acoustic features of each voice frame are extracted, where the acoustic features refer to feature data used for representing acoustic information of the corresponding voice frame, and for example, the feature data may be Mel-scale Frequency Cepstral Coefficients (MFCC) features or Perceptual Linear Prediction (PLP) features.
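As an illustration of this front-end step, the sketch below frames a waveform and extracts one MFCC vector per frame; the use of the librosa toolkit and the specific window/shift values are assumptions made for the example, not requirements of the method (PLP or other features could be used instead).

```python
# Hedged example: librosa and the 25 ms / 10 ms framing are illustrative choices.
import librosa

def extract_frame_features(wav_path, n_mfcc=13, frame_len=0.025, frame_shift=0.010):
    """Frame the waveform and return one MFCC vector per speech frame."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=int(frame_len * sr),         # ~25 ms analysis window
        hop_length=int(frame_shift * sr))  # ~10 ms frame shift
    return mfcc.T   # shape (num_frames, n_mfcc): one acoustic feature per frame
```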
For each speech frame in the speech frame sequence, in order to extract the speaker representation result corresponding to that frame, the acoustic features of all historical frames before the frame in the sequence first need to be spliced into a feature sequence; then, a speaker representation vector corresponding to the speech frame is estimated under the maximum likelihood criterion by using a pre-constructed speaker recognition model, and this speaker representation vector is used as the representation result of the corresponding speaker. The speaker recognition model usually adopts a global variable space model, and its construction process is as follows: first, a large amount of speech data from many different users is collected; then acoustic features of the speech data are extracted; and then the global variable space model is trained under the maximum a posteriori probability criterion to obtain the speaker recognition model.
Further, after the acoustic features of the speech data in the current session of the user and the speaker representation result (i.e., the corresponding speaker representation vector) corresponding to each speech frame are obtained by the above method, the two may be spliced, and the spliced vector is used as the input of the speech recognition neural network to obtain the output of the neural network, i.e., the acoustic posterior probability value of each state of each phoneme in the speech data is obtained. The output value of the neural network and a decoding algorithm (such as a Viterbi (Viterbi) algorithm) can be further used to perform a search of the decoding network to obtain a final recognition result, thereby completing the speech recognition.
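A minimal sketch of this splicing step is given below, assuming one speaker representation vector per frame; the array names and dimensions are illustrative only. The spliced matrix is what would be fed to the speech recognition neural network to obtain the per-frame acoustic posterior probabilities used by the decoder.

```python
import numpy as np

def splice_with_speaker_vector(frame_feats, speaker_vecs):
    """Concatenate each frame's acoustic feature with the speaker representation
    vector estimated for that frame, forming the neural-network input.
    frame_feats:  (num_frames, feat_dim)
    speaker_vecs: (num_frames, spk_dim)
    returns:      (num_frames, feat_dim + spk_dim)
    """
    assert frame_feats.shape[0] == speaker_vecs.shape[0]
    return np.concatenate([frame_feats, speaker_vecs], axis=1)
```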
However, this online personalized recognition method, which performs personalized speech recognition on the user's current in-session speech data in real time, may suffer from a poor personalization effect. For example, in application scenarios such as voice input methods and voice human-computer interaction, the duration of each session input by the user is very short, usually only a few seconds, so the recognition basis for the user's speech is small, the accuracy of the speaker representation result generated online is reduced, and the accuracy of the subsequent speech recognition result is reduced accordingly.
In order to overcome the above drawbacks, the present application provides a speech recognition method: after a target speech to be recognized is obtained, representation information matched with the target speech is obtained from a pre-constructed memory in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored, and the target speech can then be recognized according to the representation information obtained from the memory. In this way, even when the target speech data is scarce, representation information matched with the target speech can be obtained from the memory to enrich the recognition basis, so that the representation result of the target speech (such as its speaker representation result) can be extracted more accurately from this recognition basis; online personalized speech recognition of the target speech can then be performed based on the extracted representation result, improving the effect and efficiency of the speech recognition.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a speech recognition method provided in this embodiment is shown, where the method includes the following steps:
s101: and acquiring target voice to be recognized.
In this embodiment, any voice subjected to voice recognition by the present embodiment is defined as a target voice. In addition, the present embodiment does not limit the language type of the target speech, for example, the target speech may be a chinese speech, an english speech, or the like; meanwhile, the embodiment also does not limit the length of the target speech, for example, the target speech may be a sentence, or multiple sentences.
It can be understood that the target voice can be obtained by recording and the like according to actual needs, for example, phone call voice or conference recording and the like in daily life of people can be used as the target voice, and after the target voice is obtained, the target voice can be recognized by using the embodiment.
S102: and acquiring the representation information matched with the target voice from a pre-constructed memory, wherein a large number of sample speaker representation results and/or sample speaking environment representation results are stored in the memory.
In this embodiment, after the target speech to be recognized is acquired in step S101, and in order to prevent the small amount of target speech data from degrading the effect and efficiency of recognizing it, the representation information matched with the target speech may first be acquired from the pre-constructed memory; this representation information and the target speech data are then used together as the recognition basis for effectively recognizing the target speech in subsequent step S103. The representation information comprises all or part of the representation information of at least one representation result in the memory; among these, all or part of a sample speaker representation result can represent the speaking characteristics of the speaker to which the target speech belongs, and all or part of a sample speaking environment representation result can represent the environmental characteristics of that speaker's speaking environment.
It should be noted that a large number of different sample speaker representation results and/or sample speech environment representation results are stored in the memory. The sample speaker representation result refers to data representing the tone, gender, age, region of the sample speaker and other personalized information, and can be represented in a vector form or other forms; the sample speaking environment representation result refers to data representing personalized information of the sample speaking environment, and can also be represented in a vector form or other forms, for example, vector data representing noisy speaking environments such as conference rooms and shopping malls, or vector data representing quiet speaking environments such as valleys and libraries.
In practical applications, one of the following two embodiments can be used to obtain different sample speaker representation results in the memory:
in a first embodiment, different speaker representation vectors may be generated as sample speaker representation results in memory using a pre-trained speaker recognition model. Specifically, firstly, voice data of a plurality of speakers are collected as training data, and voice features of the training data are extracted; then, training a speaker recognition model after parameter initialization by using training data and voice characteristics thereof, wherein the speaker recognition model can be a factor analysis model (such as a global variable space model) or a model based on a deep neural network; then, after the speaker recognition model is obtained through training, the speaker recognition model is used for re-recognizing the voice data of each speaker in the training data, and the expression vector corresponding to each speaker is extracted and stored to be used as different sample speaker expression results.
For example, if the global variable space model is obtained by training, the model is used to re-identify the voice data of each speaker in the training data, and the representation vector of each speaker extracted and stored is the acoustic feature I-vector corresponding to each speaker, so that the representation vector can be used as the representation result of the sample speaker in the memory, and the corresponding speaker is used as the sample speaker in the memory; if the model based on the deep neural network is obtained through training, after the model is used for re-identifying the voice data of each speaker in the training data, the representation vector of each speaker is extracted and stored as the output vector of the last hidden layer in the model, and then the representation vector can be used as the representation result of the sample speaker in the memory, and the corresponding speaker is used as the sample speaker in the memory.
In a second embodiment, different speaker representation vectors may be generated as sample speaker representation results in memory using a pre-trained speaker adaptive speech recognition model. Specifically, firstly, collecting voice data of a plurality of speakers as training data, extracting voice characteristics of the training data, and marking the speaker to which each voice data belongs; then, training the pre-trained general neural network speech recognition models in a self-adaptive manner by using the training data and the speech characteristics of each speaker to obtain the self-adaptive speech recognition models corresponding to each speaker, wherein the general neural network speech recognition models are obtained by training a large amount of speech data, the specific training method is consistent with the existing method, and the details are not repeated herein; then, after the adaptive speech recognition model corresponding to each speaker is obtained through training, the adaptive speech recognition model of the speaker can be used for re-recognizing the speech data of the corresponding speaker to obtain the expression vector of the corresponding speaker, and then the expression vector is used as the expression result of the sample speaker in the memory, and the corresponding speaker is used as the sample speaker in the memory.
It should be noted that, in the second embodiment, through training, an adaptive speech recognition model corresponding to each speaker can be obtained, if the training data includes multiple pieces of speech data of the same speaker, the multiple pieces of speech data need to be recognized by the adaptive speech recognition model corresponding to the speaker, and then the output vector of the last hidden layer obtained after each recognition is subjected to arithmetic mean, that is, the elements of each corresponding position in all the output vectors are subjected to arithmetic mean; then, the obtained vector is used as the representation vector of the corresponding speaker and is used as the representation result of the corresponding speaker.
For example, assume that the training data includes 5 pieces of speech data of a speaker A. The 5 pieces of speech data are respectively recognized by the adaptive speech recognition model corresponding to speaker A, and the output vectors of the last hidden layer obtained after recognition are denoted as $[a_1, a_2, \ldots, a_n]$, $[b_1, b_2, \ldots, b_n]$, $[c_1, c_2, \ldots, c_n]$, $[d_1, d_2, \ldots, d_n]$ and $[e_1, e_2, \ldots, e_n]$. Taking the arithmetic mean of these 5 output vectors gives

$$\left[\frac{a_1+b_1+c_1+d_1+e_1}{5},\ \frac{a_2+b_2+c_2+d_2+e_2}{5},\ \ldots,\ \frac{a_n+b_n+c_n+d_n+e_n}{5}\right]$$

and this vector can then be used as the representation vector for speaker A, i.e., as the representation result of speaker A.
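The averaging step can be sketched as follows (a minimal illustration; variable names are hypothetical):

```python
import numpy as np

def speaker_representation_from_utterances(hidden_vectors):
    """hidden_vectors: list of last-hidden-layer output vectors, one per
    utterance of the same speaker, each of shape (n,).
    Returns their element-wise arithmetic mean, used as that sample
    speaker's representation vector (i.e. the memory entry)."""
    return np.mean(np.stack(hidden_vectors, axis=0), axis=0)

# e.g. for speaker A with 5 utterances a, b, c, d, e (each a length-n array):
# rep_a = speaker_representation_from_utterances([a, b, c, d, e])
```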
It should be noted that, in the second embodiment, the adaptive speech recognition model corresponding to each speaker obtained through training may be completely independent, that is, each speaker corresponds to a separate and personalized adaptive speech recognition model. Of course, some model parameters of the adaptive speech recognition models corresponding to these speakers may be shared, for example, parameters of the input layer and parameters of the output layer of the models are shared, but the intermediate layers of the models are different, i.e., each speaker corresponds to a personalized intermediate layer.
As can be seen from the above description of the two embodiments, the second embodiment yields a more accurate speaker representation result than the first, but also takes longer, possibly several times longer. In practical applications, the more suitable embodiment can therefore be selected according to the required accuracy of the speaker representation result and the acceptable acquisition time. Further, if a speaker representation result with higher accuracy is needed, the two embodiments may be combined: the sample speaker representation vectors obtained for the same sample speaker by the two embodiments are spliced, and the spliced vector is used as the final sample speaker representation vector of that speaker, i.e., as the representation result of that sample speaker. For example, if the sample speaker representation vectors obtained for the same sample speaker by the two embodiments have dimensions $f_1$ and $f_2$, they can be spliced into a vector of dimension $f_1 + f_2$, which serves as the final sample speaker representation vector, i.e., as the representation result of that sample speaker.
In addition, one of the two embodiments can be used to obtain different sample speaking environment representation results in the memory, and in a specific implementation process, the "speaker" in each embodiment may be replaced by the "speaking environment" correspondingly, which may specifically refer to the related descriptions of the two embodiments, and will not be described herein again.
It is understood that, in order to reduce the subsequent calculation amount of the whole memory, the storage amount of the sample speaker representation results in the memory may be limited within a preset range, for example, the storage amount of the sample speaker representation results may be limited within 1000. And if the number of the obtained sample speaker representation results is too large, the sample speaker representation results can be clustered through the existing or future clustering algorithm (such as a K-means algorithm), and the clustered class center vector is used as a representative to replace the representation results of all speakers in the cluster and is stored in the memory, so that the requirement of the memory on the number of the sample speaker representation results is met.
Similarly, in order to reduce the subsequent calculation amount on the whole memory, the storage amount of the sample speaking environment representation result in the memory may be limited within a preset range, for example, the storage amount of the sample speaking environment representation result may also be limited within 1000. Moreover, if the number of the obtained sample speaking environment representation results is too large, the sample speaking environment representation results can be clustered through the existing or future clustering algorithm (such as a K-means algorithm), and the clustered class center vector is taken as a representative to replace the representation results of all the speaking environments in the cluster and is stored in the memory, so that the requirement of the memory on the number of the sample speaking environment representation results is met.
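A sketch of this capping step is shown below; K-means from scikit-learn is one possible clustering algorithm (the patent names K-means as an example), and the limit of 1000 entries is the illustrative value mentioned above.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_memory(representation_vectors, max_entries=1000):
    """Keep at most `max_entries` representation results in the memory.
    If more are available, cluster them and store only the class-center
    vectors as representatives of their clusters."""
    reps = np.asarray(representation_vectors)
    if len(reps) <= max_entries:
        return reps
    kmeans = KMeans(n_clusters=max_entries, n_init=10, random_state=0).fit(reps)
    return kmeans.cluster_centers_
```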
In this embodiment, as shown in fig. 2, an implementation process of "obtaining the representation information matched with the target voice from the pre-constructed memory" in step S102 may specifically include steps S201 to S202:
s201: and splitting the target voice to obtain each unit voice.
In this implementation, in order to obtain the representation information matching the target speech from the memory for enriching the recognition bases, the target speech needs to be split first to obtain each unit speech included in the target speech, for example, each unit speech may be each speech frame constituting the target speech, and each speech frame may be a phoneme or a state in a phoneme.
S202: for each unit voice, according to the acoustic feature of the unit voice, the representation information matched with the unit voice is obtained from the memory.
In this implementation, after obtaining each unit voice included in the target voice in step S201, feature extraction may be performed on each unit voice to extract an acoustic feature of each unit voice, where the acoustic feature may be an MFCC feature or a PLP feature of the corresponding unit voice.
Then, the extracted acoustic features of each unit voice, and the sample speaker representation result and/or the sample speaking environment representation result stored in the memory can be subjected to data processing, and the representation information matched with each unit voice can be respectively obtained from the memory according to the processing result, and further, the representation information matched with each unit voice can be integrated, and the integrated representation information is used as the representation information matched with the target voice.
The representation information comprises all or part of the representation information of at least one representation result in the memory; among these, all or part of a sample speaker representation result can represent the speaking characteristics of the speaker to which the unit speech belongs, and all or part of a sample speaking environment representation result can represent the environmental characteristics of the speaking environment of that speaker.
It should be noted that a specific implementation manner of the step S202 will be described in the second embodiment.
S103: based on the presentation information, the target speech is recognized.
In this embodiment, in step S102, after the representation information matching the target speech is acquired from the pre-constructed memory, the target speech can be further recognized according to the representation information. Specifically, the representation information and the acoustic features of the target speech may be used to predict an acoustic posterior probability value corresponding to each unit speech in the target speech, for example, when the unit speech is a phoneme, the acoustic posterior probability value corresponding to the unit speech is a posterior probability value when the unit speech belongs to each phoneme type (each phoneme type of the language to which the unit speech belongs). Then, these posterior probability values are used to search the decoding network through a decoding algorithm (such as a Viterbi algorithm) to obtain the recognition result of the target voice.
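For illustration, the toy decoder below runs a plain Viterbi search over frame-level state scores; a real system searches a full decoding network built from the lexicon and language model, so this sketch (with assumed inputs) only shows the dynamic-programming core.

```python
import numpy as np

def viterbi_decode(log_post, log_trans, log_prior):
    """Toy Viterbi search.
    log_post:  (T, S) frame-level log posterior scores for S states
    log_trans: (S, S) log transition scores (previous state -> current state)
    log_prior: (S,)   initial log scores
    Returns the highest-scoring state sequence of length T."""
    T, S = log_post.shape
    delta = log_prior + log_post[0]
    backptr = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans           # (S, S) prev -> cur
        backptr[t] = np.argmax(scores, axis=0)
        delta = scores[backptr[t], np.arange(S)] + log_post[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```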
It should be noted that a specific implementation manner of the step S103 will be described in the second embodiment.
In summary, after the target speech to be recognized is obtained, the speech recognition method provided in this embodiment obtains the representation information matched with the target speech from the pre-constructed memory, in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored, and then recognizes the target speech according to the representation information obtained from the memory. Because the memory stores a large number of sample speaker representation results and/or sample speaking environment representation results, representation information matched with the speaker and/or the speaking environment of the target speech can be obtained from the memory to enrich the recognition basis for the target speech, so that the effect and efficiency of speech recognition can be improved when online personalized speech recognition is performed on the target speech.
Second embodiment
Next, this embodiment will describe a specific implementation procedure of step S202 "acquiring the presentation information matching the unit voice from the memory according to the acoustic feature of the unit voice" in the first embodiment.
Referring to fig. 3, a schematic diagram of a process for acquiring the representation information matched with the unit voice from the memory according to the acoustic features of the unit voice provided by this embodiment is shown, where the process includes the following steps:
s301: the acoustic characteristics of the unit voice are used as the input of the voice recognition model, the voice recognition model is used for recognizing each network layer of the network, and the initial representation result of the unit voice is sequentially output.
In this embodiment, after obtaining each unit voice included in the target voice in step S201, the acoustic feature extraction may be performed on each unit voice to obtain an acoustic feature corresponding to each unit voice, and then, according to the acoustic features, an initial feature vector corresponding to each unit voice may be generated by using a vector generation method to serve as an initial representation result corresponding to each unit voice. It should be noted that, in the following content, how to perform data processing on a unit voice is described with reference to a unit voice of a target voice, so as to obtain a corresponding initial representation result, and the processing manners of other unit voices are similar, and are not described in detail.
Specifically, the pre-constructed speech recognition model of the present embodiment may be formed by a multi-layer network, as shown in fig. 4, and the model structure includes an input layer, a recognition network, a memory coding module, a control module, and an output layer.
The input layer is configured to input an acoustic feature of a unit speech, and taking the unit speech as a speech frame as an example, the data input in the input layer is an acoustic feature such as an MFCC feature or a PLP feature of the speech frame.
The recognition network is used for transforming the acoustic features of the unit speech input by the input layer and outputting the transformed feature vectors to the output layer. As shown in fig. 4, the recognition network may be formed by a deep neural network comprising a plurality of network layers, where each network layer in turn adjusts the feature vector of the unit speech output by the previous network layer, so that every network layer outputs a feature vector of the unit speech. The feature vector of the unit speech output by a network layer is defined as the initial feature vector output by that layer and is used as the corresponding initial representation result of the unit speech, which can be denoted by h; that is, this embodiment updates the initial representation result h of the unit speech layer by layer through the network layers of the recognition network.
Taking the unit speech as the t-th speech frame in the target speech as an example, the initial representation result output by this speech frame at the l-th layer of the recognition network can be written as $h_t^l$, with $h_t^l \in \mathbb{R}^{D_l}$, where $\mathbb{R}$ denotes the real numbers, $D_l$ is the dimension of the initial representation result output by the l-th layer, $l = 1, 2, \ldots, N$, and N is the total number of network layers. At the same time, a control parameter for adjusting the initial representation result can be generated by the control module based on $h_t^l$; that is, the initial representation result $h_t^l$ output by each network layer corresponds to an adjusted target representation result $\hat{h}_t^l$, and the manner of generating the target representation result will be described in step S3011 below. Based on this, the initial representation result $h_t^l$ output by each network layer can be generated from the network parameters of that layer, i.e.

$$h_t^l = f\!\left(\hat{h}_t^{\,l-1},\ \hat{h}_{t-1}^{\,l}\right)$$

where f is a transformation function, $\hat{h}_t^{\,l-1}$ denotes the target representation result output by the t-th speech frame at layer l-1, and $\hat{h}_{t-1}^{\,l}$ denotes the target representation result output by the (t-1)-th speech frame at layer l.
It should be noted that this embodiment does not limit the structure of the deep neural network in the recognition network; for example, it may be a unidirectional or bidirectional long short-term memory (LSTM) structure, or a convolutional neural network (CNN) structure. Which network structure is adopted can be chosen according to the actual situation and is not limited in this embodiment of the present application. For example, in practice, for a large-vocabulary speech recognition task with abundant model training data, the deep neural network in the recognition network may generally adopt a bidirectional LSTM network with 5 to 10 layers, while for a limited-domain speech recognition task with less model training data it may generally adopt a unidirectional LSTM network with 1 to 3 layers.
Further, in order to improve the computational efficiency of the model, it may be selected to insert a downsampling layer between a plurality of network layers included in the identification network, for example, one downsampling layer may be inserted between every two adjacent network layers, that is, a plurality of downsampling layers are inserted in common, or only one downsampling layer may be inserted between any two adjacent network layers, that is, one downsampling layer is inserted in common.
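A rough PyTorch sketch of such a recognition-network layout is given below; the layer count, hidden size, bidirectionality and the position/stride of the down-sampling are illustrative assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class RecognitionNetwork(nn.Module):
    """Stack of LSTM layers with one simple time-down-sampling step."""
    def __init__(self, feat_dim=40, hidden=512, layers=5, bidirectional=True):
        super().__init__()
        self.lstms = nn.ModuleList()
        in_dim = feat_dim
        for _ in range(layers):
            self.lstms.append(nn.LSTM(in_dim, hidden, batch_first=True,
                                      bidirectional=bidirectional))
            in_dim = hidden * (2 if bidirectional else 1)

    def forward(self, x):                    # x: (batch, frames, feat_dim)
        for i, lstm in enumerate(self.lstms):
            x, _ = lstm(x)
            if i == 1:                       # one down-sampling layer: keep every 2nd frame
                x = x[:, ::2, :]
        return x                             # per-frame representations for the output layer
```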
Next, how each network layer of the recognition network outputs "the initial expression result of the unit voice in sequence" will be described.
An alternative implementation manner is that, as shown in fig. 5, the implementation process of causing each network layer of the recognition network of the speech recognition model to sequentially output the initial representation result of the unit speech in step S301 may specifically include steps S3011 to S3012:
s3011: and sequentially taking each network layer of the recognition network of the speech recognition model as a current layer, and adjusting the initial representation result of the current layer by using the control parameters to obtain a target representation result of the unit speech corresponding to the current layer, wherein the control parameters are used for enabling the target representation result to approach the actual representation result of the unit speech.
In this implementation, in order to enable each network layer of the recognition network of the speech recognition model to sequentially output the initial representation result of the unit speech, that is, to realize the layer-by-layer update of the initial representation result, each network layer of the recognition network may be used in turn as the current layer, from the input layer towards the output layer. The initial representation result h output by the current layer is then adjusted by using the control parameter (denoted g) output by the control module of the speech recognition model, and the adjusted representation result is defined as the target representation result (denoted $\hat{h}$) of the unit speech corresponding to the current layer.
The control parameter of the current layer is used to adjust the initial representation result h of the unit speech output by the current layer so that the adjusted target representation result $\hat{h}$ better approximates the actual representation result of the unit speech. It should be noted that the control parameters of the network layers of the recognition network are generated based on the initial representation results output by the respective layers, so the control parameters of different network layers may be the same or different.
In a possible implementation manner of this embodiment, the control parameter of the current layer is generated according to the presentation information that is obtained from the memory and matches with the initial presentation result output by the current layer.
Specifically, as shown in fig. 4, in the speech recognition model constructed in this embodiment, the memory coding module is connected to each network layer of the recognition network, to the memory, and to the control module, so that the representation information related to the initial representation result output by the current layer can be obtained from the memory through the memory coding module, and the control parameter of the current layer can then be generated according to the representation information output by the memory coding module.
Since a large number of sample speaker representation results and/or sample speaking environment representation results are stored in the memory, an alternative implementation manner is that the representation information related to the initial representation result output by the current layer can be obtained from the memory through the memory coding module. In specific implementation, the target speaker representation result can be generated according to the correlation between the initial representation result and each sample speaker representation result in the memory, and/or the target speaking environment representation result can be generated according to the correlation between the initial representation result and each sample speaking environment representation result in the memory, so that the generated target speaker representation result and/or the target speaking environment representation result can be used as representation information which is obtained from the memory and is related to the initial representation result output by the current layer.
Next, how the "targeted speaker representation result" and the "targeted utterance environment representation result" are generated will be described.
The memory coding module can be used for determining the correlation degree between each sample speaker representation result in the memory and the initial representation result of the unit voice; then, according to these correlation degrees, the sample speaker representation results in the memory are linearly combined to generate a representation result capable of representing the voice characteristics of the speaker to which the unit voice belongs, and this representation result is defined as the target speaker representation result. For example, taking the unit speech as the t-th speech frame in the target speech and the current layer as the l-th layer, the target speaker representation result of the speech frame generated by the memory coding module can be represented as $s_t^l$.
Wherein, when determining the correlation degree between each sample speaker representation result in the memory and the initial representation result of the unit voice, a combination coefficient characterizing the correlation degree between each sample speaker representation result in the memory and the initial representation result can be generated. The combination coefficient comprises a coefficient corresponding to each sample speaker representation result, the larger the coefficient is, the higher the correlation degree between the corresponding sample speaker representation result and the initial representation result is, and conversely, the smaller the coefficient is, the lower the correlation degree between the corresponding sample speaker representation result and the initial representation result is.
In this embodiment, the combination coefficients may be generated by the network layers of the memory coding module. Specifically, the memory coding module may be formed by three or more layers of neural networks, comprising an input layer, fully-connected layers, and an output layer. As shown in fig. 4, the input layer of the memory coding module is used to input each sample speaker representation result in the memory together with the initial representation result of the unit speech output at the current layer; alternatively, in order to improve the coding effect, the arithmetic mean of the initial representation results output at the current layer by the unit speech and all historical unit speech before it may be used as the input data. For example, taking the unit speech as the t-th speech frame in the target speech, and assuming that the initial representation result output by the speech frame at the current layer is the initial feature vector output at that layer, the arithmetic mean of the initial feature vectors output at the current layer by the t-th frame and all previous historical frames (frame t-1, frame t-2, …) can be used as the input data of the input layer of the memory coding module. One or more fully-connected layers follow the input layer; the number of fully-connected layers can be fewer than 3, each containing fewer than 512 nodes. After the data output by the input layer is encoded by the fully-connected layers, the output layer of the memory coding module can, based on the output of the fully-connected layers, generate a combination coefficient characterizing the correlation between each sample speaker representation result in the memory and the initial representation result of the unit speech; the combination coefficient is denoted α. Taking the t-th speech frame of the unit speech and the i-th sample speaker representation result in the memory as an example, and assuming the current layer of the recognition network is the l-th layer, the output layer of the memory coding module outputs the coefficient $\alpha_{t,i}^{l}$ corresponding to the current layer. The coefficient corresponding to every sample speaker representation result can be obtained in the same way; together these coefficients form the combination coefficients, from which the target speaker representation result of the t-th speech frame can be calculated as follows:

$$s_t^l = \sum_{i=1}^{M} \alpha_{t,i}^{l}\, m_i$$

where $\alpha_{t,i}^{l}$ denotes the correlation between the initial representation result output by the t-th speech frame of the target speech at the l-th layer of the recognition network and the i-th sample speaker representation result in the memory; M denotes the total number of sample speaker representation results in the memory; $m_i$ denotes the i-th sample speaker representation result in the memory; and $s_t^l$ denotes the target speaker representation result of the t-th speech frame.
Similarly, in the above implementation, the target speaking environment representation result of the unit speech can also be calculated. In a specific calculation process, only the "sample speaker representation result" in the memory needs to be replaced with the "sample speaking environment representation result", and the specific calculation process can refer to the related description of the above implementation manner, which is not described herein again.
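The combination-coefficient computation can be sketched as a simple attention-style read over the memory. In this sketch the scoring network is a single linear layer over the concatenated query and memory entry, and the coefficients are normalized with a softmax; both choices are assumptions made for the example, since the patent only requires a small fully-connected network that outputs the coefficients.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def read_memory(h, memory, w, b):
    """Sketch of the memory-encoding step for one layer and one frame.
    h:      (d,)    initial representation result output by the current layer
    memory: (M, m)  sample speaker (or speaking-environment) representations
    w, b:   assumed scoring parameters, w: (M, d + m), b: (M,)
    Returns (alpha, target): the combination coefficients and the weighted
    sum of memory entries, i.e. the target representation result."""
    scores = np.array([w[i] @ np.concatenate([h, memory[i]]) + b[i]
                       for i in range(len(memory))])
    alpha = softmax(scores)          # correlation of h with each memory entry
    target = alpha @ memory          # sum_i alpha_i * m_i
    return alpha, target
```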
Therefore, through the memory coding module, the representation information related to the initial representation result output by the current layer can be obtained from the memory, and the three forms can be included: the first is to generate a target speaker representation; the second one is to generate a target speaking environment representation result; and the third one is to generate the target speaker representation result and the target speaking environment representation result.
Furthermore, after the representation information related to the initial representation result output by the current layer is acquired from the memory through the memory coding module, the control parameter can be generated by the control module by using the representation information.
Specifically, after obtaining the representation information related to the initial representation result output by the current layer from the memory, the memory coding module sends the representation information to the control module in the speech recognition model, as shown in fig. 4, where the control module in the speech recognition model connects the memory coding module and the recognition network, and more specifically, connects the memory coding module and each network layer in the recognition network.
In practical applications, the control module may be composed of three or more layers of neural networks (typically a multi-layer feedforward network), including an input layer, an intermediate layer, and an output layer. The input layer is used to input the representation information output by the memory coding module, i.e., the target speaker representation result and/or the target speaking environment representation result. The intermediate layer consists of multiple fully-connected layers, whose number equals the number of network layers of the recognition network. The output layer is composed of N parts, where N is the total number of network layers in the recognition network, and each of the N parts corresponds to one network layer of the recognition network, so that through these N parts the output layer can output the control parameter corresponding to each network layer. Therefore, when the initial feature vector output by a network layer is used as its initial representation result, the number of nodes in each of the N parts of the output layer equals the dimension of the initial feature vector output by the corresponding network layer of the recognition network, which ensures that the dimension of the control parameter vector output by each part equals the dimension of the initial feature vector output by the corresponding network layer.
When the control parameter corresponding to the current layer of the recognition network, generated by the control module, adjusts the initial representation result of the unit voice output by the current layer to obtain the target representation result of the unit voice corresponding to the current layer, the control parameter not only makes the target representation result approach the actual representation result of the unit voice, but also suppresses peripheral noise of the unit voice, such as the voices of surrounding speakers other than the speaker to whom the unit voice belongs, and environmental noise.
Further, for convenience of calculation, a normalization operation may be performed on the control parameter vector corresponding to the current layer to constrain its values to the range between 0 and 1. Specifically, the normalization may be performed with the sigmoid function, using the following formula:

g = sigmoid(x) = 1 / (1 + e^(-x))

where g represents the normalized control parameter vector and x represents the control parameter vector before normalization (the sigmoid is applied element-wise).
Furthermore, the normalized control parameter vector g can be used to adjust the initial representation result h of the unit voice output by the current layer, so as to obtain the target representation result ĥ of the unit voice corresponding to the current layer.

In a specific adjustment process, when the initial feature vector output by the current layer serves as the initial representation result h of the unit voice output by the current layer, the control parameter vector g may be multiplied element by element with the initial feature vector h, using the following adjustment formula:

ĥ_j = g_j · h_j

where ĥ_j represents the j-th dimension element of the target representation result ĥ of the unit voice corresponding to the current layer, g_j represents the j-th dimension element of the normalized control parameter vector g, and h_j represents the j-th dimension element of the initial representation result h of the unit voice output by the current layer.
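To make the normalization and adjustment formulas above concrete, the following minimal sketch (a PyTorch illustration under assumed shapes, not code from the original disclosure) normalizes a control parameter vector with the sigmoid function and applies it element-wise to the initial representation of the current layer:

```python
import torch

def apply_control(h, x):
    """h: initial representation result of the current layer, shape [dim];
    x: un-normalized control parameter vector for that layer, shape [dim]."""
    g = torch.sigmoid(x)  # g = 1 / (1 + e^(-x)), every element in (0, 1)
    return g * h          # element-wise gating: h_hat_j = g_j * h_j
```

The gated result would then replace h as the input of the next layer, as step S3012 describes next.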
S3012: and taking the target representation result of the unit voice as the input of the next layer of the current layer to obtain the initial representation result output by the next layer.
In the present embodiment, after the target representation result ĥ of the unit voice corresponding to the current layer is acquired in step S3011, the target representation result ĥ may be used as the input of the next layer of the current layer, and the network parameters of the next layer (e.g., the transformation function f described above) are applied to ĥ to obtain the initial representation result h output by the next layer.
S302: and for the initial representation result output by each network layer, obtaining the representation information matched with the initial representation result from the memory.
As described in step S301 above, in order to obtain the representation information matching the initial representation result from the memory, specifically, the target speaker representation result may be generated according to the correlation between the initial representation result and each sample speaker representation result in the memory, and/or the target speaking environment representation result may be generated according to the correlation between the initial representation result and each sample speaking environment representation result in the memory. That is, the generated target speaker representation result and/or target speaking environment representation result serves as the representation information, matched with the initial representation result, that is obtained from the memory.
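One way to realize this correlation-based lookup, shown purely as an assumption here (the patent does not specify the correlation measure or the weighting scheme), is a soft attention over the stored sample representation results, in which the target representation is a similarity-weighted combination of the memory entries:

```python
import torch
import torch.nn.functional as F

def lookup_memory(initial_repr, memory):
    """initial_repr: [dim] initial representation result output by a network layer;
    memory: [num_samples, dim] sample speaker (or sample speaking environment)
    representation results stored in the memory."""
    # Correlation between the initial representation and every memory entry.
    sims = F.cosine_similarity(initial_repr.unsqueeze(0), memory, dim=-1)
    weights = F.softmax(sims, dim=0)  # normalize the correlations into weights
    # Weighted combination of memory entries = target representation result.
    return (weights.unsqueeze(-1) * memory).sum(dim=0)
```

Calling the same function with a speaker memory or a speaking-environment memory would yield the target speaker representation result or the target speaking environment representation result, respectively.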
It can be seen that, for the initial representation result of the unit voice output by each network layer, the representation information matched with the initial representation result can be obtained from the memory, so that each unit voice of the target voice corresponds to a set of matched representation information, and the embodiment can perform voice recognition on the target voice based on the representation information.
Specifically, after the target representation result of the unit voice corresponding to each network layer of the recognition network is obtained in step S3011, step S103 "recognizing the target voice according to the representation information" can be further implemented. Referring to fig. 6, the specific flow includes the following steps:
S601: acquiring the target representation result of each unit voice corresponding to the last layer in the recognition network.
In this embodiment, after the target speech is split in step S201 to obtain each unit voice, the acoustic features of each unit voice may be sequentially input into the speech recognition model shown in fig. 4, and the target representation result ĥ of each unit voice corresponding to the last network layer of the model's recognition network can be obtained through the model.
S602: and identifying the target voice according to the acquired target representation result of each unit voice.
In this embodiment, after the target representation result of each unit voice corresponding to the last layer of the recognition network is obtained in step S601, the target representation result may be input to the output layer of the speech recognition model, and the output layer normalizes it (for example, with a softmax function) to obtain the acoustic posterior probability value corresponding to each unit voice. These posterior probability values can then be used to perform a decoding network search through a decoding algorithm (such as the Viterbi algorithm) to obtain the recognition result of the target speech.
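As a hedged illustration of this output stage (the decoding-network search itself is omitted, and the projection from the last-layer representation to acoustic-unit scores is an assumed linear layer, not specified in the original disclosure), the posterior computation could be sketched as follows:

```python
import torch.nn.functional as F

def acoustic_posteriors(last_layer_repr, output_proj):
    """last_layer_repr: [dim] target representation result of one unit voice from
    the last recognition-network layer; output_proj: a linear layer mapping
    dim -> number of acoustic units (an assumption for this sketch)."""
    logits = output_proj(last_layer_repr)
    return F.softmax(logits, dim=-1)  # acoustic posterior probability values
```

The per-unit posteriors produced this way would then drive a decoding search, for example with the Viterbi algorithm, to obtain the final recognition result.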
Next, the present embodiment will specifically describe the training process of the speech recognition model:
in order to train the speech recognition model, a large amount of speech data from many different users needs to be collected as training data; acoustic features are then extracted from the speech data. Using the training data and their acoustic features, the cross-entropy function can serve as the optimization target of the model, and the model parameters are continuously updated through an error back-propagation algorithm, where the model parameters refer to the weight transformation matrices and corresponding biases connecting the layers of the recognition network, the control module, and the memory coding module of the model. In the updating process, the model parameters can be updated over multiple iterations, and when the preset convergence target is reached (i.e., the cross-entropy function reaches the preset value), the iterations stop, the update of the model parameters is completed, and the trained speech recognition model is obtained.
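A heavily simplified training loop consistent with this description might look like the sketch below; the `model` object (recognition network plus control module and memory coding module), the data loader, and all hyperparameters are hypothetical placeholders rather than details from the original disclosure:

```python
import torch
import torch.nn as nn

def train(model, data_loader, epochs=10, lr=1e-3):
    """model(features) is assumed to return per-frame acoustic logits; every batch
    provides acoustic features and the corresponding frame-level labels."""
    criterion = nn.CrossEntropyLoss()                       # cross entropy as the optimization target
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                                 # update over multiple iterations
        for features, labels in data_loader:
            logits = model(features)                        # forward pass through the model
            loss = criterion(logits, labels)
            optimizer.zero_grad()
            loss.backward()                                 # error back propagation
            optimizer.step()                                # update weight matrices and biases
    # In practice the iterations would stop once the preset convergence target
    # (the cross entropy reaching its preset value) is met.
```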
In summary, in this embodiment, by using the pre-constructed speech recognition model, the representation information matched with each unit voice of the target voice can be obtained from the memory according to the acoustic features of that unit voice, which amounts to obtaining representation information matched with the speaker and/or the speaking environment of the target voice. The obtained representation information can therefore enrich the recognition basis of the target voice, so that the speech recognition effect and efficiency are improved when performing online personalized speech recognition on the target voice.
Third embodiment
In this embodiment, a speech recognition apparatus will be described, and for related contents, please refer to the above method embodiment.
Referring to fig. 7, a schematic diagram of a speech recognition apparatus provided in this embodiment is shown, where the apparatus 700 includes:
a target voice acquiring unit 701 configured to acquire a target voice to be recognized;
a representation information acquiring unit 702, configured to acquire representation information matched with the target voice from a pre-constructed memory, in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored;
a target speech recognition unit 703, configured to recognize the target speech according to the representation information.
In an implementation manner of this embodiment, the representation information acquiring unit 702 includes:
the unit voice obtaining subunit is used for splitting the target voice to obtain each unit voice;
and a representation information acquiring subunit, configured to acquire representation information matched with the unit voice from the memory according to the acoustic features of the unit voice.
In an implementation manner of this embodiment, the representation information acquiring subunit includes:
a first initial result obtaining subunit, configured to use the acoustic features of the unit voice as input of a voice recognition model, so that each network layer of a recognition network of the voice recognition model sequentially outputs an initial representation result of the unit voice;
and a first representation information acquiring subunit, configured to acquire the representation information matched with the initial representation result from the memory.
In an implementation manner of this embodiment, the first initial result obtaining subunit includes:
a first target result obtaining subunit, configured to enable each network layer of the recognition network of the speech recognition model to serve as a current layer in sequence, adjust an initial representation result of the current layer by using a control parameter, and obtain a target representation result of the unit speech corresponding to the current layer, where the control parameter is used to enable the target representation result to approach an actual representation result of the unit speech;
and the second initial result obtaining subunit is configured to use the target representation result as an input of a next layer of the current layer to obtain an initial representation result output by the next layer.
In one implementation of this embodiment, the control parameter is further used to suppress peripheral noise of the unit voice.
In one implementation manner of this embodiment, the control parameter is generated according to the representation information that is obtained from the memory and matches with the initial representation result output by the current layer.
In an implementation manner of this embodiment, the first representation information acquiring subunit is specifically configured to:
generating a target speaker representation result according to the correlation degree between the initial representation result and each sample speaker representation result in the memory;
and/or generating a target speaking environment representation result according to the correlation degree between the initial representation result and each sample speaking environment representation result in the memory.
In an implementation manner of this embodiment, the target speech recognition unit 703 includes:
a second target result obtaining subunit, configured to obtain a target representation result of each unit voice corresponding to the last layer in the recognition network;
and the target voice recognition subunit is used for recognizing the target voice according to the acquired target representation result of each unit voice.
Further, an embodiment of the present application further provides a speech recognition device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice recognition method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the foregoing speech recognition method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above-mentioned speech recognition method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A speech recognition method, comprising:
acquiring target voice to be recognized;
acquiring the representation information matched with the target voice from a pre-constructed memory, wherein a large number of sample speaker representation results and/or sample speaking environment representation results are stored in the memory;
according to the representation information, the target voice is recognized;
the obtaining of the representation information matched with the target voice from the pre-constructed memory comprises:
splitting the target voice to obtain each unit voice;
taking the acoustic characteristics of the unit voice as the input of a voice recognition model, and adjusting the initial representation results of each network layer of the recognition network of the voice recognition model by using control parameters so that each network layer of the recognition network of the voice recognition model sequentially outputs the initial representation results of the unit voice;
and acquiring the representation information matched with the initial representation result from the memory.
2. The method according to claim 1, wherein the adjusting the initial representation result of each network layer of the recognition network of the speech recognition model by using the control parameter to make each network layer of the recognition network of the speech recognition model output the initial representation result of the unit speech in turn comprises:
using each network layer of the recognition network of the speech recognition model as a current layer in sequence, adjusting an initial representation result of the current layer by using a control parameter to obtain a target representation result of the unit speech corresponding to the current layer, wherein the control parameter is used for enabling the target representation result to approach an actual representation result of the unit speech;
and taking the target representation result as the input of the next layer of the current layer to obtain the initial representation result output by the next layer.
3. The method according to claim 1, wherein the control parameter is further used to suppress ambient noise of the unit speech.
4. The method of claim 2, wherein the control parameter is generated according to the representation information obtained from the memory and matched with the initial representation result output by the current layer.
5. The method according to any one of claims 1 to 4, wherein said retrieving the representation information matching the initial representation result from the memory comprises:
generating a target speaker representation result according to the correlation degree between the initial representation result and each sample speaker representation result in the memory;
and/or generating a target speaking environment representation result according to the correlation degree between the initial representation result and each sample speaking environment representation result in the memory.
6. The method according to any one of claims 2 to 4, wherein the recognizing the target speech according to the representation information comprises:
acquiring target representation results of each unit voice corresponding to the last layer in the recognition network;
and identifying the target voice according to the acquired target representation result of each unit voice.
7. A speech recognition apparatus, comprising:
the target voice acquiring unit is used for acquiring target voice to be recognized;
a representation information acquiring unit for acquiring representation information matched with the target voice from a pre-constructed memory, in which a large number of sample speaker representation results and/or sample speaking environment representation results are stored;
the target voice recognition unit is used for recognizing the target voice according to the representation information;
the presentation information acquisition unit includes:
the unit voice obtaining subunit is used for splitting the target voice to obtain each unit voice;
a first initial result obtaining subunit, configured to use acoustic features of the unit voice as input of a voice recognition model, adjust initial representation results of each network layer of a recognition network of the voice recognition model by using a control parameter, so that each network layer of the recognition network of the voice recognition model sequentially outputs the initial representation results of the unit voice;
and a first representation information acquiring subunit, configured to acquire the representation information matched with the initial representation result from the memory.
8. The apparatus of claim 7, wherein the first initial result obtaining subunit comprises:
a first target result obtaining subunit, configured to enable each network layer of the recognition network of the speech recognition model to serve as a current layer in sequence, adjust an initial representation result of the current layer by using a control parameter, and obtain a target representation result of the unit speech corresponding to the current layer, where the control parameter is used to enable the target representation result to approach an actual representation result of the unit speech;
and the second initial result obtaining subunit is configured to use the target representation result as an input of a next layer of the current layer to obtain an initial representation result output by the next layer.
9. The apparatus of claim 8, wherein the control parameter is generated according to the representation information obtained from the memory and matched with the initial representation result output by the current layer.
10. The apparatus according to any one of claims 7 to 9, wherein the first representation information acquiring subunit is specifically configured to:
generating a target speaker representation result according to the correlation degree between the initial representation result and each sample speaker representation result in the memory;
and/or generating a target speaking environment representation result according to the correlation degree between the initial representation result and each sample speaking environment representation result in the memory.
11. A speech recognition device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-6.
12. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-6.
CN201910130555.0A 2019-02-21 2019-02-21 Voice recognition method and device Active CN109903750B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910130555.0A CN109903750B (en) 2019-02-21 2019-02-21 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN109903750A CN109903750A (en) 2019-06-18
CN109903750B true CN109903750B (en) 2022-01-04

Family

ID=66945180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910130555.0A Active CN109903750B (en) 2019-02-21 2019-02-21 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN109903750B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289297A (en) * 2019-07-25 2021-01-29 阿里巴巴集团控股有限公司 Speech synthesis method, device and system
CN112530418A (en) * 2019-08-28 2021-03-19 北京声智科技有限公司 Voice wake-up method, device and related equipment
CN111223488B (en) * 2019-12-30 2023-01-17 Oppo广东移动通信有限公司 Voice wake-up method, device, equipment and storage medium
CN111883181A (en) * 2020-06-30 2020-11-03 海尔优家智能科技(北京)有限公司 Audio detection method and device, storage medium and electronic device
CN112270923A (en) * 2020-10-22 2021-01-26 江苏峰鑫网络科技有限公司 Semantic recognition system based on neural network
CN112599118B (en) * 2020-12-30 2024-02-13 中国科学技术大学 Speech recognition method, device, electronic equipment and storage medium
KR20240033525A (en) * 2022-09-05 2024-03-12 삼성전자주식회사 Electronic device for updating a target speaker using a voice signal in an audio signal and target speaker updating method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103714812A (en) * 2013-12-23 2014-04-09 百度在线网络技术(北京)有限公司 Voice identification method and voice identification device
US10079022B2 (en) * 2016-01-05 2018-09-18 Electronics And Telecommunications Research Institute Voice recognition terminal, voice recognition server, and voice recognition method for performing personalized voice recognition
CN106952648A (en) * 2017-02-17 2017-07-14 北京光年无限科技有限公司 A kind of output intent and robot for robot
CN107146615A (en) * 2017-05-16 2017-09-08 南京理工大学 Audio recognition method and system based on the secondary identification of Matching Model
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Feedforward Sequential Memory Networks: A new structure to learn long-term dependency"; Shiliang Zhang et al.; arXiv:1512.08301; Dec. 28, 2015; pp. 1-12 *
"Research Progress and Prospects of Speech Recognition Technology"; Wang Haikun et al.; Telecommunications Science (电信科学); Feb. 2018 (No. 2); pp. 2018095-1 to 2018095-11 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant