WO2024045475A1 - Speech recognition method, apparatus, device and medium - Google Patents

Speech recognition method, apparatus, device and medium

Info

Publication number
WO2024045475A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
language
probability
candidate
model
Prior art date
Application number
PCT/CN2023/072417
Other languages
English (en)
French (fr)
Inventor
邵俊尧
蒋正翔
钱胜
付晓寅
王海峰
贾磊
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Priority to KR1020247014438A (patent KR20240067971A, ko)
Publication of WO2024045475A1 (zh)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting

Definitions

  • the present disclosure relates to the field of artificial intelligence, specifically to speech recognition, natural language processing, deep learning and other technical fields, and in particular to a speech recognition method, device, equipment and medium.
  • an acoustic model built based on deep learning technology can be used to recognize speech to convert the collected speech into text.
  • the present disclosure aims to provide a speech recognition method, device, equipment and medium.
  • a speech recognition method including: using an acoustic model to process the speech data to be recognized and the recognized first text segment to obtain respective acoustic probabilities of multiple candidate text segments; using the first language sub-model in the language model to process the first text segment to obtain the initial language probabilities of the multiple candidate text segments; using the constraint sub-model in the language model to process the first text segment to obtain the expandable relationship of each of the multiple candidate text segments with respect to the first text segment; adjusting the initial language probabilities of the candidate text segments according to the expandable relationship to obtain the first language probability of each of the multiple candidate text segments; and determining the target text segment among the multiple candidate text segments according to the first language probability and the acoustic probability, to obtain a text sequence for the speech data to be recognized.
  • the constrained sub-model is trained based on the text in the predetermined text set.
  • a speech recognition device including: an acoustic probability acquisition module, configured to use an acoustic model to process the speech data to be recognized and the recognized first text segment to obtain the acoustic probability of each of multiple candidate text segments; a module configured to use the first language sub-model in the language model to process the first text segment to obtain the initial language probabilities of the multiple candidate text segments; an extended relationship acquisition module, configured to use the constraint sub-model in the language model to process the first text segment to obtain the expandable relationship of each of the multiple candidate text segments with respect to the first text segment; a probability adjustment module, configured to adjust the initial language probabilities of the candidate text segments according to the expandable relationship to obtain the first language probability of each of the multiple candidate text segments; and a text determination module, configured to determine the target text segment among the multiple candidate text segments according to the first language probability and the acoustic probability, to obtain a text sequence for the speech data to be recognized.
  • an electronic device including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to execute the speech recognition method provided by the present disclosure.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the speech recognition method provided by the present disclosure.
  • a computer program product including a computer program/instruction that, when executed by a processor, implements the speech recognition method provided by the present disclosure.
  • Figure 1 is a schematic diagram of an application scenario of a speech recognition method and device according to an embodiment of the present disclosure
  • Figure 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of the principle of obtaining initial language probabilities of multiple candidate text segments according to the first embodiment of the present disclosure
  • Figure 4 is a schematic diagram of the principle of obtaining the first language probabilities of multiple alternative text segments according to the second embodiment of the present disclosure
  • Figure 5 is a schematic structural diagram of a language model according to an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of the principle of determining a target text segment according to the first embodiment of the present disclosure
  • Figure 7 is a schematic diagram of the principle of determining a target text segment according to the second embodiment of the present disclosure.
  • Figure 8 is a schematic diagram of the principle of determining a target text segment according to the third embodiment of the present disclosure.
  • Figure 9 is a schematic diagram of the generation principle of negative samples for training constrained sub-models according to an embodiment of the present disclosure.
  • Figure 10 is a structural block diagram of a speech recognition device according to an embodiment of the present disclosure.
  • FIG. 11 is a block diagram of an electronic device used to implement the speech recognition method according to an embodiment of the present disclosure.
  • speech recognition acoustic modeling technology can be used to complete speech recognition tasks. For example, the accuracy of speech recognition can be improved by building an end-to-end attention model.
  • relying only on the acoustic model to perform recognition tasks makes it difficult to meet the high accuracy requirements of speech recognition for specific businesses. This is because the training data of acoustic models is usually limited and cannot cover the full variety of business fields.
  • the business needs of various business fields usually change with current hot topics.
  • iterative updates of the acoustic model are usually required.
  • because acoustic models have a high iteration cost and a long iteration cycle, they usually cannot keep up with the pace at which accuracy requirements change.
  • a combination of language model and acoustic model can be used to complete the speech recognition task.
  • the advantages of the language model's massive training data and fast update and iteration speed can be used to make up for the shortcomings of the acoustic model and meet the business's demand for high accuracy in speech recognition.
  • the language model can be, for example, a neural network language model (Neural Network Language Model, NNLM).
  • NNLM is essentially a sequence model.
  • the input is a text sequence including the text fragments predicted in the previous cycle, and the output is the probability distribution for multiple predetermined text fragments obtained in the current cycle.
  • the predetermined text segment with the largest probability value can be used as the text segment predicted by the current cycle according to the probability distribution.
  • the acoustic model may be an attention-based acoustic model.
  • Each text segment can be text of any granularity, such as a character, a word, a syllable, or a phrase.
  • decoding algorithms that rely on a language model and an attention-based acoustic model can fuse the probability distribution output by the acoustic model in a single step with the probability distribution output by the NNLM in a single step, and apply beam search to the fusion result to obtain the candidate paths selected in a single decoding step. For example, assuming that there are N predetermined text segments and the beam width used in beam search is 3, the first decoding step can select the 3 segments with the highest probability values from the N predetermined text segments as candidate text segments.
  • Each subsequent decoding step can select the 3 paths with the highest total probability from the 3*N paths as candidate paths, until all selected candidate paths include the end-of-text identifier <EOS>, or until the length of the text segments in the selected candidate paths reaches the length threshold.
  • the path may be represented by a fragment sequence obtained from the first decoding to the current decoding, in which the fragments are arranged in the order of generation.
  • the total probability value of the path may be the product of the probability values of each segment in the segment sequence, or the sum of the logarithms of the probability values of each segment in the segment sequence.
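  • as a minimal sketch of the two path scores described above, the following compares the product form with the sum-of-logarithms form. The function names and the per-segment probabilities are illustrative assumptions, not values from the disclosure.

```python
import math

def path_score_product(probs):
    """Total probability of a path as the product of its segment probabilities."""
    score = 1.0
    for p in probs:
        score *= p
    return score

def path_score_logsum(probs):
    """Total score of a path as the sum of the logarithms of its segment probabilities."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-segment probabilities for two candidate paths.
path_a = [0.5, 0.4, 0.9]
path_b = [0.6, 0.2, 0.8]

prod_a, prod_b = path_score_product(path_a), path_score_product(path_b)
log_a, log_b = path_score_logsum(path_a), path_score_logsum(path_b)

# Both forms rank the two paths in the same order, because log is monotonic.
same_ranking = (prod_a > prod_b) == (log_a > log_b)
```

  • because the logarithm is monotonic, selecting the highest-scoring paths gives the same result under either form.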
  • although combining the language model and the acoustic model can improve recognition accuracy to a certain extent, in this method the expansion of the decoding path is guided by the probability distribution output by the language model.
  • for closed-set recognition tasks, there is therefore no guarantee that the finally recognized text belongs to the text set defined for the closed-set recognition task, which affects the implementation of downstream tasks (such as searching based on the recognized text, voice response, etc.). That is, this method still suffers from low recognition accuracy and poor completion of the recognition task.
  • the present disclosure provides a speech recognition method and device that improves speech recognition accuracy and makes the recognition results consistent with the recognition task.
  • the application scenarios of the method and device provided by the present disclosure will be described below with reference to FIG. 1 .
  • Figure 1 is a schematic diagram of an application scenario of a speech recognition method and device according to an embodiment of the present disclosure.
  • the application scenario 100 of this embodiment may include an electronic device 110.
  • the electronic device 110 may be any of various electronic devices with processing functions, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, smart watches, and smart speakers.
  • the electronic device 110 may, for example, process the obtained voice data 120, for example, may perform voice recognition on the voice data 120 to convert the voice data 120 into text 130.
  • the voice data 120 may be data obtained by processing the collected voice.
  • the collected voice can be the user's voice collected by an audio collector such as a microphone.
  • the electronic device 110 may be provided with an audio collector and may have installed client applications with voice recognition functions, such as input methods, browsers, smart speaker apps, and in-vehicle apps (examples only). The electronic device 110 can convert voice data into input characters through voice recognition for information queries, smart speaker remote control, vehicle remote control, etc.
  • the electronic device 110 may use the end-to-end model 140 to complete the speech recognition task.
  • the end-to-end model 140 may include, for example, the language model and acoustic model described above, and the end-to-end model 140 may use beam search to obtain the text 130 .
  • the end-to-end model 140 may be the end-to-end streaming attention model described above.
  • the electronic device 110 may also use the speech recognition method described below to complete speech recognition tasks; this disclosure does not limit this.
  • the application scenario 100 may also include a server 150 .
  • the server 150 may be, for example, a background management server that supports the execution of client applications in the electronic device 110 .
  • the electronic device 110 may be communicatively connected to the server 150 through a network, which may include wired or wireless communication links.
  • the server 150 can train a language model based on massive text samples and train an acoustic model based on speech-text pairs.
  • the server 150 can combine the trained language model and acoustic model into an end-to-end model 140, and fine-tune the end-to-end model 140 based on specific scenarios.
  • the server 150 may respond to the acquisition request sent by the electronic device 110 and send the fine-tuned end-to-end model 140 to the electronic device 110 so that the electronic device 110 can use the end-to-end model 140 to complete the speech recognition task.
  • the electronic device 110 can send the obtained voice data 120 to the server 150 , and the server 150 performs speech recognition on the voice data 120 according to the end-to-end model 140 to obtain the text 130 .
  • the speech recognition method provided by the present disclosure can be executed by the electronic device 110 or by the server 150 .
  • the speech recognition device provided by the present disclosure can be provided in the electronic device 110 or in the server 150 .
  • Figure 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure.
  • the speech recognition method 200 of this embodiment may include operations S210 to S250.
  • an acoustic model is used to process the speech data to be recognized and the recognized first text segment to obtain respective acoustic probabilities of multiple candidate text segments.
  • the acoustic model may be a model composed of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM), or a model composed of a Deep Neural Network (DNN) and an HMM.
  • the acoustic model may include an encoder and a decoder, for example.
  • the input of the encoder is the speech data to be recognized, and the output is the extracted acoustic features.
  • the input to the decoder includes the acoustic features and the embedded features of the recognized first text segment.
  • the output of the acoustic model is a probability distribution of multiple candidate text segments, and the probability distribution includes respective acoustic probabilities of the multiple candidate text segments.
  • in the initial stage of speech recognition, the recognized first text segment may be the text start identifier <SOS>.
  • in subsequent stages, the recognized first text segment may be a text segment sequence composed of the text start identifier <SOS> and the text segments recognized so far.
  • the multiple candidate text segments may be, for example, multiple characters in the character library.
  • the characters included in the character library can be set according to actual needs, and this disclosure does not limit this.
  • the first language sub-model in the language model is used to process the first text fragment to obtain initial language probabilities of each of the plurality of candidate text fragments.
  • the constraint sub-model in the language model is used to process the first text fragment to obtain the expandable relationship of each of the multiple candidate text fragments with respect to the first text fragment.
  • the initial language probabilities of the candidate text segments are adjusted according to the extensible relationship to obtain the first language probabilities of each of the multiple candidate text segments.
  • the language model may adopt the NNLM described above or the N-gram model.
  • the first text segment can be input into the language model, and the language model outputs a probability distribution of multiple candidate text segments, where the probability distribution includes the first language probabilities of each of the multiple candidate text segments.
  • the language model may include, for example, a first language sub-model and a constraint sub-model.
  • the first language sub-model and the constraint sub-model can be set up side by side, and the first language sub-model can use the NNLM described above.
  • the structure of the constrained submodel is similar to that of NNLM.
  • the inputs of the first language sub-model and the constraint sub-model can both be embedded features of the first text fragment.
  • the network structures of the two sub-models can be similar.
  • the main difference is that the first language sub-model processes the first text fragment to obtain a probability distribution, whereas the constraint sub-model processes the first text fragment to obtain a vector representing the expandable relationship.
  • the probability distribution obtained by the first language sub-model includes the language probabilities of multiple candidate text segments, and the language probabilities can be used as the initial language probabilities.
  • the vector characterizing the expandable relationship includes a plurality of elements, each element representing the expandable relationship of an alternative text fragment with respect to the first text fragment.
  • having an extensible relationship means that the alternative text fragment can be a fragment after the first text fragment.
  • the value of each element in the plurality of elements is 0 or 1, 0 indicates that there is no extensible relationship, and 1 indicates that there is an extensible relationship.
  • the initial language probability of each candidate text segment may be adjusted based on the expandable relationship. For example, the first language probability of each candidate text segment can be obtained by multiplying the value of the element representing the expandable relationship of that candidate text segment with respect to the first text segment by the initial language probability of that candidate text segment. Alternatively, the logarithm of the element value and the logarithm of the initial language probability can be taken, and the two logarithms added together to give the first language probability of each candidate text segment.
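  • the two adjustment rules above can be sketched as follows. The function names and the sample values are illustrative assumptions; the treatment of log(0) as negative infinity is also an assumption of this sketch, since the text does not say how an element value of 0 is handled in the logarithmic form.

```python
import math

def adjust_linear(initial_probs, expandable):
    """Multiply each initial language probability by its 0/1 element value."""
    return [p * e for p, e in zip(initial_probs, expandable)]

def adjust_log(initial_probs, expandable):
    """Add log(element) to log(probability).

    Assumption of this sketch: log(0) is represented as -inf, so a candidate
    without an expandable relationship can never be selected.
    """
    def safe_log(x):
        return math.log(x) if x > 0 else float("-inf")
    return [safe_log(p) + safe_log(e) for p, e in zip(initial_probs, expandable)]

# Hypothetical initial language probabilities for three candidates and the
# 0/1 vector output by the constraint sub-model.
initial = [0.5, 0.3, 0.2]
mask = [1, 0, 1]

first_lang_probs = adjust_linear(initial, mask)
first_lang_logs = adjust_log(initial, mask)
```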
  • the constrained sub-model may be trained based on text in a predetermined text set.
  • the predetermined text set may be a text set set for a closed set recognition task, and the closed set recognition task may be set according to actual needs.
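  • to illustrate what an expandable relationship with respect to a predetermined text set looks like, the sketch below computes it by exact prefix lookup. This lookup is a stand-in chosen for illustration only: in the disclosure the relationship is predicted by a trained constraint sub-model, not looked up. The text set, the candidate list, and all names are hypothetical.

```python
def build_prefixes(text_set):
    """Collect every prefix (as a tuple of segments) of every text in the set."""
    prefixes = set()
    for text in text_set:
        for i in range(len(text) + 1):
            prefixes.add(tuple(text[:i]))
    return prefixes

def expandable_vector(first_segment, candidates, prefixes):
    """Return 1 for each candidate that can follow first_segment, else 0."""
    return [1 if tuple(first_segment) + (c,) in prefixes else 0 for c in candidates]

# Hypothetical closed set of texts, each already split into segments.
TEXT_SET = [["play", "music"], ["play", "news"], ["stop", "music"]]
CANDIDATES = ["play", "stop", "music", "news"]
PREFIXES = build_prefixes(TEXT_SET)

after_start = expandable_vector([], CANDIDATES, PREFIXES)
after_play = expandable_vector(["play"], CANDIDATES, PREFIXES)
```

  • here only "play" and "stop" can start a text, and only "music" and "news" can follow "play", which is the kind of 0/1 vector the constraint sub-model is trained to output.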
  • a target text segment among the plurality of candidate text segments is determined according to the first language probability and the acoustic probability to obtain a text sequence for the speech data to be recognized.
  • the first language probability and the acoustic probability may be added or multiplied, and the value obtained by the addition or multiplication is used as the probability value of each candidate text segment.
  • this embodiment may select the text segment with the largest probability value as the target text segment.
  • the target text segment may be added to the recognized first text segment, and operations S210 to S250 may be repeated until the selected text segment with the highest probability value is the end-of-text identifier <EOS>, or until the total number of text segments, counting the selected segment and those in the first text segment, reaches a predetermined number.
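  • the loop described above, which selects the single most probable segment at each step, can be sketched as follows. The two scoring functions are toy stand-ins for the acoustic model and the adjusted language model, and their probability tables are invented for illustration; multiplication is used as the fusion rule, one of the two options the text allows.

```python
SOS, EOS = "<SOS>", "<EOS>"
CANDIDATES = ["hi", "there", EOS]

def acoustic_probs(prefix):
    """Toy stand-in for the acoustic model; depends only on the prefix length."""
    table = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]
    return table[min(len(prefix) - 1, 2)]

def first_language_probs(prefix):
    """Toy stand-in for the adjusted (first) language probabilities."""
    table = [[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
    return table[min(len(prefix) - 1, 2)]

def greedy_decode(max_len=10):
    """Append the highest-scoring candidate until <EOS> or max_len segments."""
    prefix = [SOS]
    while len(prefix) - 1 < max_len:
        fused = [a * l for a, l in zip(acoustic_probs(prefix), first_language_probs(prefix))]
        best = CANDIDATES[fused.index(max(fused))]
        prefix.append(best)
        if best == EOS:
            break
    return prefix[1:]  # drop the start identifier

result = greedy_decode()
```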
  • a beam search method can also be used: the segments at the last position of a predetermined number (for example, M) of paths with the highest total probability values are determined to be target text segments. Each target text segment is then added to the first text segment to obtain M adjusted text segments. Each adjusted text segment is then regarded as a first text segment, and the method returns to operations S210 to S240 to obtain a total of M*N candidate paths. The M paths with the highest total probability values are then selected from the M*N candidate paths. This is repeated until all selected candidate paths include the end-of-text identifier <EOS>, or until the length of the text segments in the selected candidate paths reaches the length threshold. Finally, the text segments on the candidate path with the highest total probability constitute the text sequence of the speech data to be recognized.
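  • a minimal beam search over fused log probabilities might look like the sketch below. The scoring function is a toy stand-in with invented probabilities, and carrying finished paths over unchanged is a design choice of this sketch rather than something the text specifies.

```python
import math

SOS, EOS = "<SOS>", "<EOS>"
CANDIDATES = ["a", "b", EOS]

def fused_log_probs(prefix):
    """Toy stand-in returning the fused log probability of each candidate."""
    if len(prefix) == 1:
        probs = [0.5, 0.4, 0.1]
    elif prefix[-1] == "a":
        probs = [0.1, 0.3, 0.6]
    else:
        probs = [0.05, 0.05, 0.9]
    return [math.log(p) for p in probs]

def beam_search(beam_width=2, max_len=5):
    """Keep the beam_width best paths per step; stop when all end in <EOS>."""
    beams = [([SOS], 0.0)]
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            if prefix[-1] == EOS:
                # Assumption of this sketch: finished paths are kept as they are.
                expanded.append((prefix, score))
                continue
            for cand, lp in zip(CANDIDATES, fused_log_probs(prefix)):
                expanded.append((prefix + [cand], score + lp))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(prefix[-1] == EOS for prefix, _ in beams):
            break
    return beams[0][0][1:]  # best path without the start identifier

best_path = beam_search()
greedy_path = beam_search(beam_width=1)
```

  • with these toy numbers a beam of width 2 returns a path that starts with the second-best first segment, which a width of 1 (greedy selection) would have discarded.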
  • the embodiments of the present disclosure set up, in the language model, a constraint sub-model for predicting the expandable relationship between each candidate text segment and the first text segment, and adjust the predicted initial language probability according to the expandable relationship, so that the expandable relationship and the initial language probability together guide the expansion of decoding paths. In this way, where the constraint sub-model is a neural network model that learns, from the text set defined for the closed-set recognition task, the expandable relationships among the multiple candidate text segments, the guidance of the expandable relationship can make the recognized text a text in the text set defined for the closed-set recognition task. This can improve recognition accuracy, improve the completion of the recognition task, and facilitate the implementation of downstream tasks.
  • FIG. 3 is a schematic diagram of the principle of obtaining the first language probabilities of multiple candidate text segments according to the first embodiment of the present disclosure.
  • a vertical category identifier can also be added to the input of the language model.
  • the language model can be used to guide different paths for texts of different vertical categories. This allows the language model of the present disclosure to be used to predict text of a variety of different vertical categories, which is beneficial to improving the robustness of the speech recognition method of the present disclosure.
  • the first text segment 301 may be processed first to obtain the text embedding feature 302 of the first text segment 301 .
  • the word2vec method or the Global Vectors for Word Representation (GloVe) method can be used to process the first text fragment 301.
  • this embodiment can also determine the vertical category 303 to which the first text fragment belongs, and the first identification feature 304 of the vertical category 303 to which it belongs. It can be understood that the vertical category 303 to which the first text segment belongs can be determined, for example, in response to a user operation, or, in the initial stage of speech recognition, multiple predetermined vertical categories can be used as the vertical category 303 to which the first text segment belongs, For each predetermined vertical category, a probability distribution is obtained. As the path is expanded, the predetermined vertical category corresponding to the selected path can be used as the vertical category to which the recognized first text fragment belongs.
  • This embodiment may assign an identifier to each predetermined vertical category in a plurality of predetermined vertical categories. This embodiment may encode the identifier of the vertical category to obtain the first identification feature of the vertical category.
  • this embodiment may first fuse the text embedding feature 302 and the first identification feature 304.
  • the fused features are then input into the first language sub-model 320, and after being processed by the first language sub-model 320, the language probability distribution 305 can be obtained.
  • the language probability distribution 305 includes initial language probabilities for a plurality of predetermined text segments.
  • the two can be fused by concatenating the text embedding feature 302 and the first identification feature 304.
  • the text embedding feature 302 and the first identification feature 304 can be set to have the same dimension.
  • the adder 310 can be used to add the text embedding feature 302 and the first identification feature 304 to achieve the fusion of the two. It can be understood that the above fusion method is only used as an example to facilitate understanding of the present disclosure, and the present disclosure does not limit this.
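  • the two fusion options mentioned above, concatenation and element-wise addition, can be sketched as follows. The feature values and function names are illustrative assumptions.

```python
def fuse_by_concatenation(text_embedding, id_feature):
    """Concatenate the two feature vectors; their dimensions may differ."""
    return list(text_embedding) + list(id_feature)

def fuse_by_addition(text_embedding, id_feature):
    """Add the two feature vectors element-wise; dimensions must match."""
    if len(text_embedding) != len(id_feature):
        raise ValueError("addition requires features of the same dimension")
    return [t + i for t, i in zip(text_embedding, id_feature)]

# Hypothetical text embedding feature and vertical-category identification feature.
text_embedding = [0.25, -0.5, 1.0]
id_feature = [1.0, 0.0, 0.5]

concatenated = fuse_by_concatenation(text_embedding, id_feature)
added = fuse_by_addition(text_embedding, id_feature)
```

  • addition keeps the input dimension of the first language sub-model unchanged, while concatenation enlarges it by the dimension of the identification feature.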
  • the first language sub-model 320 may adopt the NNLM model.
  • the first language sub-model 320 may include an input layer, a hidden layer and an output layer connected in sequence, where the input layer may be used to convert text into embedded features.
  • the input layer may include the functions described above: processing the first text fragment to obtain the text embedding feature, obtaining the first identification feature according to the vertical category, and fusing the text embedding feature and the first identification feature.
  • the hidden layer can be a fully connected layer, or a network structure composed of a sequence network and a fully connected layer, to facilitate learning the contextual information between multiple data in the input sequence.
  • the sequence network may include a network based on an attention mechanism (such as Transformer) or a long short-term memory network LSTM, etc., and this disclosure is not limited to this.
  • the output layer can include a logistic regression network such as softmax.
  • FIG. 4 is a schematic diagram of the principle of obtaining the first language probabilities of multiple candidate text segments according to the second embodiment of the present disclosure.
  • a general language model branch can be set, and the language model branch can be obtained by using text training of multiple vertical categories.
  • this embodiment can combine the two by sharing the parameters of the general language model branch with the vertical-category language model, while adding some additional parameters to the vertical-category language model to separately strengthen learning for the vertical category. That is, two branches are set up in the language model: a general language model branch and a language model branch for vertical categories. In this way, on the basis of optimizing the recognition rate of the language model for multiple vertical categories, the size of the model can be kept small, thereby reducing the computing power required to run the model, which is beneficial to improving the robustness of the method in this embodiment.
  • the language model may include a first language sub-model 410 , a second language sub-model 420 and a constraint sub-model 430 arranged in parallel with the first language sub-model 410 .
  • the first language sub-model 410 and the constraint sub-model 430 constitute a language model branch for vertical categories.
  • this embodiment can input the text embedding feature 401 into the second language sub-model 420 to obtain the first implicit representation output by the hidden layer of the second language sub-model 420.
  • This embodiment can also fuse the text embedding feature 401 with the first identification feature 402 of the corresponding vertical category and then input to the first language sub-model 410, and fuse the second implicit representation output by the hidden layer of the first language sub-model 410 with the above-mentioned first implicit representation.
  • the fused features are then input into the output layer of the first language sub-model 410, and the output layer outputs a language probability distribution, thereby obtaining the initial language probabilities of each of the multiple candidate text segments.
  • This embodiment can also fuse the text embedding feature 401 with the first identification feature 402 and then input it into the constraint sub-model 430, and the constraint sub-model 430 outputs a vector representing the extensible relationship.
  • the vector and the initial language probability are input to the fusion layer 440, and the fusion layer 440 adjusts the initial language probability according to the vector representing the extensible relationship, thereby outputting the first language probability 403 of each of the multiple candidate text segments.
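  • the data flow described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: each hidden layer is reduced to a single linear map (the disclosure describes LSTM and fully connected layers), the features are fused by addition, the weights are made up, and the vector from the constraint sub-model is passed in as a given 0/1 list rather than predicted.

```python
import math

def linear(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def softmax(x):
    """Numerically stable softmax over a list of logits."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def first_language_probability(embedding, id_feature, mask, w_general, w_vertical, w_out):
    fused_in = [e + i for e, i in zip(embedding, id_feature)]
    h_general = linear(embedding, w_general)    # first implicit representation
    h_vertical = linear(fused_in, w_vertical)   # second implicit representation
    hidden = [g + v for g, v in zip(h_general, h_vertical)]
    initial = softmax(linear(hidden, w_out))    # initial language probabilities
    return [p * m for p, m in zip(initial, mask)]  # fusion layer applies the 0/1 vector

# Hypothetical inputs and weights for two-dimensional features and three candidates.
identity = [[1.0, 0.0], [0.0, 1.0]]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = first_language_probability(
    embedding=[1.0, 0.0], id_feature=[0.0, 1.0], mask=[1, 0, 1],
    w_general=identity, w_vertical=identity, w_out=w_out,
)
```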
  • the hidden layer in the first language sub-model can be used as the first feature extraction network, and the output layer can be used as the first prediction network.
  • the input of the first prediction network includes the feature obtained by fusing the second implicit representation with the first implicit representation (for example, using an adder), the output of the first prediction network is a probability distribution, and the vector representing the extensible relationship can be used to adjust the logarithms of the probability values in this probability distribution.
  • the multiplicative relationship between the values can be converted into the additive relationship between the logarithms of the values, thus ensuring calculation accuracy. This is because electronic devices usually have lower calculation accuracy for multiplication of floating point numbers, but higher accuracy for addition.
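The log-domain adjustment described above can be sketched as follows. This is a minimal illustration, not the fusion layer of the disclosure: it assumes that the vector representing the extensible relationship holds one multiplicative factor per candidate, and the function name, the `floor` parameter and the sample values are hypothetical.

```python
import math

def fuse_in_log_domain(initial_probs, constraint_factors, floor=1e-30):
    """Adjust each probability by a multiplicative factor using log addition.

    initial_probs: initial language probabilities of the candidate segments.
    constraint_factors: hypothetical per-candidate factors standing in for
        the vector that represents the extensible relationship.
    """
    fused = []
    for p, c in zip(initial_probs, constraint_factors):
        # log(p * c) == log(p) + log(c); the floor avoids log(0).
        fused.append(math.log(max(p, floor)) + math.log(max(c, floor)))
    return fused

# The second candidate is down-weighted by the (assumed) constraint factor 0.1.
log_scores = fuse_in_log_domain([0.5, 0.3, 0.2], [1.0, 0.1, 1.0])
```

The returned values are log probabilities, so they can be added directly to the logarithm of an acoustic probability in later steps.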
  • the first language sub-model may include an LSTM layer, an adder, a fully connected layer and a logistic regression layer (softmax).
  • the adder can be set between the fully connected layer and the logistic regression layer.
  • the LSTM and the fully connected layer constitute the first feature extraction network
  • the softmax layer constitutes the first prediction network.
  • the adder is not only provided between the fully connected layer and the logistic regression layer, but also between the LSTM layer and the fully connected layer.
  • the LSTM layer, the adder arranged between the LSTM layer and the fully connected layer, and the fully connected layer can constitute the first feature extraction network 411, and the adder arranged between the fully connected layer and the logistic regression layer, together with the logistic regression layer, can constitute the first prediction network 412.
  • the adder between the LSTM layer and the fully connected layer is used to fuse the first implicit representation and the features output by the LSTM layer, and the adder between the fully connected layer and the logistic regression layer is used to fuse the first implicit representation and the second implicit representation.
  • In this way, the first implicit representation can be fully fused with the features in the first language sub-model, the sharing of network parameters between the first language sub-model and the second language sub-model can be enhanced, and the accuracy of the obtained first language probability, and hence the speech recognition accuracy, can be improved.
  • the second language sub-model 420 may include an LSTM layer, a fully connected layer and a softmax layer.
  • the LSTM layer and the fully connected layer constitute the second feature extraction network 421 of the second language sub-model
  • the softmax layer constitutes the second prediction network 422 of the second language sub-model.
  • After the text embedding feature 401 of the first text fragment is input into the second feature extraction network 421 to obtain the first implicit representation, this embodiment can input that implicit representation into the second prediction network 422, and the second prediction network 422 outputs another probability distribution, thereby obtaining the second language probability 404 of each of the multiple candidate text segments.
  • this embodiment may determine the target text segment based on the first language probability 403, the second language probability 404, and the acoustic probability. Specifically, the first language probability 403 and the second language probability 404 may be added to the acoustic probability respectively. If the number of predetermined text segments is set to N, a total of 2*N added probability values will be obtained. Subsequently, M larger probability values are selected from the 2*N added probability values to obtain the candidate paths obtained by the current decoding. In this way, the method of the embodiment of the present disclosure can be applied not only in multiple vertical scenarios, but also in general speech recognition scenarios, thereby improving the robustness of the method in this embodiment.
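The selection of M paths from the 2*N summed values can be sketched as follows. This is only an illustration under stated assumptions: all inputs are taken to be log probabilities, and the function name, the branch labels "vertical" and "general" and the sample numbers are hypothetical.

```python
import heapq

def select_paths(first_lang_logp, second_lang_logp, acoustic_logp, m):
    """Return the m best (score, branch, candidate_index) tuples out of 2*N."""
    scored = []
    for i, a in enumerate(acoustic_logp):
        # Each candidate is scored once per language-model branch.
        scored.append((first_lang_logp[i] + a, "vertical", i))
        scored.append((second_lang_logp[i] + a, "general", i))
    return heapq.nlargest(m, scored)

# N = 2 candidates give 2*N = 4 summed values; keep the M = 2 largest.
best = select_paths([-1.0, -3.0], [-2.0, -0.5], [-0.2, -0.4], m=2)
```

Keeping the branch label with each score makes it possible to tell whether a retained path came from the vertical-category branch or from the general branch.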
  • Figure 5 is a schematic structural diagram of a language model according to an embodiment of the present disclosure.
  • a third language sub-model parallel to the first language sub-model can be set in the language model for learning the relationship between speech data and text from different sources.
  • the language probability obtained by the third language sub-model and the language probability obtained by the language model branch for the vertical category can be screened as parallel options.
  • the language model of this embodiment can be applied to different vertical categories in different scenarios, and there is no need to conduct separate training for different vertical categories and different scenarios, which can improve the robustness of the model and reduce the training cost of the model.
  • the language model may include a first language sub-model 510, a second language sub-model 520, a constraint sub-model 530 and a third language sub-model 540.
  • the first language sub-model 510, the second language sub-model 520 and the constraint sub-model 530 are similar to the corresponding models in Figure 4 described above, and will not be described again here.
  • the third language sub-model 540 is similar to the first language sub-model 510. The difference is that the input of the third language sub-model is the feature obtained by fusing the second identification feature 503, which characterizes the source of the speech data to be recognized, with the text embedding feature 501.
  • the second identification feature 503 that characterizes the source of the speech data to be recognized can also be determined.
  • training data can be provided when the user determines that speech recognition is poor.
  • the method of this embodiment can assign an identifier to the user and train the third language sub-model based on the training data provided by the user.
  • the user can be determined according to the source of the speech to be recognized, and the second identification feature is obtained by encoding the identifier assigned to the determined user. It is understandable that users can use various client applications with speech recognition functions.
  • the second identification feature can also be obtained by encoding the name of the client application, etc., and this disclosure does not limit this.
  • the embodiment 500 can use the third language sub-model 540 to process the fused features of the text embedding feature 501 and the second identification feature 503. Based on a principle similar to the principle in which the first language sub-model obtains the initial language probability, the third language sub-model 540 can output a probability distribution. By taking the logarithm of the probability value in the probability distribution, the third language probabilities 506 of each of the multiple candidate text segments can be obtained.
  • the third language sub-model 540 may include a third feature extraction network and a third prediction network.
  • the fused features of the text embedding feature 501 and the second identification feature 503 can be input to the third feature extraction network 541 to obtain the third implicit representation.
  • the features obtained by fusing the first implicit representation and the third implicit representation are input to the third prediction network 542, and the third prediction network 542 outputs a probability distribution.
  • the third language probabilities 506 of each of the multiple candidate text segments can be obtained.
  • this embodiment can determine the target text segment based on the third language probability 506, the first language probability 504 and the acoustic probability. This principle is similar to the above-mentioned principle of determining the target text segment based on the first language probability, the second language probability and the acoustic probability, and will not be described again here.
  • the embodiment 500 can determine the target text fragment based on the first language probability 504, the second language probability 505, the third language probability 506 and the acoustic probability.
  • This principle is similar to the above-mentioned principle of determining the target text segment based on the first language probability, the second language probability and the acoustic probability, and will not be described again here.
  • the language model is a sequence model.
  • the initial input of the first language sub-model in the language model includes P features, and the P features are obtained by adding the embedding feature of the text start identifier <SOS> to the identification features of P predetermined vertical categories, respectively.
  • the initial input of the second language sub-model is the embedding feature of the text start identifier <SOS>.
  • the initial input of the third language sub-model is the feature obtained by adding the embedding feature of the text start identifier <SOS> and the second identification feature characterizing the source of the speech to be recognized.
  • (P+2)*N probability values can be obtained, corresponding to (P+2)*N expansion paths.
  • the identified first text fragment includes M text fragments, which are obtained by combining the text start identifier <SOS> with the text fragments corresponding to the M paths with higher total probability values.
  • the M text fragments are input into the second language sub-model respectively, and M*N expansion paths are obtained.
  • the M text fragments are respectively fused with the identification features of the vertical categories corresponding to the M paths with higher probability values and then input into the first language sub-model to obtain M*N expansion paths.
  • the M text fragments are respectively fused with the second identification feature and then input into the third language sub-model to obtain M*N expansion paths, so that a total of 3M*N expansion paths are obtained. Subsequently, the M paths with higher total probability values are selected from the 3M*N expansion paths, and so on through multiple rounds of decoding, until the M paths obtained by screening all include the end-of-text identifier <EOS>, or until the lengths of the text fragments in the M screened paths all reach the length threshold. Finally, the text sequence corresponding to the path with the highest total probability value is used as the text sequence recognized from the speech data to be recognized. It can be understood that in the i-th decoding, the number of text fragments included in each screened path is (i+1), including the text start identifier <SOS>.
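The decoding loop described above can be sketched as a small beam search over several scoring branches. This is a toy illustration, not the decoder of the disclosure: each branch is assumed to be a function returning combined log scores for the next fragment, the acoustic score is assumed to be folded into those values, and the names `beam_search`, `general` and `vertical` and all numbers are hypothetical.

```python
import heapq

SOS, EOS = "<SOS>", "<EOS>"

def beam_search(branches, beam_m, max_len):
    """branches: functions mapping a prefix tuple to {token: log_score}."""
    beams = [(0.0, (SOS,))]
    for _ in range(max_len):
        expansions = []
        for score, prefix in beams:
            if prefix[-1] == EOS:
                # Finished paths are carried over unchanged.
                expansions.append((score, prefix))
                continue
            for branch in branches:
                # Every branch contributes its own set of expansion paths.
                for token, logp in branch(prefix).items():
                    expansions.append((score + logp, prefix + (token,)))
        beams = heapq.nlargest(beam_m, expansions)
        if all(p[-1] == EOS for _, p in beams):
            break
    return max(beams)[1]

def general(prefix):
    # Toy stand-in for the general language model branch.
    return {"play": -0.4, EOS: -1.5} if prefix[-1] == SOS else {EOS: -0.1}

def vertical(prefix):
    # Toy stand-in for the vertical-category branch.
    return {"play": -0.2, "pause": -2.0} if prefix[-1] == SOS else {EOS: -0.3}

result = beam_search([general, vertical], beam_m=2, max_len=5)
```

The loop stops either when every retained path ends with `<EOS>` or when `max_len` rounds have run, mirroring the two stopping conditions in the text.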
  • a predetermined text list may be set according to a text set set for the closed set recognition task.
  • the target text segment is selected from a plurality of candidate text segments according to the predetermined text list.
  • the text composed of the recognized text sequence can be made to belong to the text set set for the closed set recognition task, so that the method of this embodiment has the ability to forcibly recognize a certain text in the closed set.
  • For example, if the closed set recognition task is a smart speaker speech recognition task, the method of this embodiment can ensure that the song titles, singer names, etc. included in the recognized text sequence are existing song titles and singer names, which helps play music that meets the actual needs of users based on the recognition results.
  • the plurality of alternative text fragments may include, for example, multiple first alternative fragments indicating alternative words.
  • the alternative words may be set according to actual needs, and this disclosure does not limit this.
  • the predetermined text list may be first queried based on the first text segment, and the first specified segment among the plurality of first candidate segments may be determined based on the query result.
  • the predetermined text list can be queried to determine that the text including the first text fragment in the predetermined text list is used as the first text. For example, if the text set includes the text "Please play song a by singer A" and the first text segment is "Please play", then the text "Please play song a by singer A" can be determined to be the first text.
  • this embodiment may use the word that immediately follows the first text fragment in the first text as the first designated fragment. That is, the text formed by splicing the first text fragment and the first designated fragment belongs to the predetermined text list.
  • this embodiment may determine the target text segment among the plurality of first candidate segments based on the first language probability and acoustic probability of the first designated segment. For example, this embodiment may add the logarithm of the first language probability of the first designated segment to the logarithm of its acoustic probability, and use the resulting value as the probability value of the first designated segment with respect to the first text fragment. When there is only one first text fragment, this embodiment may use the M first designated segments with larger probability values with respect to the first text fragment as target text segments.
  • this embodiment can use the first designated segment among the M texts with the highest probability value as the target text segment.
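The query of the predetermined text list can be sketched as follows. This is a simplified illustration: it assumes the texts are split into words on whitespace and that the first text fragment is matched as a prefix of each listed text, and the function name and sample texts are hypothetical.

```python
def designated_fragments(prefix_words, candidates, text_list):
    """Return candidate words that extend the prefix inside some listed text."""
    n = len(prefix_words)
    allowed = set()
    for text in text_list:
        words = text.split()
        # Keep the word right after the prefix if it is one of the candidates.
        if words[:n] == prefix_words and len(words) > n and words[n] in candidates:
            allowed.add(words[n])
    return allowed

texts = ["please play song a by singer A", "please pause the music"]
allowed = designated_fragments(["please"], {"play", "stop", "pause"}, texts)
```

Here the candidate "stop" is rejected because no listed text continues the prefix with it, so only listed continuations remain eligible as target text segments.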
  • a recognition weight can be set for each text in the predetermined text list, and the recognition weight can be determined according to the difficulty of recognition.
  • the recognition weight can be positively related to the difficulty of recognition.
  • the recognition weight can also be used to filter alternative text fragments, which will help the speech recognition method identify texts with high recognition difficulty and strengthen the speech recognition method's ability to recognize difficult text. It can be understood that, for example, the identification weight can be set and modified according to actual needs, and this disclosure does not limit this.
  • FIG. 6 is a schematic diagram of the principle of determining a target text segment according to the first embodiment of the present disclosure.
  • When determining the target text fragment, this embodiment 600 can first query the predetermined text list 602 based on the first text fragment 601, and determine the text in the predetermined text list 602 that includes the first text fragment 601 as the first text 603.
  • a text fragment that belongs to multiple first candidate fragments and is located after the first text fragment 601 in the first text 603 can be used as the first designated fragment 604.
  • This embodiment can then determine the text spliced by the first text fragment 601 and the first specified fragment 604 as the spliced text 605 , and use the part of the first text 603 that includes the spliced text as the first target text 606 . Finally, this embodiment may determine the target text segment based on the recognition weight of the first target text 606 , the first language probability of the first specified segment 604 , and the acoustic probability of the first specified segment 604 .
  • the recognition weight of the first target text 606, the logarithm of the first language probability of the first designated segment 604, and the logarithm of the acoustic probability of the first designated segment 604 may be added together as the probability value of the first designated segment 604 with respect to the first text fragment 601, and the target text fragment is then filtered out from the determined first designated segments 604 based on the probability value.
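The effect of the recognition weight on this sum can be illustrated with a small numeric sketch. The weights, probabilities and candidate words below are invented for illustration, and the function name is hypothetical.

```python
import math

def weighted_score(recognition_weight, lang_prob, acoustic_prob):
    """Recognition weight plus the logs of the language and acoustic probabilities."""
    return recognition_weight + math.log(lang_prob) + math.log(acoustic_prob)

scores = {
    "play": weighted_score(0.0, 0.6, 0.5),
    "pray": weighted_score(1.0, 0.3, 0.5),  # harder text, larger (assumed) weight
}
target = max(scores, key=scores.get)
```

With these numbers the harder text wins despite its lower language probability, which is the behaviour a weight positively related to recognition difficulty is meant to produce.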
  • the predetermined text may be represented in the form of a template, entity text fragments, etc. in the predetermined text may be represented by slots, and the corresponding slots may be listed in the predetermined text list.
  • the entity class can include entities, which facilitates refined path management for speech recognition and helps improve speech recognition accuracy.
  • the entity class text fragments may include, for example, text fragments representing song names, singer names, point of interest names, etc. Each type of entity corresponds to one slot. For example, the slot corresponding to entities of the song name category is [song], the slot corresponding to entities of the singer name category is [singer], the slot corresponding to entities of the point of interest name category is [POI], etc.
  • this embodiment can predict the text sequence using a decoding method in which small pictures are nested within a large picture.
  • the large picture corresponds to the text template
  • the small picture corresponds to the slot.
  • this embodiment can predict the entity represented by the slot in combination with the identification feature of the slot, which is beneficial to improving the accuracy of the predicted target text fragments. This is because, by considering the identification features of slots, the language model can learn the mapping relationship between different slots and predicted text fragments.
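The relationship between text templates (the large picture) and slots (the small pictures) can be sketched with a simple data structure. This is an illustration only: the template, the entity lists and the regular-expression matching are assumptions made for the example and are not the decoding procedure of the disclosure.

```python
import re

# Large picture: text templates in which entities are replaced by slots.
TEMPLATES = ["play [song] by [singer]"]
# Small pictures: the entities that each slot may expand to (hypothetical data).
SLOT_ENTITIES = {
    "[song]": ["song a", "song b"],
    "[singer]": ["singer A"],
}

def slots_in(template):
    """List the slots of a template in order of appearance."""
    return re.findall(r"\[[A-Za-z]+\]", template)

def matches(template, text):
    """Check whether text is the template with each slot filled by a listed entity."""
    pattern = re.escape(template)
    for slot in slots_in(template):
        alternatives = "|".join(re.escape(e) for e in SLOT_ENTITIES[slot])
        pattern = pattern.replace(re.escape(slot), f"(?:{alternatives})")
    return re.fullmatch(pattern, text) is not None
```

For example, `matches(TEMPLATES[0], "play song a by singer A")` accepts the text, while a song name that is not listed for `[song]` is rejected.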
  • FIG. 7 is a schematic diagram of the principle of determining a target text segment according to the second embodiment of the present disclosure.
  • the plurality of alternative text fragments also include a plurality of second alternative fragments indicating alternative slots.
  • the alternative slots can be set according to actual needs.
  • the alternative slots can be set according to the categories of entities in the actual scene, and each category of entities corresponds to one alternative slot.
  • the slots corresponding to the entities of each category in the alternative slots can be understood as entry slots.
  • an exit slot can also be set to indicate completion of the prediction of an entity.
  • this embodiment can first determine, among the entry slots 701, the target slot 703 that belongs to the predetermined text list 702.
  • the text in the predetermined text list is composed of words and slots, and the corresponding position of the slot is the position of the entity in the predetermined text.
  • This embodiment can compare the slots constituting the text in the predetermined text list with the entry slot 701 to obtain the target slot 703.
  • this embodiment can use the language model 710 to process the features obtained according to the third identification feature 704 of the target slot 703 and the starting identifier <SOS> 705 of the text, to obtain the fourth language probabilities of the plurality of first candidate segments for the target slot.
  • the fourth language probability may represent the probability that each candidate word belongs to the segment in the target slot 703. This part is the process of jumping into the small picture for decoding.
  • the embedding feature of the starting identifier of the text is used to replace the text embedding feature of the first text fragment, and the third identification feature 704 of the target slot 703 is used to replace the first identification feature of the vertical category to which the first text fragment belongs.
  • this embodiment may first determine the third identification feature 704 of the target slot 703 , and the third identification feature 704 may be obtained by encoding the identifier assigned to the target slot 703 .
  • the start identifier <SOS> 705 can be encoded to obtain the start identifier encoding feature.
  • the third identification feature 704 is added to the start identifier encoding feature to obtain the feature derived from the third identification feature 704 of the target slot 703 and the start identifier <SOS> 705 of the text.
  • This feature can be used as input to the first language sub-model and the constraint sub-model in the language model 710. Using a principle similar to the above-mentioned principle of obtaining the first language probability, the fourth language probability 706 of each first candidate fragment for the target slot is obtained.
  • this embodiment may determine the target text segment among the first candidate segments based on the fourth language probability 706, the first language probability and the acoustic probability. For example, the number of target slots is set to Q. For each target slot, this embodiment can determine the probability of each of the plurality of first candidate fragments being the text fragment in that target slot, according to the fourth language probability obtained based on the third identification feature of that target slot and the first language probability of the second candidate fragment indicating that target slot.
  • the fourth language probability of each first candidate segment may be multiplied by the first language probability of the second candidate segment indicating each target slot, and the product used as the probability of that first candidate segment being the text fragment in that target slot.
  • N' probabilities can be obtained, and for Q target slots, a total of Q*N' probabilities can be obtained.
  • the Q*N' probabilities and the first language probabilities of the N' first candidate segments can be combined into a probability set, and the probability set includes a total of (Q+1)*N' probabilities.
  • This embodiment 700 may, for example, add the logarithmic values of (Q+1)*N' probabilities to the logarithmic values of the acoustic probabilities corresponding to the first candidate segments to obtain (Q+1)*N' extended probabilities.
  • M paths can be selected from the (Q+1)*N' paths corresponding to the (Q+1)*N' expansion probabilities, and the text fragment at the last position in each of the M paths can be used as the target text fragment.
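The construction of the (Q+1)*N' expansion probabilities can be sketched as follows. The dictionaries stand in for model outputs and are invented for illustration; the function name and the tuple layout `(score, slot, word)` are hypothetical, with `None` marking a word that is expanded outside any slot.

```python
import math

def expansion_scores(word_lm, slot_lm, in_slot_lm, acoustic):
    """Build the (Q+1)*N' log-domain expansion scores.

    word_lm[w]: first language probability of candidate word w.
    slot_lm[s]: first language probability of the fragment indicating slot s.
    in_slot_lm[s][w]: fourth language probability of w for slot s.
    acoustic[w]: acoustic probability of w.
    """
    scores = []
    for w, p in word_lm.items():
        scores.append((math.log(p) + math.log(acoustic[w]), None, w))
    for s, p_slot in slot_lm.items():
        for w, p_in in in_slot_lm[s].items():
            # Probability of w being the fragment inside slot s: product of the two.
            scores.append((math.log(p_in * p_slot) + math.log(acoustic[w]), s, w))
    return scores

scores = expansion_scores(
    word_lm={"a": 0.2, "b": 0.1},
    slot_lm={"[song]": 0.5},
    in_slot_lm={"[song]": {"a": 0.6, "b": 0.4}},
    acoustic={"a": 0.5, "b": 0.5},
)
# Sort on the score only, so tuples that mix None and str are never compared.
top_m = sorted(scores, key=lambda t: t[0], reverse=True)[:2]
```

With Q = 1 target slot and N' = 2 candidate words, the list holds (Q+1)*N' = 4 entries, from which the M best are kept.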
  • the target text fragment may be determined for the exit slot in a manner similar to that used for the entry slot.
  • the difference is that for the exit slot, among the features input to the language model 710, what replaces the text embedding feature of the first text fragment is the fourth identification feature of the slot corresponding to the text fragment at the last position in the first text fragment, while the first identification feature is still the identification feature of the vertical category to which the first text fragment belongs.
  • the second fusion feature can be obtained by fusing the fourth identification feature and the first identification feature.
  • the second fusion feature can be used as the input of the language model, and the language model processes it to obtain the fifth language probabilities of the plurality of first candidate fragments for the exit slot.
  • this embodiment can determine the target text segment among the plurality of first candidate segments based on the fifth language probability, the first language probability and the acoustic probability.
  • this embodiment can use the method described above to obtain a total of Q*N' probabilities for Q target slots.
  • This embodiment may also multiply the first language probability of the second candidate fragment indicating the exit slot by the fifth language probability of each first candidate fragment for the exit slot, and use the product as the probability that the first candidate fragment is the first text fragment after exiting the slot.
  • For the N' first candidate fragments a total of N' probabilities can be obtained.
  • the obtained Q*N' probabilities, the N' probabilities of the N' first candidate fragments being the first text fragment after exiting the slot, and the first language probabilities of the N' first candidate fragments can constitute a probability set, and the probability set includes (Q+2)*N' probabilities in total.
  • this embodiment may add the logarithmic values of the (Q+2)*N' probabilities to the logarithmic values of the acoustic probabilities corresponding to the first candidate segments, respectively, to obtain (Q+2)*N' extended probabilities.
  • M paths can be selected from the (Q+2)*N' paths corresponding to the (Q+2)*N' expansion probabilities, and the text fragment at the last position in each of the M paths can be used as the target text fragment.
  • a slot belonging to a predetermined text list among the entry slots may be used as an initial slot.
  • the first language probability of the second candidate segment indicating the initial slot is compared with the first language probabilities of the plurality of first candidate segments, and an initial slot indicated by a second candidate segment with a relatively large probability value is used as the target slot.
  • this embodiment may first determine a predetermined number of probabilities with larger values among the first language probabilities of the multiple first candidate segments, and then compare the first language probability of the second candidate segment indicating the initial slot with the minimum probability among the predetermined number of probabilities. If the first language probability of the second candidate segment indicating an initial slot is higher than the minimum probability, or is lower than the minimum probability and the absolute value of the difference from the minimum probability is less than or equal to a first predetermined threshold, the initial slot is determined to be the target slot.
  • this embodiment may compare the first language probability of the second candidate segment indicating the initial slot with the maximum probability among the first language probabilities of the plurality of first candidate segments, if the absolute value of the difference between the two is less than the second predetermined threshold, the initial slot is determined to be the target slot. It can be understood that the above method of determining the target slot based on differences is only used as an example to facilitate understanding of the present disclosure, and the present disclosure does not limit this.
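The first of the two difference-based rules above can be sketched as follows. The function name, parameter names and the sample probabilities are hypothetical, and the thresholds would in practice be tuned for the task.

```python
def is_target_slot(slot_prob, word_probs, top_k, threshold):
    """Compare a slot's probability with the smallest of the top_k word probabilities."""
    min_top = sorted(word_probs, reverse=True)[:top_k][-1]
    # Keep the slot if it beats the minimum, or falls short by at most the threshold.
    return slot_prob > min_top or (min_top - slot_prob) <= threshold

word_probs = [0.4, 0.3, 0.2, 0.1]
kept = [p for p in (0.30, 0.15, 0.05) if is_target_slot(p, word_probs, top_k=3, threshold=0.1)]
```

In this example the smallest of the top three word probabilities is 0.2, so slots with probabilities 0.30 and 0.15 are kept while the slot with probability 0.05 is eliminated.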
  • the embodiments of the present disclosure can further filter the entry slots and eliminate slots with a low probability of being expanded, thereby reducing the amount of calculation and improving the calculation efficiency of decoding target text fragments while ensuring prediction accuracy.
  • FIG. 8 is a schematic diagram of the principle of determining a target text segment according to the third embodiment of the present disclosure.
  • when jumping into a small picture for decoding, the target text fragment may, for example, also be filtered in combination with the recognition weight assigned to the text in the predetermined text list.
  • the speech recognition method can be used to identify texts that are difficult to recognize, and the speech recognition method's ability to recognize difficult texts can be strengthened.
  • For example, after the fourth language probability described above is obtained, or at any other time, the predetermined text list may be queried according to the first text fragment to obtain the second target text and the second specified fragment among the plurality of first candidate fragments.
  • the first text fragment can be spliced with the second candidate fragment indicating the corresponding slot of each first candidate fragment to obtain multiple spliced texts.
  • the predetermined text list is queried according to the spliced texts, the predetermined text including any one of the multiple spliced texts is determined to be the second target text, and the first candidate fragment corresponding to the slot-indicating fragment included in that spliced text is used as the second specified fragment.
  • this embodiment may use the second candidate fragment indicating the slot corresponding to the second specified fragment as the target candidate fragment.
  • this embodiment may determine the initial probability of the target candidate segment based on the recognition weight of the second target text and the first language probability of the target candidate segment.
  • the recognition weight of the second target text may be multiplied by the first language probability of the target candidate segment, and the product may be used as the initial probability.
  • the logarithmic value of the recognition weight of the second target text is added to the logarithmic value of the first language probability of the target candidate fragment to obtain the initial probability, which is not limited by this disclosure.
  • this embodiment can determine the probability that the second specified fragment is the first text fragment in the target slot based on the initial probability and the fourth language probability of the second specified fragment. For example, the initial probability can be added to the logarithm of the fourth language probability of the second specified fragment to obtain the probability of the second specified fragment being the first text fragment in the target slot. This probability can replace the corresponding probability among the Q*N' probabilities described above.
  • in this embodiment 800, decoding is set to use beam search to obtain the text sequence.
  • Set beam to M.
  • the number of first text fragments is M.
  • Set the number of candidate words to N', and the candidate slots include Q' entry slots and one exit slot.
  • this embodiment can use the acoustic model 810 to obtain N' acoustic probabilities 802.
  • using the language model 820, it is possible to obtain N' language probabilities corresponding to the N' candidate words, Q' entry probabilities corresponding to the Q' entry slots, and one exit probability corresponding to the exit slot, for a total of (N'+Q'+1) language probabilities 803.
  • this embodiment can query the predetermined text list 830 according to the text fragment 801 and obtain information 804. The information 804 can include the first target text and its recognition weight w1 described above, and the second target text and its recognition weight w2.
  • This embodiment can filter the text segments corresponding to the predicted language probabilities according to the information 804 obtained from the query, thereby obtaining the expandable word 805, the target slot 806 and the exit slot 807 described above.
  • the expandable word may be the first designated segment described above. When the probability of the exit slot 807 is much smaller than the probabilities of the target slot and the expandable word, the exit slot can also be eliminated.
  • the expansion probability of the expandable word 805 can be expressed as the sum of the logarithm of the acoustic probability of the expandable word, the logarithm of the language probability of the expandable word, and the recognition weight w1 of the first target text corresponding to the expandable word.
  • the initial expansion probability of the target slot 806 can be represented by the sum of the logarithm of the entry probability of the target slot 806 and the recognition weight w2 of the second target text corresponding to the target slot.
  • the initial expansion probability of the exit slot 807 is represented by the logarithm of its exit probability.
  • the expandable word 805 can be used as a candidate text segment, the candidate text segment can be spliced with the text segment 801, and the spliced text can be added to the first candidate pool 808 for the text segment 801.
  • this embodiment can use a method similar to that described above to input the embedding feature of the text start identifier and the identification feature of the target slot into the language model 820, and jump into the small picture to perform the decoding operation, thereby obtaining the fourth language probability described above.
  • this embodiment can use a method similar to that described above to input the identification feature of the vertical category to which the first text fragment belongs and the identification feature of the slot corresponding to the last text fragment in the first text fragment into the language model 820, and jump into the large picture to perform the decoding operation, thereby obtaining the fifth language probability described above.
  • this embodiment can query the predetermined text list, constrain the fourth language probability and the fifth language probability according to the text in the list, filter out text fragments belonging to the text in the predetermined text list, and combine the text fragment with the text fragment 801 After splicing, it is added to the first candidate pool 808.
  • M candidate pools can be obtained. This embodiment can select the M candidate text fragments with the largest total probability values from the M candidate pools as the M first text fragments in the next cycle, until all of the selected M candidate text fragments include the text end identifier <EOS>, or the number of text fragments in the M candidate text fragments reaches a predetermined number.
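  • The selection loop described above resembles a beam search step. The sketch below assumes each candidate is a pair of a total log-probability and a list of text fragments, and it interprets the length condition as every candidate reaching the predetermined number of fragments; both the data layout and that interpretation are assumptions, and the function names are hypothetical.

```python
import heapq
from typing import List, Tuple

EOS = "<EOS>"
Candidate = Tuple[float, List[str]]  # (total log-probability, text fragments)


def select_next_beam(pools: List[List[Candidate]], m: int) -> List[Candidate]:
    """Merge the candidate pools and keep the m candidates with the largest total score."""
    merged = [cand for pool in pools for cand in pool]
    return heapq.nlargest(m, merged, key=lambda cand: cand[0])


def decoding_finished(beam: List[Candidate], max_fragments: int) -> bool:
    """Stop when every candidate ends with EOS, or every candidate is long enough."""
    all_ended = all(frags and frags[-1] == EOS for _, frags in beam)
    all_long = all(len(frags) >= max_fragments for _, frags in beam)
    return all_ended or all_long
```

  • In each cycle, the beam returned by select_next_beam would serve as the M first text fragments of the next cycle.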
  • This embodiment can store the language probability obtained by processing the first target feature with the language model in a cache for subsequent calls when the number of times the language model has been used to process the first target feature reaches a predetermined number.
  • When the language model needs to process a second target feature, the cache can be queried first to determine whether it stores a language probability for the second target feature. If so, the language probability is read directly from the cache, completing the processing of the second target feature without performing the complex calculation of the language model.
  • The first target feature and the second target feature may each include any one of the following features: the text embedding feature of the first text fragment; the feature obtained by fusing the text embedding feature with the identification feature of the vertical category; the feature obtained by fusing the text embedding feature with the identification feature of the data source; and the feature obtained by fusing the text embedding feature with the identification feature of the slot. That is, the first target feature and the second target feature can be any feature that is input into a hidden layer of the language model as described above, and this disclosure is not limited thereto.
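  • A minimal sketch of such a count-gated cache is given below. The class name, the use of a hashable key to stand for a target feature, and the callable standing in for the language model are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, List


class ThresholdCache:
    """Cache language probabilities only for features processed often enough."""

    def __init__(self, model: Callable[[Hashable], List[float]], min_uses: int):
        self.model = model          # stand-in for the language model (assumption)
        self.min_uses = min_uses    # the predetermined number of times
        self.use_counts: Dict[Hashable, int] = {}
        self.cache: Dict[Hashable, List[float]] = {}

    def language_probs(self, feature_key: Hashable) -> List[float]:
        # Query the cache first; a hit skips the model computation entirely.
        if feature_key in self.cache:
            return self.cache[feature_key]
        probs = self.model(feature_key)
        count = self.use_counts.get(feature_key, 0) + 1
        self.use_counts[feature_key] = count
        # Store the result once the feature has been processed min_uses times.
        if count >= self.min_uses:
            self.cache[feature_key] = probs
        return probs
```

  • Gating on the use count keeps rarely seen features out of the cache, so memory is spent only on features that are likely to recur.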
  • A high-performance processor such as a graphics processing unit (GPU) may also be used to perform the operation of determining the target text segment, so that any computation that can be parallelized in the calculation for the M first text segments or in the process of determining the target text segment can be executed in parallel by the GPU or the like, further improving decoding efficiency and speech recognition efficiency.
  • A text fragment table may be maintained for an alternative slot, and text fragments belonging to the alternative slot are added to the text fragment table.
  • The slot text fragment belonging to the alternative slot in the text sequence can be compared with the text fragments in the text fragment table for the alternative slot.
  • The text fragment table for the alternative slot is queried according to the slot text fragment. If the slot text fragment does not belong to the text fragment table for the alternative slot, the slot text fragment can be compared with each text fragment in the text fragment table, and the text fragment in the table with the greatest similarity to the slot text fragment is used as a candidate fragment.
  • The candidate fragment is used to replace the slot text fragment in the text sequence, and the text sequence after replacement is used as the recognition result for the speech data to be recognized.
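  • The table lookup and replacement can be sketched as follows. The similarity measure is not specified in this passage; the sketch assumes character-level similarity from Python's difflib.SequenceMatcher as a stand-in, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def correct_slot_fragment(slot_fragment: str, fragment_table: List[str]) -> str:
    """Return the fragment itself if it is in the table, else the most similar table entry."""
    if slot_fragment in fragment_table or not fragment_table:
        return slot_fragment
    return max(
        fragment_table,
        key=lambda entry: SequenceMatcher(None, slot_fragment, entry).ratio(),
    )


def apply_correction(sequence: List[str], slot_index: int, fragment_table: List[str]) -> List[str]:
    """Replace the slot text fragment at slot_index with its corrected form."""
    fixed = list(sequence)
    fixed[slot_index] = correct_slot_fragment(sequence[slot_index], fragment_table)
    return fixed
```

  • For example, with a table of city names, a misrecognized slot fragment would be mapped to the closest city name in the table before the recognition result is returned.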
  • Figure 9 is a schematic diagram of the generation principle of negative samples for training constrained sub-models according to an embodiment of the present disclosure.
  • Samples for training the constrained sub-model may include, for example, positive samples and negative samples.
  • The positive samples may include texts in the predetermined text set, and the negative samples may be any text other than the texts in the predetermined text set.
  • The predetermined text may be adjusted based on a second text fragment, among the plurality of candidate text fragments, that is inconsistent with the text fragment at the target position in the predetermined text, and the adjusted text may be used as a negative sample.
  • The target position can be any position in the predetermined text. Because a negative sample generated in this way differs from the positive sample only in the text fragment at the target position, the learning ability of the constrained sub-model can be improved.
  • This embodiment 900 can randomly extract a predetermined text from the predetermined text set 910 as a positive sample 911.
  • This embodiment can also remove a predetermined number of text fragments at the end of the extracted predetermined text, and use the resulting text as a positive sample.
  • The second text fragment 920 described above can be used to replace the text fragment at the target position in the predetermined text, thereby obtaining a negative sample 930.
  • The target position can be, for example, the last position of the predetermined text.
  • The negative sample and the positive sample can then have the same prefix tree, so that the generation path of text that does not belong to the predetermined text set can be effectively pruned in the last cycle.
  • The target position can also be any position in the predetermined text. In this case, after the text fragment at the target position is replaced, the text fragments located after the target position in the predetermined text can be removed to obtain negative samples.
  • Because negative samples are obtained by removing the text fragments after the target position, all negative samples have the same prefix as the positive samples.
  • The constrained sub-model can thus learn the extensible relationship between any two text fragments in the predetermined text, which is beneficial to improving the accuracy and effectiveness of decoding-path pruning.
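  • The replace-then-truncate construction can be sketched as follows, assuming a text is represented as a list of text fragments; the function name is hypothetical.

```python
from typing import List


def make_negative_samples(positive: List[str], target_pos: int, candidates: List[str]) -> List[List[str]]:
    """Replace the fragment at target_pos with each inconsistent candidate and drop what follows."""
    prefix = positive[:target_pos]       # shared with the positive sample
    original = positive[target_pos]
    # Only candidates that differ from the original fragment yield negative samples.
    return [prefix + [cand] for cand in candidates if cand != original]
```

  • Each negative sample produced this way shares the prefix positive[:target_pos] with the positive sample and differs only in its final fragment.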
  • When using the second text fragment to adjust the predetermined text, for example, the fragment to be replaced may first be determined among the second text fragments based on the confusion relationship between the second text fragment and the text fragment at the target position in the predetermined text. Subsequently, the text fragment at the target position in the predetermined text can be replaced with the fragment to be replaced, and the text after replacement is used as a negative sample.
  • The generated negative samples can be text that is easily confused with the predetermined text (that is, the positive sample), which is beneficial to improving the discrimination ability of the constrained sub-model.
  • The number of negative samples can be effectively reduced and the negative samples made more targeted, which is beneficial to improving the training efficiency of the constrained sub-model.
  • the confusion relationship can be represented, for example, by text similarity, syllable similarity, etc. between text segments.
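  • A sketch of confusion-based selection is given below. It assumes text similarity computed with difflib.SequenceMatcher as the confusion measure and a hypothetical threshold min_similarity; a syllable-level similarity could be substituted in the same place.

```python
from difflib import SequenceMatcher
from typing import List


def confusable_fragments(original: str, candidates: List[str], min_similarity: float) -> List[str]:
    """Keep candidates that differ from the original fragment but are similar enough to it."""
    return [
        cand
        for cand in candidates
        if cand != original
        and SequenceMatcher(None, original, cand).ratio() >= min_similarity
    ]


def confusable_negatives(positive: List[str], target_pos: int, candidates: List[str],
                         min_similarity: float) -> List[List[str]]:
    """Build negative samples by substituting only confusable fragments at target_pos."""
    original = positive[target_pos]
    return [
        positive[:target_pos] + [cand] + positive[target_pos + 1:]
        for cand in confusable_fragments(original, candidates, min_similarity)
    ]
```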
  • When generating a negative sample, for example, the second text fragment may first be used to replace the text fragment at the target position in the predetermined text, and the resulting text may be used as a candidate sample.
  • The pre-trained first language sub-model described above can be used to process each candidate sample to obtain the sixth language probability with which the first language sub-model generates that candidate sample. The sixth language probability can be the product of the language probabilities with which the multiple text fragments in the candidate sample are generated in turn.
  • This embodiment can screen the candidate samples according to the sixth language probability and use the candidate samples whose sixth language probability is higher than a probability threshold as negative samples. Alternatively, the several candidate samples with the highest sixth language probabilities can be used as negative samples.
  • In this way, the scale of negative samples can be controlled, and the generation path of each negative sample is guaranteed to be a path that the first language sub-model may take when decoding to obtain the text sequence, so that targeted training of the constrained sub-model can be achieved, improving the training efficiency of the constrained sub-model and the accuracy of the trained constrained sub-model.
  • the sixth language probability and confusion relationship can be combined to control the size of negative samples, thereby improving the training efficiency and training effect of the constrained sub-model.
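  • The screening by sixth language probability can be sketched as follows. The per-fragment probabilities are assumed to come from the first language sub-model; here they are passed in directly, and the function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sequence_probability(fragment_probs: Sequence[float]) -> float:
    """Probability of a sample: product of the probabilities of its fragments in turn."""
    return math.prod(fragment_probs)


def select_negatives(scored: List[Tuple[str, Sequence[float]]],
                     threshold: Optional[float] = None,
                     top_k: Optional[int] = None) -> List[str]:
    """Keep samples above a probability threshold and/or the top_k most probable samples."""
    ranked = sorted(
        ((sequence_probability(probs), sample) for sample, probs in scored),
        key=lambda pair: pair[0],
        reverse=True,
    )
    if threshold is not None:
        ranked = [(p, s) for p, s in ranked if p > threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [sample for _, sample in ranked]
```

  • Passing both threshold and top_k applies the two criteria together, which bounds the number of negative samples while still excluding improbable ones.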
  • the present disclosure also provides a speech recognition device.
  • the device will be described in detail below with reference to FIG. 10 .
  • Figure 10 is a structural block diagram of a speech recognition device according to an embodiment of the present disclosure.
  • The speech recognition device 1000 of this embodiment may include an acoustic probability obtaining module 1010, an initial probability obtaining module 1020, an extended relationship obtaining module 1030, a probability adjustment module 1040 and a text determination module 1050.
  • The acoustic probability obtaining module 1010 is configured to use an acoustic model to process the speech data to be recognized and the recognized first text segment to obtain respective acoustic probabilities of multiple candidate text segments.
  • The acoustic probability obtaining module 1010 may be used to perform the above-described operation S210, which will not be described again here.
  • The initial probability obtaining module 1020 is configured to process the first text segment using the first language sub-model in the language model to obtain the initial language probabilities of the multiple candidate text segments.
  • The extended relationship obtaining module 1030 is configured to process the first text fragment using the constrained sub-model in the language model to obtain the extensible relationship of each of the multiple candidate text fragments with respect to the first text fragment.
  • The probability adjustment module 1040 is configured to adjust the initial language probabilities of the candidate text segments according to the extensible relationships to obtain the first language probability of each of the multiple candidate text segments. The constrained sub-model is trained based on the texts in the predetermined text set.
  • the initial probability obtaining module 1020, the extended relationship obtaining module 1030 and the probability adjustment module 1040 may be used to perform the above-described operations S220 to S240 respectively, which will not be described again here.
  • the text determination module 1050 is configured to determine a target text segment among multiple candidate text segments according to the first language probability and the acoustic probability, so as to obtain a text sequence for the speech data to be recognized.
  • the text determination module 1050 may be used to perform the operation S250 described above, which will not be described again here.
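  • The data flow through modules 1020 to 1050 can be sketched as follows. This passage does not state how the extensible relationship adjusts the initial language probability or how the probabilities are combined; the sketch assumes the relationship is a value in [0, 1] that multiplies the initial probability, and that the target segment maximizes the sum of log-probabilities. Both are assumptions, and the function names are hypothetical.

```python
import math
from typing import Dict


def adjust_language_probs(initial: Dict[str, float], extensible: Dict[str, float]) -> Dict[str, float]:
    """Scale each initial language probability by its extensible relationship (assumed in [0, 1])."""
    return {frag: prob * extensible.get(frag, 0.0) for frag, prob in initial.items()}


def pick_target(acoustic: Dict[str, float], language: Dict[str, float]) -> str:
    """Pick the candidate with the largest log P_acoustic + log P_language."""
    def score(frag: str) -> float:
        joint = acoustic[frag] * language[frag]
        return math.log(joint) if joint > 0 else float("-inf")
    return max(acoustic, key=score)
```

  • Under these assumptions, a candidate that cannot extend the first text fragment receives a first language probability of zero and is never selected, regardless of its acoustic probability.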
  • The above-mentioned initial probability obtaining module 1020 may include: an embedding processing sub-module, used to perform embedding processing on the first text fragment to obtain a text embedding feature; a feature determination sub-module, used to determine a first identification feature of the vertical category to which the first text fragment belongs; and a first probability determination sub-module, used to process, using the first language sub-model, the feature obtained by fusing the text embedding feature and the first identification feature, to obtain the initial language probability of each of the multiple candidate text segments.
  • the language model further includes a second language sub-model arranged in parallel with the first language sub-model.
  • the above device also includes: an implicit representation obtaining module, configured to input text embedding features into the second language sub-model to obtain the first implicit representation of the first text fragment.
  • the above-mentioned first language sub-model includes a first feature extraction network and a first prediction network.
  • The above-mentioned first probability determination sub-module may include: an implicit representation obtaining unit, used to input the feature obtained by fusing the text embedding feature and the first identification feature into the first feature extraction network to obtain a second implicit representation; and a first probability obtaining unit, used to input the feature obtained by fusing the first implicit representation and the second implicit representation into the first prediction network to obtain the initial language probability of each of the multiple candidate text segments.
  • the second language sub-model is trained using sample texts of multiple predetermined categories.
  • the above-mentioned second language sub-model includes a second feature extraction network and a second prediction network.
  • the above-mentioned implicit representation obtaining module is used to input text embedded features into the second feature extraction network to obtain the second implicit representation.
  • the above-mentioned device 1000 may further include: a first probability obtaining module, configured to input the second implicit representation into the second prediction network to obtain the second language probabilities of each of the plurality of candidate text segments.
  • the above text determination module 1050 is also used to determine the target text segment according to the second language probability, the first language probability and the acoustic probability.
  • the language model further includes a third language sub-model arranged in parallel with the first language sub-model.
  • The above-mentioned device 1000 may also include: an identification feature determination module, used to determine a second identification feature that characterizes the source of the speech data to be recognized; and a second probability obtaining module, used to process, using the third language sub-model, the feature obtained by fusing the text embedding feature and the second identification feature, to obtain the third language probability of each of the multiple candidate text segments.
  • the above text determination module 1050 is also used to determine the target text segment based on the third language probability, the first language probability and the acoustic probability.
  • the third language sub-model includes a third feature extraction network and a third prediction network.
  • the above-mentioned second probability obtaining module may include: an implicit representation obtaining sub-module, which is used to input the fused text embedding feature and the second identification feature into a third feature extraction network to obtain a third implicit representation; and a first probability obtaining sub-module.
  • the vertical category to which the first text fragment belongs includes a plurality of predetermined vertical categories.
  • The above-mentioned first probability determination sub-module may include: a feature fusion unit, used to fuse, for each predetermined vertical category, the text embedding feature and the identification feature of that predetermined vertical category to obtain a first fusion feature; and a second probability obtaining unit, used to process the first fusion feature using the first language sub-model to obtain the initial language probability of each of the multiple candidate text segments.
  • the plurality of alternative text segments include a plurality of first alternative segments indicating alternative words.
  • The above-mentioned text determination module 1050 may include: a designated fragment determination sub-module, used to query a predetermined text list according to the first text fragment and determine a first designated fragment among the plurality of first candidate fragments, where the text formed by splicing the first text fragment and the first designated fragment belongs to the predetermined text list; and a first segment determination sub-module, used to determine the target text segment among the plurality of first candidate fragments based on the first language probability and the acoustic probability of the first designated fragment.
  • the predetermined text list includes a plurality of texts and a recognition weight of each text in the plurality of texts, and the recognition weight indicates the difficulty of recognition of the text.
  • The above-mentioned first segment determination sub-module includes: a first determination unit, used to determine the first target text in the predetermined text list to which the text formed by splicing the first text segment and the first designated segment belongs; and a second determination unit, used to determine the target text segment among the plurality of candidate text segments according to the recognition weight of the first target text and the first language probability and the acoustic probability of the first designated segment.
  • The plurality of alternative text fragments further include a plurality of second alternative fragments indicating alternative slots; the alternative slots include entry slots.
  • The above text determination module 1050 may include: a slot determination sub-module, used to determine, among the entry slots, a target slot belonging to the predetermined text list; a second probability determination sub-module, used to process, using the language model, the feature obtained based on the third identification feature of the target slot and the start identifier of the text, to obtain the fourth language probability of each of the plurality of first candidate segments for the target slot; and a second segment determination sub-module, used to determine the target text segment among the plurality of first candidate segments based on the fourth language probability, the first language probability and the acoustic probability.
  • The alternative slots also include exit slots.
  • The above-mentioned text determination module 1050 may also include: a fusion sub-module, used to fuse the first identification feature of the vertical category to which the first text fragment belongs and the fourth identification feature of the slot corresponding to the last text fragment in the first text fragment to obtain a second fusion feature; a second probability determination sub-module, used to process the second fusion feature using the language model to obtain the fifth language probability of each of the plurality of first candidate segments for the exit slot; and a third segment determination sub-module, used to determine the target text segment among the plurality of first candidate segments based on the fifth language probability, the fourth language probability, the first language probability and the acoustic probability.
  • The above-mentioned slot determination sub-module may include: an initial slot determination unit, used to determine, among the entry slots, the slots belonging to the predetermined text list to obtain initial slots; and a target slot determination unit, used to determine the target slot among the initial slots based on the difference between the first language probability of the second candidate segment indicating each initial slot and the first language probabilities of the plurality of first candidate segments. The first language probability of the second candidate segment indicating the target slot is greater than the first language probability of the second candidate segments indicating the other initial slots.
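  • The two-step slot selection can be sketched as follows. The exact difference criterion is not given in this passage; the sketch assumes a slot is dropped when its probability is more than max_ratio times smaller than the best word probability. The parameter max_ratio and the function name are hypothetical.

```python
from typing import Dict, List, Set


def select_target_slots(entry_slot_probs: Dict[str, float],
                        allowed_slots: Set[str],
                        word_probs: List[float],
                        max_ratio: float = 100.0) -> List[str]:
    """Return target slots, most probable first."""
    # Step 1: initial slots are the entry slots that appear in the predetermined text list.
    initial = {slot: p for slot, p in entry_slot_probs.items() if slot in allowed_slots}
    # Step 2: drop slots whose probability is far below the best word candidate.
    best_word = max(word_probs, default=0.0)
    kept = [slot for slot, p in initial.items() if p * max_ratio >= best_word]
    return sorted(kept, key=lambda slot: initial[slot], reverse=True)
```

  • Pruning unlikely slots at this point avoids jumping into the small graph for slots that would not survive the later ranking anyway.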
  • The above-mentioned second segment determination sub-module may include: a third determination unit, used to query the predetermined text list according to the first text segment to obtain a second target text and a second designated fragment among the plurality of first candidate segments, where the text formed by splicing the first text fragment and the target candidate fragment, which indicates the target slot corresponding to the second designated fragment, belongs to the second target text; a probability determination unit, used to obtain the initial probability of the target candidate segment based on the recognition weight of the second target text and the first language probability of the target candidate segment; and a segment determination unit, used to determine the target text segment in the second designated fragment based on the initial probability and the fourth language probability of the second designated fragment.
  • The above-mentioned device 1000 may further include: a table query module, used to, in response to the text sequence including a slot text fragment belonging to an alternative slot, query the text fragment table for the alternative slot according to the slot text fragment; a candidate fragment determination module, used to, in response to the slot text fragment not belonging to the text fragment table, determine the text fragment in the text fragment table that is most similar to the slot text fragment as a candidate fragment; and a recognition result obtaining module, used to replace the slot text fragment in the text sequence with the candidate fragment to obtain the recognition result for the speech data to be recognized.
  • The above-mentioned device 1000 may further include: a probability storage module, used to, in response to the number of times the language model is used to process a first target feature reaching a predetermined number of times, store in a cache the language probability obtained by processing the first target feature using the language model; and a cache query module, used to, in response to needing to use the language model to process a second target feature, query the cache according to the second target feature. In response to the cache storing a language probability for the second target feature, the language probability for the second target feature is read from the cache to complete the processing of the second target feature using the language model. The first target feature and the second target feature include any one of the following features: the text embedding feature of the first text fragment; the feature obtained by fusing the text embedding feature with the identification feature of the vertical category; the feature obtained by fusing the text embedding feature with the identification feature of the data source; and the feature obtained by fusing the text embedding feature with the identification feature of the slot.
  • the operation of determining the target text segment among the plurality of candidate text segments based on the first language probability and the acoustic probability is performed by a graphics processor provided on the electronic device.
  • the samples for training the constrained sub-model include positive samples and negative samples, where the positive samples include text in a predetermined text set.
  • the above device also includes: a negative sample obtaining module, configured to adjust the predetermined text according to the second text fragment among the plurality of candidate text fragments that is inconsistent with the text fragment at the target position in the predetermined text, and obtain a negative sample.
  • The above-mentioned negative sample obtaining module includes: a fourth segment determination sub-module, used to determine the segment to be replaced in the second text segment based on the confusion relationship between the second text segment and the text segment at the target position in the predetermined text; and a first replacement sub-module, used to replace the text segment at the target position in the predetermined text with the segment to be replaced to obtain a negative sample.
  • The above-mentioned negative sample obtaining module includes: a second replacement sub-module, used to replace the text segment at the target position in the predetermined text with the second text segment to obtain candidate samples; a second probability obtaining sub-module, used to process each of the candidate samples using the first language sub-model to obtain the sixth language probability of each sample; and a sample screening sub-module, used to screen the candidate samples according to the sixth language probability to obtain negative samples.
  • The above-mentioned negative sample obtaining module may include: a third replacement sub-module, used to replace the text fragment at the target position in the predetermined text with the second text fragment to obtain an initial text; and a fragment removal sub-module, used to remove the text fragments after the target position in the initial text to obtain a negative sample.
  • The collection, storage, use, processing, transmission, provision, disclosure and application of user personal information all comply with relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
  • the user's authorization or consent is obtained before obtaining or collecting the user's personal information.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • FIG. 11 shows a schematic block diagram of an example electronic device 1100 that may be used to implement the speech recognition method of embodiments of the present disclosure.
  • Electronic devices are intended to refer to various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 1100 includes a computing unit 1101 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 1102 or loaded from a storage unit 1108 into a random access memory (RAM) 1103.
  • Various programs and data required for the operation of the device 1100 can also be stored in the RAM 1103.
  • Computing unit 1101, ROM 1102 and RAM 1103 are connected to each other via bus 1104.
  • An input/output (I/O) interface 1105 is also connected to bus 1104.
  • Multiple components in the device 1100 are connected to the I/O interface 1105, including: an input unit 1106, such as a keyboard, a mouse, etc.; an output unit 1107, such as various types of displays, speakers, etc.; a storage unit 1108, such as a magnetic disk, an optical disk, etc.; and a communication unit 1109, such as a network card, a modem, a wireless communication transceiver, etc.
  • the communication unit 1109 allows the device 1100 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
  • Computing unit 1101 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc.
  • the computing unit 1101 performs various methods and processes described above, such as speech recognition methods.
  • the speech recognition method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as storage unit 1108.
  • Part or all of the computer program may be loaded and/or installed onto the device 1100 via the ROM 1102 and/or the communication unit 1109.
  • When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the speech recognition method described above may be performed.
  • the computing unit 1101 may be configured to perform the speech recognition method in any other suitable manner (eg, by means of firmware).
  • Various implementations of the systems and techniques described above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor, which may be a special purpose or general purpose programmable processor, may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, laptop disks, hard drives, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optics, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above .
  • To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
  • The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.
  • Computer systems may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact over a communications network.
  • The relationship of client and server arises from computer programs running on the respective computers and having a client-server relationship with each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host. It is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services.
  • The server can also be a server of a distributed system or a server combined with a blockchain.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

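A speech recognition method, apparatus, device, and medium. The method includes: processing speech data to be recognized and an already-recognized first text segment with an acoustic model to obtain an acoustic probability for each of multiple candidate text segments (S210); processing the first text segment with a first language submodel to obtain an initial language probability for each candidate text segment (S220); processing the first text segment with a constraint submodel to obtain, for each candidate text segment, its extensibility relationship with respect to the first text segment (S230); adjusting the initial language probabilities of the candidate text segments according to the extensibility relationships to obtain a first language probability for each candidate text segment (S240); and determining a target text segment among the candidate text segments according to the first language probabilities and the acoustic probabilities (S250).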

Description

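Speech Recognition Method, Apparatus, Device, and Medium
This application claims priority to Chinese Patent Application No. 202211064891.8, filed on September 1, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence, specifically to technical fields such as speech recognition, natural language processing, and deep learning, and in particular to a speech recognition method, apparatus, device, and medium.
Background
With the development of computer and network technologies, deep learning has been widely applied in many fields. For example, an acoustic model built with deep learning can be used to recognize speech, converting captured speech into text.
Summary
The present disclosure aims to provide a speech recognition method, apparatus, device, and medium.
According to one aspect of the present disclosure, a speech recognition method is provided, including: processing speech data to be recognized and an already-recognized first text segment with an acoustic model to obtain an acoustic probability for each of multiple candidate text segments; processing the first text segment with a first language submodel in a language model to obtain an initial language probability for each candidate text segment; processing the first text segment with a constraint submodel in the language model to obtain, for each candidate text segment, its extensibility relationship with respect to the first text segment; adjusting the initial language probabilities of the candidate text segments according to the extensibility relationships to obtain a first language probability for each candidate text segment; and determining a target text segment among the candidate text segments according to the first language probabilities and the acoustic probabilities, so as to obtain a text sequence for the speech data to be recognized. The constraint submodel is trained on the texts in a predefined text set.
According to another aspect of the present disclosure, a speech recognition apparatus is provided, including: an acoustic probability obtaining module configured to process speech data to be recognized and an already-recognized first text segment with an acoustic model to obtain an acoustic probability for each of multiple candidate text segments; an initial probability obtaining module configured to process the first text segment with a first language submodel in a language model to obtain an initial language probability for each candidate text segment; an extensibility relationship obtaining module configured to process the first text segment with a constraint submodel in the language model to obtain, for each candidate text segment, its extensibility relationship with respect to the first text segment; a probability adjustment module configured to adjust the initial language probabilities of the candidate text segments according to the extensibility relationships to obtain a first language probability for each candidate text segment; and a text determination module configured to determine a target text segment among the candidate text segments according to the first language probabilities and the acoustic probabilities, so as to obtain a text sequence for the speech data to be recognized, where the constraint submodel is trained on the texts in a predefined text set.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the speech recognition method provided by the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the speech recognition method provided by the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program/instructions that, when executed by a processor, implement the speech recognition method provided by the present disclosure.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become easy to understand from the following description.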
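Brief Description of the Drawings
The drawings are provided for a better understanding of this solution and do not limit the present disclosure. In the drawings:
FIG. 1 is a schematic diagram of an application scenario of the speech recognition method and apparatus according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of the speech recognition method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the principle of obtaining the initial language probabilities of multiple candidate text segments according to a first embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the principle of obtaining the first language probabilities of multiple candidate text segments according to a second embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of the language model according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the principle of determining the target text segment according to a first embodiment of the present disclosure;
FIG. 7 is a schematic diagram of the principle of determining the target text segment according to a second embodiment of the present disclosure;
FIG. 8 is a schematic diagram of the principle of determining the target text segment according to a third embodiment of the present disclosure;
FIG. 9 is a schematic diagram of the principle of generating negative samples for training the constraint submodel according to an embodiment of the present disclosure;
FIG. 10 is a structural block diagram of the speech recognition apparatus according to an embodiment of the present disclosure; and
FIG. 11 is a block diagram of an electronic device used to implement the speech recognition method of the embodiments of the present disclosure.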
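Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the drawings. They include various details of the embodiments to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted below.
Speech recognition tasks are usually carried out with acoustic modeling techniques. For example, an end-to-end attention model can be built to improve recognition accuracy. In real services, however, relying only on the trained acoustic model makes it hard to meet the high accuracy that specific services demand. The reason is that training data for acoustic models is usually limited and cannot cover the wide variety of business domains. In addition, the requirements of each business domain tend to shift with current events, so the acoustic model usually has to be updated iteratively to keep accuracy up. Because iterating an acoustic model is costly and slow, it usually cannot keep pace with how quickly accuracy requirements change.
For this reason, a language model can be combined with the acoustic model to carry out the speech recognition task. The language model's abundant training data and fast update cycle compensate for the weaknesses of the acoustic model and help meet the services' accuracy requirements.
The language model may be, for example, a Neural Network Language Model (NNLM). An NNLM is essentially a sequence model: its input is a text sequence that includes the text segment predicted in the previous iteration, and its output is the probability distribution over multiple predefined text segments for the current iteration. Based on this distribution, the predefined text segment with the highest probability can be taken as the text segment predicted in the current iteration. The acoustic model may be an attention-based acoustic model. Each text segment may be text of any granularity, such as a character, a word, a syllable, or a phrase.
According to embodiments of the present disclosure, a decoding algorithm that relies on a language model and an attention-based acoustic model can fuse the probability distribution output by a single acoustic model with the probability distribution output by a single NNLM, and use beam search to obtain, from the fused result, the candidate paths selected in each decoding step. For example, with N predefined text segments and a beam width of 3, the first decoding step selects the 3 segments with the highest probability values from the N predefined text segments as candidate text segments. Each subsequent decoding step selects, from the 3*N paths, the 3 paths with the highest total probability as candidate paths, until every selected candidate path contains the end-of-text identifier <EOS>, or until the text segments in every selected candidate path reach a length threshold. A path can be represented by the sequence of segments obtained from the first decoding step through the current one, arranged in generation order. The total probability of a path can be the product of the probability values of the segments in the sequence, or the sum of the logarithms of those probability values.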
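The beam search described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: `beam_search`, `toy_step`, and the toy probabilities are hypothetical, and `toy_step` stands in for whatever function returns the fused acoustic and language log-probabilities of each candidate segment given a prefix.

```python
import math

def beam_search(step_log_probs, beam=3, eos="<EOS>", max_len=10):
    """Keep the `beam` highest-scoring paths at every decoding step.

    step_log_probs(prefix) is assumed to return a dict that maps each
    candidate segment to its fused log-probability given `prefix`.
    """
    beams = [(("<SOS>",), 0.0)]
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            if prefix[-1] == eos:
                expanded.append((prefix, score))  # finished paths are carried over unchanged
                continue
            for segment, log_p in step_log_probs(prefix).items():
                # Path score = sum of the log-probabilities of its segments.
                expanded.append((prefix + (segment,), score + log_p))
        beams = sorted(expanded, key=lambda item: item[1], reverse=True)[:beam]
        if all(prefix[-1] == eos for prefix, _ in beams):
            break
    return beams

def toy_step(prefix):
    # Hypothetical fused distribution: favour <EOS> once two segments were emitted.
    if len(prefix) >= 3:
        probs = {"a": 0.05, "b": 0.05, "<EOS>": 0.9}
    else:
        probs = {"a": 0.6, "b": 0.3, "<EOS>": 0.1}
    return {segment: math.log(p) for segment, p in probs.items()}

result = beam_search(toy_step, beam=3)
best_path, best_score = result[0]
```

With N candidates and a beam of 3, each step after the first scores 3*N extensions and keeps 3, which matches the counting in the paragraph above.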
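Combining a language model with an acoustic model does improve recognition accuracy to some extent, but in this approach the expansion of decoding paths is guided only by the probability distribution that the language model outputs. For a closed-set recognition task, there is no guarantee that the recognized text is one of the texts in the set configured for that task, which harms downstream tasks (for example, searching or giving a voice response based on the recognized text). In other words, this approach still suffers from low recognition accuracy and poor task completion.
For this reason, the present disclosure provides a speech recognition method and apparatus that improve recognition accuracy and make the recognition result consistent with the recognition task. The application scenario of the method and apparatus is first described below with reference to FIG. 1.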
图1是根据本公开实施例的语音识别方法和装置的应用场景示意图。
如图1所示,该实施例的应用场景100可以包括电子设备110,该电子设备110可以为具有处理功能的各种电子设备,包括但不限于智能手机、平板电脑、膝上型便携计算机、台式计算机、智能手表或智能音箱等等。
该电子设备110例如可以对获得的语音数据120进行处理,例如可以对语音数据120进行语音识别,以将语音数据120转换为文本130。例如,语音数据120可以是通过对采集的语音处理后所得到的数据。采集的语音可以为采用麦克风等音频采集器所采集的用户语音。
在一实施例中,电子设备110中可以设置有音频采集器,且该电子设备110可以安装有输入法、浏览器、智能音箱APP、车载APP等具有语音识别功能的客户端应用(仅为示例),电子设备110可以通过语音识别来将语音数据转换为输入的字符,以进行信息查询、智能音箱远程控制或者车辆远程控制等。
在一实施例中,电子设备110可以采用端到端模型140来完成语音识别任务。其中,端到端模型140例如可以包括上文描述的语言模型和声学模型,该端到端模型140可以采用束搜索的方式来得到文本130。或者,端到端模型140可以为上文描述的端到端的流式注意力模型。或者,电子设备110也可以采用下文描述的语音识别方法来完成 语音识别任务,本公开对此不做限定。
在一实施例中,如图1所示,该应用场景100中还可以包括服务器150。该服务器150例如可以为支持电子设备110中客户端应用运行的后台管理服务器。电子设备110可以通过网络与服务器150通信连接,网络可以包括有线或无线通信链路。
例如,服务器150可以基于海量的文本样本对语言模型进行训练,并基于语音-文本对来对声学模型进行训练。服务器150可以将训练得到的语言模型和声学模型构成端到端模型140,并结合具体场景对该端到端模型140进行微调。服务器150例如可以响应于电子设备110发送的获取请求,将微调后的端到端模型140发送给电子设备110,以供电子设备110采用该端到端模型140完成语音识别任务。
在一实施例中,电子设备110可以将获得的语音数据120发送给服务器150,由服务器150来根据端到端模型140对语音数据120进行语音识别,得到文本130。
需要说明的是,本公开提供的语音识别方法可以由电子设备110执行,也可以由服务器150执行。相应地,本公开提供的语音识别装置可以设置在电子设备110中,也可以设置在服务器150中。
应该理解,图1中的电子设备110和服务器150的数目和类型仅仅是示意性的。根据实现需要,可以具有任意数目和类型的电子设备110和服务器150。
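The speech recognition method provided by the present disclosure is described in detail below with reference to FIGS. 2 to 9.
FIG. 2 is a schematic flowchart of the speech recognition method according to an embodiment of the present disclosure.
As shown in FIG. 2, the speech recognition method 200 of this embodiment may include operations S210 to S250.
In operation S210, speech data to be recognized and an already-recognized first text segment are processed with an acoustic model to obtain an acoustic probability for each of multiple candidate text segments.
According to embodiments of the present disclosure, the acoustic model may be a model composed of a Gaussian Mixture Model (GMM) and a Hidden Markov Model (HMM), or a model composed of Deep Neural Networks (DNN) and an HMM. The acoustic model may include, for example, an encoder and a decoder. The input of the encoder is the speech data to be recognized, and its output is the extracted acoustic features. The input of the decoder includes the acoustic features and the embedding features of the already-recognized first text segment. The output of the acoustic model is a probability distribution over the candidate text segments, which contains the acoustic probability of each candidate text segment.
In the initial stage of speech recognition, the already-recognized first text segment may be the start-of-text identifier <SOS>. In later stages, it is the sequence of text segments formed by <SOS> and the text segments recognized so far.
The candidate text segments may be, for example, the characters in a character library. The characters in the library can be set according to actual needs, which the present disclosure does not limit.
In operation S220, the first text segment is processed with a first language submodel in a language model to obtain an initial language probability for each candidate text segment.
In operation S230, the first text segment is processed with a constraint submodel in the language model to obtain, for each candidate text segment, its extensibility relationship with respect to the first text segment.
In operation S240, the initial language probabilities of the candidate text segments are adjusted according to the extensibility relationships to obtain a first language probability for each candidate text segment.
According to embodiments of the present disclosure, the language model may be the NNLM described above or an N-gram model. The first text segment can be input into the language model, which outputs a probability distribution over the candidate text segments; this distribution contains the first language probability of each candidate text segment.
According to embodiments of the present disclosure, the language model may include, for example, a first language submodel and a constraint submodel arranged in parallel. The first language submodel may be the NNLM described above, and the constraint submodel has a structure similar to that of the NNLM. The input of both submodels may be the embedding features of the first text segment, and their network structures may be similar. The main difference is that the first language submodel processes the first text segment to produce a probability distribution, whereas the constraint submodel processes it to produce a vector representing the extensibility relationships. The distribution produced by the first language submodel contains a language probability for each candidate text segment, which can be used as the initial language probability. The vector representing the extensibility relationships contains multiple elements, each indicating the extensibility relationship of one candidate text segment with respect to the first text segment. Having an extensibility relationship means that the candidate text segment can follow the first text segment.
In one embodiment, each element takes the value 0 or 1, where 0 indicates no extensibility relationship and 1 indicates that an extensibility relationship exists.
After the extensibility relationship of each candidate text segment with respect to the first text segment is obtained from the output of the constraint submodel, the initial language probabilities of the candidate text segments can be adjusted accordingly. For example, the value of the element representing a candidate text segment's extensibility relationship can be multiplied by that segment's initial language probability to obtain its first language probability. Alternatively, the logarithm of the element value and the logarithm of the initial language probability can be added, and the sum used as the segment's first language probability.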
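The two adjustment variants above can be sketched as follows. The function name `adjust_language_probs` and the sample values are assumptions for illustration; the sketch also assumes every initial probability is strictly positive, and it maps an element value of 0 to negative infinity in the log domain because `math.log(0)` would raise an error.

```python
import math

def adjust_language_probs(initial_probs, extensible, log_domain=False):
    """Apply a 0/1 extensibility vector to the initial language probabilities.

    initial_probs: initial language probability per candidate segment (all > 0).
    extensible:    1 if the candidate may follow the first text segment, else 0.
    """
    adjusted = []
    for prob, flag in zip(initial_probs, extensible):
        if log_domain:
            # log(1) = 0 leaves the score unchanged; log(0) is treated as -inf.
            adjusted.append(math.log(prob) if flag else float("-inf"))
        else:
            adjusted.append(prob * flag)
    return adjusted

first_probs = adjust_language_probs([0.5, 0.3, 0.2], [1, 0, 1])
first_log_probs = adjust_language_probs([0.5, 0.3, 0.2], [1, 0, 1], log_domain=True)
```

Either way, a candidate with no extensibility relationship can never win the later selection, which is how the constraint prunes paths outside the predefined text set.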
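In one embodiment, the constraint submodel may be trained on the texts in a predefined text set. The predefined text set may be the set of texts configured for a closed-set recognition task, and the closed-set recognition task can be set according to actual needs.
In operation S250, a target text segment among the candidate text segments is determined according to the first language probabilities and the acoustic probabilities, so as to obtain a text sequence for the speech data to be recognized.
According to embodiments of the present disclosure, for each candidate text segment, the first language probability and the acoustic probability can be added or multiplied, and the result used as that segment's probability value. The text segment with the largest probability value can then be selected as the target text segment.
After the target text segment is obtained, it can be appended to the already-recognized first text segment, and operations S210 to S250 are repeated until the text segment with the largest probability value is the end-of-text identifier <EOS>, or until the total number of text segments, counting that segment and those in the first text segment, reaches a predetermined number.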
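The loop over operations S210 to S250 can be sketched as a greedy decoder. Everything here is hypothetical scaffolding: `acoustic_log_probs` and `language_log_probs` stand in for the acoustic model and the constrained language model, scores are combined by adding log-probabilities, and the toy models only illustrate how a constrained language score can override an acoustic preference.

```python
import math

def greedy_decode(acoustic_log_probs, language_log_probs, max_segments=20):
    """Repeat S210-S250, appending the best-scoring segment each time."""
    prefix = ["<SOS>"]
    while len(prefix) < max_segments:
        am = acoustic_log_probs(prefix)   # S210
        lm = language_log_probs(prefix)   # S220-S240 (already adjusted)
        scores = {seg: am[seg] + lm[seg] for seg in am if seg in lm}
        target = max(scores, key=scores.get)  # S250
        prefix.append(target)
        if target == "<EOS>":
            break
    return prefix

EXPECTED = ["play", "music", "<EOS>"]  # the single text of a toy closed set

def toy_acoustic(prefix):
    # The toy acoustic model always prefers "music", regardless of context.
    return {"play": math.log(0.2), "music": math.log(0.5), "<EOS>": math.log(0.3)}

def toy_language(prefix):
    # Only the next segment of the closed-set text is extensible.
    allowed = EXPECTED[len(prefix) - 1]
    return {seg: (0.0 if seg == allowed else float("-inf")) for seg in EXPECTED}

decoded = greedy_decode(toy_acoustic, toy_language)
```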
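In one embodiment, beam search can also be used: the segments at the last position of a predetermined number (for example, M) of paths with the highest total probability are taken as target text segments. Each target text segment is then appended to the first text segment, giving M adjusted text segments. Each adjusted text segment is then used as a first text segment, and operations S210 to S240 are performed again, giving M*N candidate paths in total. The M paths with the highest total probability are selected from these M*N candidate paths, and so on, until every selected candidate path contains the end-of-text identifier <EOS>, or until the text segments in every selected candidate path reach the length threshold. Finally, the text segments on the candidate path with the highest total probability form the text sequence for the speech data to be recognized.
By providing, in the language model, a constraint submodel that predicts the extensibility relationship of each candidate text segment with respect to the first text segment, and adjusting the predicted initial language probabilities according to that relationship, embodiments of the present disclosure can guide the expansion of decoding paths using both the extensibility relationships and the initial language probabilities. When the constraint submodel is a neural network model that has learned, from the text set configured for a closed-set recognition task, the extensibility relationships among the candidate text segments, this guidance allows the recognized text to be one of the texts in that set. This improves recognition accuracy and task completion and facilitates downstream tasks.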
以下将结合图3~图5,对上述操作S220~操作S240的实施进行进一步扩展和限定。
图3是根据本公开第一实施例的得到多个备选文本片段的第一语言概率的原理示意图。
根据本公开的实施例,在采用语言模型得到语言概率时,例如还可以为语言模型的输入添加垂类标识。如此,可以使得语言模型能够针对不同垂类的文本进行不同路径的引导。使得本公开的语言模型可以用于对多种不同垂类的文本进行预测,利于提高本公开的语音识别方法的鲁棒性。
如图3所示,该实施例300中,在确定备选文本片段各自的初始语言概率时,可以先对第一文本片段301进行处理,以得到该第一文本片段301的文本嵌入特征302。例如可以采用word2vec方法或者字表达的全局向量方法(Global Vectors for Word Representation,GloVe)等来对第一文本片段301进行处理。
在得到文本嵌入特征302的任意时机,该实施例还可以确定第一文本片段所属的垂类303,以及该所属的垂类303的第一标识特征304。可以理解的是,第一文本片段所属的垂类303例如可以响应于用户操作确定,或者,在语音识别的初始阶段,可以将多个预定垂类均作为第一文本片段所属的垂类303,针对每个预定垂类,均得到一个概率分布。随着路径的扩展,可以将所选择路径对应的预定垂类作为已识别得到的第一文本片段所属的垂类。该实施例可以为多个预定垂类中的每个预定垂类分配标识符,该实施例可以通过对垂类的标识符进行编码,从而垂类的第一标识特征。
在得到文本嵌入特征302和第一标识特征304后,该实施例可以先融合该文本嵌入特征302和第一标识特征304。随后将融合得到的特征输入第一语言子模型320中,经由第一语言子模型320处理后可以得到语言概率分布305。该语言概率分布305包括多个预定文本片段的初始语言概率。
示例性地,可以通过将文本嵌入特征302和第一标识特征304拼接,来实现两者 的融合。或者,可以设定文本嵌入特征302与第一标识特征304具有相同的维度,该实施例可以通过采用加法器310对文本嵌入特征302和第一标识特征304相加,来实现两者的融合。可以理解的是,上述融合方法仅作为示例以利于理解本公开,本公开对此不做限定。
示例性地,第一语言子模型320可以采用NNLM模型。例如,该第一语言子模型320可以包括依次连接的输入层、隐藏层和输出层,其中,输入层可以用于将文本转换为嵌入特征,可以理解的是,该输入层可以包括上文描述的对第一文本片段进行处理,得到文本嵌入特征的功能、根据垂类得到第一标识特征的功能、以及将文本嵌入特征和第一标识特征融合的功能。隐藏层可以为全连接层,也可以为由序列网络与全连接层构成的网络结构,以利于学习到输入序列中多个数据之间的上下文信息。其中,序列网络可以包括基于注意力机制的网络(例如Transformer)或者长短期记忆网络LSTM等,本公开对此不做限定。输出层可以包括softmax等逻辑回归网络。
图4是根据本公开第二实施例的得到多个备选文本片段的第一语言概率的原理示意图。
根据本公开的实施例,在语言模型中,可以设置一个通用的语言模型分支,该语言模型分支可以是采用多个垂类的文本训练得到。考虑到该通用的语言模型分支的垂类的偏向存在不足,而针对垂类的语言模型通常参数量过高,该实施例可以将两者相结合,将通用的语言模型分支的参数共享给针对垂类的语言模型,同时为针对垂类的语言模型额外添加一部分参数,以对垂类进行单独的强化学习。即,在语言模型中设置两个分支,一个为通用语言模型分支,一个为针对垂类的语言模型分支。如此,可以在对语言模型对多垂类的识别率进行优化的基础上,保证了模型的体积较小,从而降低模型运行时的算力需求,利于提高该实施例方法的鲁棒性。
如图4所示,在该实施例400中,语言模型可以包括第一语言子模型410、与第一语言子模型410并列设置的第二语言子模型420和约束子模型430。其中,第一语言子模型410和约束子模型430构成针对垂类的语言模型分支。
在得到第一语言概率时,该实施例可以将文本嵌入特征401输入第二语言子模型420,得到第二语言子模型420的隐藏层输出的第一隐式表征。
该实施例还可以将文本嵌入特征401与所属垂类的第一标识特征402融合后输入 到第一语言子模型410,并将第一语言子模型410的隐藏层输出的第二隐式表征和上述第一隐式表征融合。随后将融合后的特征输入第一语言子模型410的输出层,由该输出层输出语言概率分布,从而得到多个备选文本片段各自的初始语言概率。
该实施例还可以将文本嵌入特征401与第一标识特征402融合后输入到约束子模型430,由该约束子模型430输出表征可扩展关系的向量。将该向量与初始语言概率输入到融合层440,由融合层440根据表征可扩展关系的向量来对初始语言概率进行调整,从而输出多个备选文本片段各自的第一语言概率403。
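The hidden layer of the first language submodel can serve as a first feature extraction network, and its output layer as a first prediction network. The input of the first prediction network includes the feature obtained by fusing the second implicit representation with the first implicit representation (for example, with an adder), and its output is a probability distribution. The vector representing the extensibility relationships can be used to adjust the logarithms of the probability values in that distribution. By determining the language probability from the logarithms of the probability values, this embodiment converts multiplication between values into addition between their logarithms, which helps preserve computational precision. This is because electronic devices typically compute floating-point multiplication with relatively low precision and addition with relatively high precision.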
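One concrete way to see the benefit of working with logarithms is underflow: the product of many small probabilities can round to zero in double precision, while the sum of their logarithms stays finite and still ranks paths correctly. The values below are made up purely for illustration.

```python
import math

def path_product(probs):
    """Total path probability as a plain product of segment probabilities."""
    total = 1.0
    for p in probs:
        total *= p
    return total

def path_log_sum(probs):
    """Total path score as the sum of log-probabilities."""
    return sum(math.log(p) for p in probs)

path_a = [1e-5] * 80   # hypothetical 80-segment path
path_b = [2e-5] * 80   # a path that is more likely at every step

product_a, product_b = path_product(path_a), path_product(path_b)
log_a, log_b = path_log_sum(path_a), path_log_sum(path_b)
```

Both products underflow to 0.0, so the two paths can no longer be told apart, whereas the log-domain scores keep `path_b` ahead of `path_a`.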
在一具体实施例中,第一语言子模型可以包括LSTM层、加法器、全连接层和逻辑回归层(softmax)。其中,加法器可以设置在全连接层与逻辑回归层之间。相应地,LSTM和全连接层构成第一特征提取网,softmax层构成第一预测网络。在一具体实施例中,加法器不仅设置在全连接层与逻辑回归层之间,还设置在LSTM层与全连接层之间。相应地,LSTM层、设置在LSTM层与全连接层之间的加法器和全连接层可以构成第一特征提取网络411,设置在全连接层与逻辑回归层之间的加法器和逻辑回归层可以构成第一预测网络412。其中,LSTM层与全连接层之间的加法器用于融合第一隐式表征和LSTM输出的特征,全连接层与逻辑回归层之间的加法器用于融合第一隐式表征和第二隐式表征。如此,可以实现第一隐式表征与第一语言子模型中特征的充分融合,加强第一语言子模型中网络参数与第二语言子模型中网络参数的共享,提高得到的第一语言概率的精度,提高语音识别精度。
在一具体实施例中,第二语言子模型420可以包括LSTM层、全连接层和softmax层。其中,LSTM层和全连接层构成第二语言子模型的第二特征提取网络421,softmax层构成第二语言子模型的第二预测网络422。该实施例在将第一文本片段的文本嵌入特 征401输入第二特征提取网络421得到第二隐式表征之后,可以将该第二隐式表征输入第二预测网络422,由该第二预测网络422输出另一概率分布,从而得到多个备选文本片段各自的第二语言概率404。最后,该实施例可以根据第一语言概率403、第二语言概率404和声学概率,来确定目标文本片段。具体地,可以将第一语言概率403和第二语言概率404分别与声学概率相加。若设定多个预定文本片段为N个,则总计得到2*N个相加后的概率值。随后,从该2*N个相加后的概率值中选择M个较大的概率值,从而得到当前次解码所得到的候选路径。通过该方式,可以使得本公开实施例的方法不仅可以应用于多个垂类的场景中,也可以应用于通用的语音识别场景中,提高该实施例方法的鲁棒性。
图5是根据本公开实施例的语言模型的结构示意图。
根据本公开的实施例,可以在语言模型中设置与第一语言子模型并列的第三语言子模型,用于学习不同来源的语音数据与文本之间的关系。该实施例可以将第三语言子模型得到的语言概率和针对垂类的语言模型分支所得到的语言概率作为并列选项进行筛选。如此,该实施例的语言模型可以应用于不同场景的不同垂类中,且无需针对不同垂类不同场景进行分别训练,可以提高模型的鲁棒性,降低模型的训练成本。
如图5所示,在该实施例500中,语言模型可以包括第一语言子模型510、第二语言子模型520、约束子模型530和第三语言子模型540。其中,第一语言子模型510、第二语言子模型520和约束子模型530与上文描述的图4中的相应模型类似,在此不再赘述。
在该实施例500中,第三语言子模型540与第一语言子模型510类似,区别在于,该第三语言子模型的输入为表征待识别语音数据的来源的第二标识特征503和文本嵌入特征501融合后的特征。
相应地,该实施例在进行语音识别时,还可以确定表征待识别语音数据的来源的第二标识特征503。例如,在使用者确定语音识别效果较差时,可以提供训练数据。该实施例的方法可以为该使用者分配标识符,并根据该使用者提供的训练数据对第三语言子模型进行训练。在实际语音识别中,可以根据待识别语音的来源确定使用者,通过对为该确定的使用者分配的标识符进行编码,得到第二标识特征。可以理解的是,使用者可以为各种具有语音识别功能的客户端应用。第二标识特征还可以通过对客户 端应用的名称等进行编码来得到,本公开对此不做限定。
在得到第二标识特征503后,该实施例500可以采用第三语言子模型540对文本嵌入特征501和第二标识特征503融合后的特征进行处理。基于与第一语言子模型得到初始语言概率的原理类似的原理,该第三语言子模型540可以输出概率分布。通过对该概率分布中的概率值取对数,即可得到多个备选文本片段各自的第三语言概率506。
可以理解的是,如图5所示,与上文描述的第一语言子模型类似,该实施例500中,第三语言子模型540可以包括第三特征提取网络和第三预测网络。该实施例可以将文本嵌入特征501和第二标识特征503融合后的特征输入到第三特征提取网络541,从而得到第三隐式表征。随后,将融合第一隐式表征和第三隐式表征所得到的特征输入到第三预测网络542,由第三预测网络542输出概率分布。通过对该概率分布中的概率值取对数,即可得到多个备选文本片段各自的第三语言概率506。
在得到第三语言概率506后,该实施例可以根据第三语言概率506、第一语言概率504和声学概率,来确定目标文本片段。该原理与上述根据第一语言概率、第二语言概率和声学概率,来确定目标文本片段的原理类似,在此不再赘述。
在一实施例中,在第二语言子模型520得到第二语言概率505的基础上,该实施例500可以根据第一语言概率504、第二语言概率505、第三语言概率506和声学概率,来确定目标文本片段。该原理与上述根据第一语言概率、第二语言概率和声学概率,来确定目标文本片段的原理类似,在此不再赘述。
可以理解的是,语言模型为序列模型,在对待识别语音进行识别时,语言模型中第一语言子模型的初始输入包括P个特征,该P个特征通过将文本起始标识符<SOS>的嵌入特征与P个预定垂类的标识特征分别相加而得到。第二语言子模型的初始输入为文本起始标识符<SOS>的嵌入特征。第三语言子模型的初始输入为文本起始标识符<SOS>的嵌入特征与表征待识别语音的来源的第二标识特征相加所得到的特征。经过语言模型的处理,可以得到(P+2)*N个概率值,对应(P+2)*N个扩展路径。该实施例可以从该(P+2)*N个扩展路径中选择M个概率总值较高的路径。如此,在第二次解码中,已识别的第一文本片段包括M个文本片段,由文本起始标识符<SOS>与该M个概率总值较高的路径所对应的文本片段分别组合得到。随后,将该M个文本片段分别输入第二语言子模型,得到M*N个扩展路径;将该M个文本片段分别与M个概率 值较高的路径对应的垂类的标识特征融合后输入第一语言子模型,得到M*N个扩展路径。将该M个文本片段分别与第二标识特征融合后输入第三语言子模型,得到M*N个扩展路径,总计得到3M*N个扩展路径。随后,从该3M*N个扩展路径中选择M个概率总值较高的路径,以此类推,经过多次解码,直至筛选得到的M个路径均包括文本结束标识符<EOS>,或者直至筛选出的M个路径中文本片段的长度均达到长度阈值。最后,将概率总值最高的路径对应的文本序列作为识别得到的待识别语音数据的文本序列。可以理解的是,在第i次解码中,筛选得到的路径中,包括的文本片段的个数为(i+1),且该文本片段中包括文本起始标识符<SOS>。
以下将对上述操作S250的实施进行进一步扩展和限定。
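According to embodiments of the present disclosure, for a closed-set recognition task, a predefined text list can be built from the text set configured for that task. When determining the target text segment, the target is selected from the candidate text segments according to this predefined text list. The text formed by the recognized text sequence then belongs to the text set configured for the closed-set recognition task, which gives the method of this embodiment the ability to force recognition of a text in the closed set. When the closed-set recognition task is speech recognition for a smart speaker, this method ensures that song titles, artist names, and the like in the recognized text sequence are existing titles and names, so that music matching the user's actual request can be played based on the recognition result.
In this embodiment, the candidate text segments may include, for example, multiple first candidate segments that indicate candidate characters. The candidate characters can be set according to actual needs, which the present disclosure does not limit. When determining the target text segment, the predefined text list can first be queried with the first text segment, and first specified segments among the first candidate segments are determined from the query result. For example, the predefined text list can be queried to find the texts that include the first text segment, which are taken as first texts. Suppose the text set includes the text "请播放歌手A的歌曲a" ("please play song a by singer A") and the first text segment is "请播". The text "请播放歌手A的歌曲a" is then determined to be a first text, and the character "放" that follows the first text segment in this first text is taken as a first specified segment. That is, the text formed by concatenating the first text segment and the first specified segment belongs to the predefined text list.
After the first specified segments are obtained, the target text segment among the first candidate segments can be determined according to the first language probability and the acoustic probability of each first specified segment. For example, the logarithm of a first specified segment's first language probability can be added to the logarithm of its acoustic probability, and the sum used as that segment's probability value with respect to the first text segment. When there is only one first text segment, the M first specified segments with the largest probability values with respect to it can be taken as target text segments. When there are multiple first text segments, second text segments are first selected from them, namely those whose concatenation with a first specified segment belongs to the predefined text list. The probability value of the first specified segment with respect to a second text segment is multiplied by the probability value of that second text segment, giving the probability value of the text formed by concatenating the two. Finally, the first specified segments in the M texts with the highest probability values are taken as target text segments.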
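The lookup of first specified segments and their scoring can be sketched as below. This is an assumption-laden illustration: the predefined text list is a plain Python list, "includes the first text segment" is interpreted as a prefix match (as in the "请播" example), segments are single characters, and all names and probabilities are hypothetical.

```python
import math

def first_specified_segments(first_text, text_list, candidates):
    """Candidates that keep `first_text` a prefix of some predefined text."""
    allowed = set()
    for text in text_list:
        if text.startswith(first_text) and len(text) > len(first_text):
            next_segment = text[len(first_text)]
            if next_segment in candidates:
                allowed.add(next_segment)
    return allowed

def top_targets(first_text, text_list, lm_probs, am_probs, m=1):
    """Rank the allowed segments by log(first language prob) + log(acoustic prob)."""
    allowed = first_specified_segments(first_text, text_list, lm_probs.keys())
    scores = {seg: math.log(lm_probs[seg]) + math.log(am_probs[seg]) for seg in allowed}
    return sorted(scores, key=scores.get, reverse=True)[:m]

TEXT_LIST = ["请播放歌手A的歌曲a", "请暂停播放"]
LM_PROBS = {"放": 0.4, "暂": 0.3, "报": 0.3}   # toy first language probabilities
AM_PROBS = {"放": 0.5, "暂": 0.1, "报": 0.4}   # toy acoustic probabilities

allowed_after_prefix = first_specified_segments("请播", TEXT_LIST, LM_PROBS.keys())
targets = top_targets("请播", TEXT_LIST, LM_PROBS, AM_PROBS, m=1)
```

For large text lists a prefix tree would avoid scanning every text at every step; the linear scan here only keeps the sketch short.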
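In one embodiment, a recognition weight can be set for each text in the predefined text list. The recognition weight can be determined by how difficult the text is to recognize; for example, it may be positively correlated with recognition difficulty. The recognition weight can then also be taken into account when filtering the candidate text segments to determine the target text segment, which helps the speech recognition method recognize difficult texts and strengthens its ability to handle them. The recognition weights can be set and modified according to actual needs, which the present disclosure does not limit.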
图6是根据本公开第一实施例的确定目标文本片段的原理示意图。
如图6所示,该实施例600在确定目标文本片段时,可以先根据第一文本片段601查询预定文本列表602,确定预定文本列表602中包括该第一文本片段601的文本,作为第一文本603。该实施例可以将属于多个第一备选片段,且在该第一文本603中位于第一文本片段601之后的文本片段作为第一指定片段604。
随后该实施例可以确定第一文本片段601和第一指定片段604拼接成的文本,作为拼接后文本605,并将第一文本603中包括该拼接后文本的部分作为第一目标文本606。最后,该实施例可以根据该第一目标文本606的识别权重、第一指定片段604的第一语言概率和该第一指定片段604的声学概率,来确定目标文本片段。例如,可以将第一目标文本606的识别权重、第一指定片段604的第一语言概率和该第一指定片段604的声学概率的对数相加,作为该第一指定片段604针对第一文本片段601的概率值,随后根据该概率值,从确定的第一指定片段604中筛选出目标文本片段。
根据本公开的实施例,在预定文本列表中,例如可以以模板形式来表示预定文本,将预定文本中的实体类文本片段等采用槽位表示,并在预定文本列表中列出槽位对应的实体类可以包括的实体,以此,利于对语音的识别进行精细化的路径管理,利 于提高语音识别精度。其中,实体类文本片段例如可以包括表示歌曲名、歌手名、兴趣点名称等的文本片段,不同类型的实体对应一个槽位,例如,歌曲名类别的实体对应的槽位为[song],歌手名类别的实体对应的槽位为[singer],兴趣点名称类别的实体对应的槽位为[POI]等。
相应地,该实施例可以采用大图套小图的解码方法来预测得到文本序列。其中,大图对应文本模板,小图对应槽位。在第一文本片段之后的文本片段为一个槽位表示的实体时,该实施例可以结合该槽位的标识特征来对槽位表示的实体进行预测,以此,可以使得语言模型针对不同的槽位进行预测,利于提高预测得到的目标文本片段的精度。这是由于,通过考虑槽位的标识特征,可以使得语言模型能够学习到不同槽位与预测得到的文本片段之间的映射关系。
以下将结合图7对该实施例确定目标文本片段的原理进行详细描述。
图7是根据本公开第二实施例的确定目标文本片段的原理示意图。
如图7所示,该实施例700中,多个备选文本片段除了包括指示备选字的多个第一备选片段外,还包括指示备选槽位的多个第二备选片段。其中,备选槽位可以根据实际需求进行设定,例如可以根据实际场景中实体的类别来设置备选槽位,每个类别的实体对应一个备选槽位。该备选槽位中对应每个类别的实体的槽位可以理解为入槽槽位,在预测过程中,还可以设置出槽槽位,用于表示完成对实体的预测。
该实施例700中,在采用语言模型得到第一语言概率后,例如可以采用与上述根据预定文本列表确定第一指定片段的类似方法,先根据预定文本列表702,确定入槽槽位701中属于该预定文本列表702的目标槽位703。通过该方式,可以过滤掉无法识别出闭集中文本的入槽槽位。具体地,如上文描述,预定文本列表中的文本由字和槽位构成,槽位对应位置处为预定文本中实体所在的位置。该实施例可以将预定文本列表中构成文本的槽位与入槽槽位701进行比较,从而得到目标槽位703。
随后,该实施例可以采用语言模型710对根据该目标槽位703的第三标识特征704和文本的起始标识符<SOS>705得到的特征进行处理,得到多个第一备选片段的第四语言概率。该第四语言概率可以表示各备选字属于目标槽位703中片段的概率。该部分为跳入小图进行解码的过程,该解码过程中,采用文本的起始标识符的嵌入特征替代第一文本片段的文本嵌入特征,采用目标槽位703的第三标识特征704替代第一文本片段 所属垂类的第一标识特征。具体地,该实施例可以先确定目标槽位703的第三标识特征704,该第三标识特征704可以通过为目标槽位703分配的标识符编码来得到。同时,可以对起始标识符<SOS>705进行编码,得到起始标识符编码特征。随后,将该第三标识特征704与起始标识符编码特征相加,得到根据该目标槽位703的第三标识特征704和文本的起始标识符<SOS>705得到的特征,该特征可以作为语言模型710中第一语言子模型和约束子模型的输入。采用与上述得到第一语言概率的原理类似的原理,得到第一备选片段针对目标槽位的第四语言概率706。
在得到第四语言概率706后,该实施例可以根据第四语言概率706、第一语言概率和声学概率,确定第一备选片段中的目标文本片段。例如,设定目标槽位的个数为Q个,针对每个目标槽位,该实施例可以根据基于该每个目标槽位的第三标识特征得到的第四语言概率和指示该每个目标槽位的第二备选片段的第一语言概率,确定多个第一备选片段作为该每个目标槽位中的文本片段的概率。例如,可以将每个第一备选片段的第四语言概率与指示该每个目标槽位的第二备选片段的第一语言概率相乘,作为该每个第一备选片段作为该每个目标槽位中的文本片段的概率。设定多个第一备选片段为N’个,则针对每个目标槽位,可以得到N’个概率,针对Q个目标槽位,可以总计得到Q*N’个概率。该实施例可以将该Q*N’个概率和N’个第一备选片段的第一语言概率组成概率集,该概率集中总计包括(Q+1)*N’个概率。
该实施例700例如可以将(Q+1)*N’个概率的对数值分别于对应第一备选片段的声学概率的对数值相加,得到(Q+1)*N’扩展概率。该实施例可以根据该(Q+1)*N’扩展概率,从(Q+1)*N’扩展概率对应的(Q+1)*N’个路径中选择M个路径,将M个路径中最后位置对应的文本片段作为目标文本片段。
根据本公开的实施例,对于出槽槽位,可以采用与针对入槽槽位类似的方式,来确定目标文本片段。区别在于,针对出槽槽位,输入语言模型710的特征中,替代第一文本片段的文本嵌入特征的是:跳出的槽位的标识特征,具体为第一文本片段中最后位置的文本片段所对应槽位的第四标识特征。第一标识特征应为第一文本片段所属垂类的标识特征。该实施例可以通过融合第四标识特征和第一标识特征,得到第二融合特征。该第二融合特征可以作为语言模型的输入,由语言模型处理得到多个第一备选片段针对出槽槽位的第五语言概率。最后,该实施例可以根据该第五语言概率、第一 语言概率和声学概率,确定多个第一备选片段中的目标文本片段。
例如,该实施例可以采用上文描述的方法,针对Q个目标槽位,总计得到Q*N’个概率。该实施例还可以将指示出槽槽位的第二文本片段的第一语言概率与每个第一备选片段针对出槽槽位的第五语言概率相乘,作为该每个第一备选片段作为出槽后第一个文本片段的概率,针对N’个第一备选片段,可以总计得到N’个概率。该实施例可以将得到的Q*N’个概率、N’个第一备选片段作为出槽后第一个文本片段的N’个概率和N’个第一备选片段的N’个第一语言概率组成概率集,该概率集中总计包括(Q+2)*N’个概率。
随后,该实施例可以将(Q+2)*N’个概率的对数值分别于对应第一备选片段的声学概率的对数值相加,得到(Q+2)*N’扩展概率。该实施例可以根据该(Q+2)*N’扩展概率,从(Q+2)*N’扩展概率对应的(Q+2)*N’个路径中选择M个路径,将M个路径中最后位置对应的文本片段作为目标文本片段。
根据本公开的实施例,在确定目标槽位703时,例如可以将入槽槽位中属于预定文本列表的槽位作为初始槽位。随后,将指示初始槽位的第二备选片段的第一语言概率与多个第一备选片段的第一语言概率进行比较,将相对而言概率值较大的第二备选片段所指示的初始槽位作为目标槽位。例如,该实施例可以先确定多个第一备选片段的第一语言概率中取值较大的预定数量个概率;然后将指示初始槽位的第二备选片段的第一语言概率与预定数量个概率中的最小概率进行比较,若指示某个初始槽位的第二备选片段的第一语言概率高于最小概率,或者低于最小概率且与最小概率的差值绝对值小于等于第一预定阈值,则确定该某个初始槽位为目标槽位。或者,该实施例可以将指示初始槽位的第二备选片段的第一语言概率与多个第一备选片段的第一语言概率中的最大概率进行比较,若两者的差值绝对值小于第二预定阈值,则确定初始槽位为目标槽位。可以理解的是,上述根据差异确定目标槽位的方法仅作为示例以利于理解本公开,本公开对此不做限定。
本公开实施例通过根据与多个第一备选片段的第一语言概率的差异来确定目标槽位,可以对入槽槽位进行进一步的筛选,剔除被扩展的几率小的槽位,从而在保证预测精度的同时,降低计算量,提高解码得到目标文本片段的计算效率。
图8是根据本公开第三实施例的确定目标文本片段的原理示意图。
根据本公开的实施例,在跳入小图进行解码时,例如也可以结合预定文本列表中为文本分配的识别权重,来筛选得到目标文本片段。如此,可以利于语音识别方法识别出识别难度高的文本,强化该语音识别方法对高难度文本的识别能力。
例如,在得到上文描述的第四语言概率之后,或在任意时机,根据第一文本片段查询预定文本列表,得到第二目标文本和多个第一备选片段中的第二指定片段。具体地,可以先将第一文本片段和指示各第一备选片段对应槽位的第二备选片段拼接,得到多个拼接后文本。随后根据拼接后文本查询预定文本列表,确定包括多个拼接后文本中任一文本的预定文本为第二目标文本,并将任一文本中包括的指示槽位对应的第一备选片段作为第二指定片段。为了便于描述,该实施例可以将指示第二指定片段对应槽位的第二备选片段,作为目标备选片段。
随后,该实施例可以根据第二目标文本的识别权重和目标备选片段的第一语言概率,来确定目标备选片段的初始概率。例如,可以将第二目标文本的识别权重与目标备选片段的第一语言概率相乘,将乘积作为初始概率。或者,将第二目标文本的识别权重的对数值与目标备选片段的第一语言概率的对数值相加,得到初始概率,本公开对此不做限定。
在得到初始概率后,该实施例可以根据初始概率和第二指定片段的第四语言概率,来确定该第二指定片段作为目标槽位中第一个文本片段的概率,例如,可以将初始概率与第二指定片段的第四语言概率的对数值相加,得到第二指定片段作为目标槽位中第一个文本片段的概率。该概率可以替代上文描述的Q*N’个概率中的相应概率。
以下将结合图8,通过一实例来对本公开实施例中解码得到目标文本片段的原理进行详细描述。
如图8所示,在该实施例800中,设定采用束搜索方式进行解码,从而得到文本序列时,设定beam为M,则除第一循环外,解码过程中的每个循环中的第一文本片段的个数为M个。设定备选字的个数为N’个,备选槽位包括Q’个入槽槽位和一个出槽槽位。针对M个第一文本片段中的文本片段801,该实施例可以采用声学模型810得到N’个声学概率802。采用语言模型820,可以得到分别对应N’个备选字的N’个语言概率,分别对应Q’个入槽槽位的入槽概率,对应出槽槽位的出槽概率,总计(N’+Q’+1)个语言概率803。
同时,该实施例可以根据文本片段801查询预定文本列表830,查询得到信息804,该信息804可以包括上文描述的第一目标文本及其识别权重w1,以及上文描述的第二目标文本及其识别权重w2。该实施例可以根据查询得到的信息804对预测得到的语言概率所对应的文本片段进行筛选,从而得到可扩展的字805、上文描述的目标槽位806和出槽槽位807。可以理解的是,可扩展的字可以为上文描述的第一指定片段。在出槽槽位807的出槽概率远小于目标槽位和可扩展的字的概率时,还可以将出槽槽位剔除。其中,可扩展的字805的扩展概率可以由可扩展的字的声学概率的对数值、可扩展的字的语言概率的对数值与可扩展的字所对应的第一目标文本的识别权重w1的和来表示。目标槽位806的可扩展初始概率可以由目标槽位806的入槽概率的对数值和目标槽位对应的第二目标文本的识别权重w2的和来表示。出槽槽位的扩展初始概率由出槽概率的对数值来表示。
该实施例可以将可扩展的字805作为候选文本片段,将候选文本片段与文本片段801拼接,将拼接后文本加入针对文本片段801的第一候选池808中。
针对目标槽位,该实施例可以采用上文描述的类似方法,将文本起始标识符的嵌入特征和目标槽位的标识特征输入语言模型820,跳入小图进行解码操作,从而得到上文描述的第四语言概率。针对出槽槽位,该实施例可以采用上文描述的类似方法,将第一文本片段所属垂类的标识特征和第一文本片段中最后位置的文本片段所对应槽位的标识特征输入语言模型830,跳入大图进行解码操作,从而得到上文描述的第五语言概率。随后,该实施例可以通过查询预定文本列表,依据列表中的文本对第四语言概率和第五语言概率进行约束,筛选得到属于预定文本列表中文本的文本片段,将该文本片段与文本片段801拼接后加入第一候选池808。
基于类似的原理,针对M个第一文本片段中的每个文本片段,可以得到M个候选池。该实施例可以从M个候选池中选择概率总值最大的M个候选文本片段,作为下一次循环中的M个第一文本片段。直至选择得到的M个候选文本片段中均包括文本的结束标识符<EOS>,或者M个候选文本片段中文本片段的个数均达到预定个数。
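As the above shows, embodiments of the present disclosure typically need to run the language model twice in a single loop iteration. To improve computational efficiency, once the number of times the language model has processed a first target feature reaches a predetermined count, the language probabilities obtained for that feature can be stored in a cache for later use. Accordingly, when the language model needs to process some target feature (for example, a second target feature), the cache can be queried first to determine whether it holds language probabilities for the second target feature. If it does, they are read directly from the cache, which completes the language model's processing of the second target feature without running the model's expensive computation.
The first target feature and the second target feature may each be any one of the following: the text embedding feature of the first text segment; the feature obtained by fusing the text embedding feature with the identification feature of a vertical; the feature obtained by fusing the text embedding feature with the identification feature of the data source; and the feature obtained by fusing the text embedding feature with the identification feature of a slot. That is, the first and second target features can be any of the features described above that are input to the hidden layer of the language model, which the present disclosure does not limit.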
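The count-then-cache behaviour can be sketched as a thin wrapper around the language model. The class name `CountingCache`, the threshold value, and the use of a hashable tuple as the feature key are all assumptions made for this illustration.

```python
class CountingCache:
    """Cache a model's output for a feature once it was computed `threshold` times."""

    def __init__(self, model, threshold=2):
        self.model = model          # callable: feature key -> language probabilities
        self.threshold = threshold  # hypothetical "predetermined count"
        self.counts = {}
        self.cache = {}

    def __call__(self, feature_key):
        if feature_key in self.cache:
            return self.cache[feature_key]      # cache hit: skip the model entirely
        result = self.model(feature_key)
        self.counts[feature_key] = self.counts.get(feature_key, 0) + 1
        if self.counts[feature_key] >= self.threshold:
            self.cache[feature_key] = result    # store for later calls
        return result

model_calls = []

def fake_language_model(feature_key):
    # Stand-in for the real language model; records each invocation.
    model_calls.append(feature_key)
    return {"a": 0.7, "b": 0.3}

cached_model = CountingCache(fake_language_model, threshold=2)
outputs = [cached_model(("embedding", 1)) for _ in range(4)]
```

Caching only after a threshold keeps rarely seen features out of memory while still saving work on the features that recur across beams and iterations.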
在一实施例中,还可以采用图形处理器GPU等高性能处理器来执行确定目标文本片段的操作,以使得针对M个第一文本片段的计算或确定目标文本片段的过程中所涉及到的任意可并行地计算,可以由GPU等并行地执行,以进一步提高解码效率,提高语音识别效率。
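According to embodiments of the present disclosure, a text segment table can be maintained for each candidate slot, with the text segments belonging to that slot added to the table. After the text sequence is recognized, the slot text segment in the text sequence that belongs to a candidate slot can be compared with the text segments in that slot's table. Specifically, in response to the text sequence including a slot text segment that belongs to a candidate slot, the text segment table for that candidate slot is queried with the slot text segment. If the slot text segment is not in the table, it is compared with each text segment in the table, and the text segment in the table with the greatest similarity to the slot text segment is taken as a replacement segment. The replacement segment then replaces the slot text segment in the text sequence, and the resulting text is used as the recognition result for the speech data to be recognized.
In this way, the text segment at a candidate slot in the text sequence is guaranteed to be one from the text segment table, which ensures that the text segments in the generated recognition result are reasonable. For example, if the slot text segment is "朋果", the table lookup allows it to be replaced with "苹果" ("apple"), making the recognition result reasonable and more accurate.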
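The slot repair step can be sketched with the standard library. The source does not fix a similarity measure, so the character-level ratio from `difflib.SequenceMatcher` used here is an assumption (a syllable-based measure would be another option), and the table contents are invented for the example.

```python
from difflib import SequenceMatcher

def repair_slot_segment(slot_segment, segment_table):
    """Return the segment itself if it is in the table, else its most similar entry."""
    if slot_segment in segment_table:
        return slot_segment
    return max(
        segment_table,
        key=lambda entry: SequenceMatcher(None, slot_segment, entry).ratio(),
    )

FRUIT_TABLE = ["苹果", "香蕉", "西瓜"]  # hypothetical text segment table for one slot

repaired = repair_slot_segment("朋果", FRUIT_TABLE)
unchanged = repair_slot_segment("香蕉", FRUIT_TABLE)
```

When several table entries are equally similar, `max` returns the first one, so a production system would likely want an explicit tie-breaking rule.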
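The generation of the samples used to train the constraint submodel of the language model is further described below with reference to FIG. 9, so that the constraint submodel can learn the extensibility relationships among the candidate text segments in a closed-set recognition task, which improves task completion and facilitates downstream tasks.
FIG. 9 is a schematic diagram of the principle of generating negative samples for training the constraint submodel according to an embodiment of the present disclosure.
According to embodiments of the present disclosure, the samples for training the constraint submodel may include positive samples and negative samples. The positive samples may include the texts in the predefined text set, and a negative sample may be any text other than the texts in the predefined text set. Based on the vector representing the extensibility relationships that the constraint submodel generates, paths that generate text outside the predefined text set can then be pruned during decoding.
In one embodiment, a predefined text can be adjusted according to second text segments, namely those candidate text segments that differ from the text segment at a target position in the predefined text, and the adjusted text is used as a negative sample. The target position can be any position in the predefined text. Because negative samples generated this way differ from positive samples only in the text segment at the target position, the learning ability of the constraint submodel can be improved.
For example, as shown in FIG. 9, this embodiment 900 can randomly draw a predefined text from the predefined text set 910 as a positive sample 911. The text obtained by removing a predetermined number of text segments from the end of the drawn predefined text can also be used as a positive sample.
After the predefined text is drawn, the second text segment 920 described above can replace the text segment at the target position in the predefined text, giving a negative sample 930.
In one embodiment, the target position can be the last position of the predefined text. The negative and positive samples then share the same prefix tree, and during decoding, paths that generate text outside the predefined text set can be pruned effectively in the last iteration.
In one embodiment, the target position can be any position. After the second text segment 920 replaces the text segment at the target position in the drawn predefined text, the text segments after the target position can be removed to obtain the negative sample.
By removing the text segments after the target position, this embodiment makes every negative sample share a prefix with a positive sample. By allowing the target position to be any position, the constraint submodel can learn the extensibility relationship between any two text segments in the predefined texts, which improves the precision and effectiveness of decoding-path pruning.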
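Negative-sample generation by replacement and truncation can be sketched as follows. The function name and toy texts are hypothetical, segments are single characters, and the check that discards any candidate that is still a prefix of a predefined text is an added assumption (such a string would really be a positive prefix, not a negative sample).

```python
def make_negative_samples(text, position, candidates, text_set, truncate=True):
    """Replace the segment at `position`; optionally drop everything after it."""
    original = text[position]
    negatives = []
    for segment in candidates:
        if segment == original:
            continue  # only segments that differ from the original are used
        sample = text[:position] + segment
        if not truncate:
            sample += text[position + 1:]
        # Assumption: skip samples that are still valid prefixes of a predefined text.
        if any(known.startswith(sample) for known in text_set):
            continue
        negatives.append(sample)
    return negatives

TEXT_SET = {"请播放", "请停止"}          # toy predefined text set
CANDIDATES = ["播", "停", "暂"]          # toy candidate segments

truncated = make_negative_samples("请播放", 1, CANDIDATES, TEXT_SET)
full_length = make_negative_samples("请播放", 1, CANDIDATES, TEXT_SET, truncate=False)
```

In the truncated variant "请停" is dropped because "请停止" is a predefined text, which shows why the prefix check matters once texts share beginnings.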
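In one embodiment, when a predefined text is adjusted with second text segments, the segments to be substituted can first be chosen from the second text segments according to the confusion relationship between each second text segment and the text segment at the target position in the predefined text. A chosen segment then replaces the text segment at the target position, and the resulting text is used as a negative sample. Negative samples generated this way are texts that are easily confused with the predefined text (the positive sample), which improves the discriminative ability of the constraint submodel. Choosing the substitute segments in this way also reduces the number of negative samples and makes them more targeted, which improves the training efficiency of the constraint submodel.
The confusion relationship can be expressed, for example, by the textual similarity or syllable similarity between text segments: the higher the similarity, the more easily the segments are confused.
In one embodiment, when generating negative samples, second text segments can first replace the text segment at the target position in the predefined text, and the resulting texts are taken as candidate samples. The pre-trained first language submodel described above can then process each candidate sample to obtain the language probability with which the first language submodel generates that sample (referred to as the sixth language probability), which can be the product of the language probabilities of successively generating the text segments in the sample. The candidate samples are then filtered by the sixth language probability: candidates whose sixth language probability exceeds a probability threshold are used as negative samples, or the several candidates with the highest sixth language probabilities are used as negative samples. This keeps the number of negative samples under control and ensures that the generation path of each negative sample is a path the first language submodel could plausibly take during decoding, so the constraint submodel is trained in a targeted way, improving both training efficiency and the precision of the trained constraint submodel.
In one embodiment, the sixth language probability and the confusion relationship can be combined to control the number of negative samples, thereby improving the training efficiency and training effect of the constraint submodel.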
基于本公开提供的语音识别方法,本公开还提供了一种语音识别装置。以下将结合图10对该装置进行详细描述。
图10是根据本公开实施例的语音识别装置的结构框图。
如图10所示,该实施例的语音识别装置1000可以包括声学概率获得模块1010、初始概率获得模块1020、扩展关系获得模块1030、概率调整模块1040和文本确定模块1050。
声学概率获得模块1010用于采用声学模型对待识别语音数据和已识别得到的第一文本片段进行处理,得到多个备选文本片段各自的声学概率。在一实施例中,声学概率获得模块1010可以用于执行上文描述的操作S210,在此不再赘述。
初始概率获得模块1020用于采用语言模型中的第一语言子模型对第一文本片段进 行处理,得到多个备选文本片段各自的初始语言概率。扩展关系获得模块1030用于采用语言模型中的约束子模型对第一文本片段进行处理,得到多个备选文本片段各自针对第一文本片段的可扩展关系。概率调整模块1040用于根据可扩展关系,对备选文本片段的初始语言概率进行调整,得到多个备选文本片段各自的第一语言概率。其中,约束子模型是基于预定文本集中的文本训练得到的。在一实施例中,初始概率获得模块1020、扩展关系获得模块1030和概率调整模块1040可以分别用于执行上文描述的操作S220~操作S240,在此不再赘述。
文本确定模块1050用于根据第一语言概率和声学概率,确定多个备选文本片段中的目标文本片段,以得到针对待识别语音数据的文本序列。在一实施例中,文本确定模块1050可以用于执行上文描述的操作S250,在此不再赘述。
根据本公开的实施例,上述初始概率获得模块1020可以包括:嵌入处理子模块,用于对第一文本片段进行嵌入处理,得到文本嵌入特征;特征确定子模块,用于确定第一文本片段所属垂类的第一标识特征;以及第一概率确定子模块,用于采用第一语言子模型对文本嵌入特征和第一标识特征融合后的特征进行处理,得到多个备选文本片段各自的初始语言概率。
根据本公开的实施例,语言模型还包括与第一语言子模型并列设置的第二语言子模型。上述装置还包括:隐式表征获得模块,用于将文本嵌入特征输入第二语言子模型,得到第一文本片段的第一隐式表征。上述第一语言子模型包括第一特征提取网络和第一预测网络。上述第一概率确定子模块可以包括:隐式表征获得单元,用于将文本嵌入特征和第一标识特征融合后的特征输入第一特征提取网络,得到第二隐式表征;以及第一概率获得单元,用于将融合第一隐式表征和第二隐式表征所得到的特征输入第一预测网络,得到多个备选文本片段各自的初始语言概率。其中,第二语言子模型是采用多个预定垂类的样本文本训练得到的。
根据本公开的实施例,上述第二语言子模型包括第二特征提取网络和第二预测网络。上述隐式表征获得模块用于将文本嵌入特征输入第二特征提取网络,得到第二隐式表征。上述装置1000还可以包括:第一概率获得模块,用于将第二隐式表征输入第二预测网络,得到多个备选文本片段各自的第二语言概率。上述文本确定模块1050还用于根据第二语言概率、第一语言概率和声学概率,确定目标文本片段。
根据本公开的实施例,语言模型还包括与第一语言子模型并列设置的第三语言子模型。上述装置1000还可以包括:标识特征确定模块,用于确定表征待识别语音数据的来源的第二标识特征;第二概率获得模块,用于采用第三语言子模型对文本嵌入特征和第二标识特征融合后的特征进行处理,得到多个备选文本片段各自的第三语言概率。上述文本确定模块1050还用于根据第三语言概率、第一语言概率和声学概率,确定目标文本片段。
根据本公开的实施例,第三语言子模型包括第三特征提取网络和第三预测网络。上述第二概率获得模块可以包括:隐式表征获得子模块,用于将文本嵌入特征和第二标识特征融合后的特征输入第三特征提取网络,得到第三隐式表征;以及第一概率获得子模块,用于将融合第一隐式表征和第三隐式表征所得到的特征输入第三预测网络,得到多个备选文本片段各自的第三语言概率。
根据本公开的实施例,在第一文本片段为文本的起始标识符的情况下,第一文本片段所属垂类包括多个预定垂类。上述第一概率确定子模块可以包括:特征融合单元,用于针对每个预定垂类,融合文本嵌入特征和每个预定垂类的标识特征,得到第一融合特征;第二概率获得单元,用于采用第一语言子模型对第一融合特征进行处理,得到多个备选文本片段各自的初始语言概率。
根据本公开的实施例,多个备选文本片段包括指示备选字的多个第一备选片段。上述文本确定模块1050可以包括:指定片段确定子模块,用于根据第一文本片段查询预定文本列表,确定多个第一备选片段中的第一指定片段,第一文本片段和第一指定片段拼接成的文本属于预定文本列表;第一片段确定子模块,用于根据第一指定片段的第一语言概率和声学概率,确定多个第一备选片段中的目标文本片段。
根据本公开的实施例,预定文本列表中包括多个文本及多个文本中每个文本的识别权重,识别权重指示文本的识别难易程度。上述第一片段确定子模块包括:第一确定单元,用于确定预定文本列表中第一文本片段和第一指定片段拼接成的文本所属的第一目标文本;以及第二确定单元,用于根据第一目标文本的识别权重、第一指定片段的第一语言概率和声学概率,确定多个备选文本片段中的目标文本片段。
根据本公开的实施例,多个备选文本片段还包括指示备选槽位的多个第二备选片段;备选槽位包括入槽槽位。上述文本确定模块1050可以包括:槽位确定子模块,用 于确定入槽槽位中属于预定文本列表的目标槽位;第二概率确定子模块,用于采用语言模型对根据目标槽位的第三标识特征和文本的起始标识符得到的特征进行处理,得到多个第一备选片段各自针对目标槽位的第四语言概率;以及第二片段确定子模块,用于根据第四语言概率、第一语言概率和声学概率,确定多个第一备选片段中的目标文本片段。
根据本公开的实施例,备选槽位还包括出槽槽位。上述文本确定模块1050还可以包括:融合子模块,用于融合第一文本片段所属垂类的第一标识特征,和第一文本片段中最后位置的文本片段所对应槽位的第四标识特征,得到第二融合特征;第二概率确定子模块,用于采用语言模型对第二融合特征进行处理,得到多个第一备选片段各自针对出槽槽位的第五语言概率;以及第三片段确定子模块,用于根据第五语言概率、第四语言概率、第一语言概率和声学概率,确定多个第一备选片段中的目标文本片段。
根据本公开的实施例,上述槽位确定子模块可以包括:初始槽位确定单元,用于确定入槽槽位中属于预定文本列表的槽位,得到初始槽位;目标槽位确定单元,用于根据指示初始槽位的第二备选片段的第一语言概率与多个第一备选片段的第一语言概率的差异,确定初始槽位中的目标槽位。其中,指示目标槽位的第二备选片段的第一语言概率大于指示初始槽位中除目标槽位外其他槽位的第二备选片段的第一语言概率。
根据本公开的实施例,上述第二片段确定子模块可以包括:第三确定单元,用于根据第一文本片段查询预定文本列表,得到第二目标文本和多个第一备选片段中的第二指定片段;第一文本片段和指示第二指定片段对应目标槽位的目标备选片段拼接成的文本属于第二目标文本;概率确定单元,用于根据第二目标文本的识别权重和目标备选片段的第一语言概率,得到目标备选片段的初始概率;以及片段确定单元,用于根据初始概率和第二指定片段的第四语言概率,确定第二指定片段中的目标文本片段。
根据本公开的实施例,上述装置1000还可以包括:表查询模块,用于响应于文本序列中包括属于备选槽位的槽位文本片段,根据槽位文本片段查询针对备选槽位的文本片段表;候补片段确定模块,用于响应于槽位文本片段不属于文本片段表,确定文 本片段表中与槽位文本片段的相似度最大的文本片段,作为候补片段;以及识别结果获得模块,用于采用候补片段替换文本序列中的槽位文本片段,得到针对待识别语音数据的识别结果。
根据本公开的实施例,上述装置1000还可以包括:概率存储模块,用于响应于采用语言模型对第一目标特征进行处理的次数达到预定次数,将语言模型对第一目标特征进行处理所得到的语言概率存入缓存中;缓存查询模块,用于响应于需要采用语言模型对第二目标特征进行处理,根据第二目标特征查询缓存;以及概率读取模块,用于响应于缓存中存储有针对第二目标特征的语言概率,从缓存中读取针对第二目标特征的语言概率,完成采用语言模型对第二目标特征的处理,其中,第一目标特征和第二目标特征包括以下特征中的任意一个特征:第一文本片段的文本嵌入特征;文本嵌入特征和垂类的标识特征融合后的特征;文本嵌入特征和数据的来源的标识特征融合后的特征;文本嵌入特征和槽位的标识特征融合后的特征。
根据本公开的实施例,根据第一语言概率和声学概率,确定多个备选文本片段中的目标文本片段的操作是由设置于电子设备上的图形处理器执行的。
根据本公开的实施例,训练约束子模型的样本包括正样本和负样本,其中,正样本包括预定文本集中的文本。上述装置还包括:负样本获得模块,用于根据多个备选文本片段中与预定文本中目标位置处的文本片段不一致的第二文本片段,对预定文本进行调整,得到负样本。
根据本公开的实施例,上述负样本获得模块包括:第四片段确定子模块,用于根据第二文本片段与预定文本中目标位置处的文本片段之间的混淆关系,确定第二文本片段中的待替换片段;以及第一替换子模块,用于采用待替换片段替换预定文本中目标位置处的文本片段,得到负样本。
根据本公开的实施例,上述负样本获得模块包括:第二替换子模块,用于采用第二文本片段替换预定文本中目标位置处的文本片段,得到备选样本;第二概率获得子模块,用于针对备选样本中的各样本,采用第一语言子模型进行处理,得到各样本的第六语言概率;以及样本筛选子模块,用于根据第六语言概率对备选样本进行筛选,得到负样本。
根据本公开的实施例,上述负样本获得模块可以包括:第三替换子模块,用于采 用第二文本片段替换预定文本中目标位置处的文本片段,得到初始文本;以及片段去除子模块,用于去除初始文本中目标位置之后的文本片段,得到负样本。
需要说明的是,本公开的技术方案中,所涉及的用户个人信息的收集、存储、使用、加工、传输、提供、公开和应用等处理,均符合相关法律法规的规定,采取了必要保密措施,且不违背公序良俗。在本公开的技术方案中,在获取或采集用户个人信息之前,均获取了用户的授权或同意。
根据本公开的实施例,本公开还提供了一种电子设备、一种可读存储介质和一种计算机程序产品。
图11示出了可以用来实施本公开实施例的语音识别方法的示例电子设备1100的示意性框图。电子设备旨在表示各种形式的数字计算机,诸如,膝上型计算机、台式计算机、工作台、个人数字助理、服务器、刀片式服务器、大型计算机、和其它适合的计算机。电子设备还可以表示各种形式的移动装置,诸如,个人数字处理、蜂窝电话、智能电话、可穿戴设备和其它类似的计算装置。本文所示的部件、它们的连接和关系、以及它们的功能仅仅作为示例,并且不意在限制本文中描述的和/或者要求的本公开的实现。
如图11所示,设备1100包括计算单元1101,其可以根据存储在只读存储器(ROM)1102中的计算机程序或者从存储单元1108加载到随机访问存储器(RAM)1103中的计算机程序,来执行各种适当的动作和处理。在RAM 1103中,还可存储设备1100操作所需的各种程序和数据。计算单元1101、ROM 1102以及RAM 1103通过总线1104彼此相连。输入/输出(I/O)接口1105也连接至总线1104。
设备1100中的多个部件连接至I/O接口1105,包括:输入单元1106,例如键盘、鼠标等;输出单元1107,例如各种类型的显示器、扬声器等;存储单元1108,例如磁盘、光盘等;以及通信单元1109,例如网卡、调制解调器、无线通信收发机等。通信单元1109允许设备1100通过诸如因特网的计算机网络和/或各种电信网络与其他设备交换信息/数据。
计算单元1101可以是各种具有处理和计算能力的通用和/或专用处理组件。计算单元1101的一些示例包括但不限于中央处理单元(CPU)、图形处理单元(GPU)、各种专用的人工智能(AI)计算芯片、各种运行机器学习模型算法的计算单元、数字信 号处理器(DSP)、以及任何适当的处理器、控制器、微控制器等。计算单元1101执行上文所描述的各个方法和处理,例如语音识别方法。例如,在一些实施例中,语音识别方法可被实现为计算机软件程序,其被有形地包含于机器可读介质,例如存储单元1108。在一些实施例中,计算机程序的部分或者全部可以经由ROM 1102和/或通信单元1109而被载入和/或安装到设备1100上。当计算机程序加载到RAM 1103并由计算单元1101执行时,可以执行上文描述的语音识别方法的一个或多个步骤。备选地,在其他实施例中,计算单元1101可以通过其他任何适当的方式(例如,借助于固件)而被配置为执行语音识别方法。
本文中以上描述的***和技术的各种实施方式可以在数字电子电路***、集成电路***、场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、芯片上***的***(SOC)、复杂可编程逻辑设备(CPLD)、计算机硬件、固件、软件、和/或它们的组合中实现。这些各种实施方式可以包括:实施在一个或者多个计算机程序中,该一个或者多个计算机程序可在包括至少一个可编程处理器的可编程***上执行和/或解释,该可编程处理器可以是专用或者通用可编程处理器,可以从存储***、至少一个输入装置、和至少一个输出装置接收数据和指令,并且将数据和指令传输至该存储***、该至少一个输入装置、和该至少一个输出装置。
用于实施本公开的方法的程序代码可以采用一个或多个编程语言的任何组合来编写。这些程序代码可以提供给通用计算机、专用计算机或其他可编程数据处理装置的处理器或控制器,使得程序代码当由处理器或控制器执行时使流程图和/或框图中所规定的功能/操作被实施。程序代码可以完全在机器上执行、部分地在机器上执行,作为独立软件包部分地在机器上执行且部分地在远程机器上执行或完全在远程机器或服务器上执行。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行***、装置或设备使用或与指令执行***、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体***、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器 (ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
为了提供与用户的交互,可以在计算机上实施此处描述的***和技术,该计算机具有:用于向用户显示信息的显示装置(例如,CRT(阴极射线管)或者LCD(液晶显示器)监视器);以及键盘和指向装置(例如,鼠标或者轨迹球),用户可以通过该键盘和该指向装置来将输入提供给计算机。其它种类的装置还可以用于提供与用户的交互;例如,提供给用户的反馈可以是任何形式的传感反馈(例如,视觉反馈、听觉反馈、或者触觉反馈);并且可以用任何形式(包括声输入、语音输入或者、触觉输入)来接收来自用户的输入。
可以将此处描述的***和技术实施在包括后台部件的计算***(例如,作为数据服务器)、或者包括中间件部件的计算***(例如,应用服务器)、或者包括前端部件的计算***(例如,具有图形用户界面或者网络浏览器的用户计算机,用户可以通过该图形用户界面或者该网络浏览器来与此处描述的***和技术的实施方式交互)、或者包括这种后台部件、中间件部件、或者前端部件的任何组合的计算***中。可以通过任何形式或者介质的数字数据通信(例如,通信网络)来将***的部件相互连接。通信网络的示例包括:局域网(LAN)、广域网(WAN)和互联网。
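A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.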
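It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed herein.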
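The specific implementations above do not limit the scope of protection of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.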

Claims (24)

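  1. A speech recognition method, comprising:
    processing speech data to be recognized and an already recognized first text fragment by using an acoustic model, to obtain respective acoustic probabilities of a plurality of candidate text fragments;
    processing the first text fragment by using a first language sub-model in a language model, to obtain respective initial language probabilities of the plurality of candidate text fragments;
    processing the first text fragment by using a constraint sub-model in the language model, to obtain respective extensible relationships of the plurality of candidate text fragments with respect to the first text fragment;
    adjusting the initial language probabilities of the candidate text fragments according to the extensible relationships, to obtain respective first language probabilities of the plurality of candidate text fragments; and
    determining a target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities, to obtain a text sequence for the speech data to be recognized,
    wherein the constraint sub-model is trained on texts in a predetermined text set.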
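The flow of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the use of a plain product to apply the extensible relationship, the log-linear combination, and the `lm_weight` and `floor` parameters are all assumptions chosen for illustration, and the probabilities are made-up numbers.

```python
import math


def adjust_language_probs(initial_lm_probs, extensible):
    """Scale each candidate's initial language probability by its extensibility.

    `extensible` maps a candidate to a value in [0, 1] (0 = cannot extend the
    first text fragment). Using a plain product is an assumption of this sketch.
    """
    return {cand: prob * extensible.get(cand, 0.0)
            for cand, prob in initial_lm_probs.items()}


def pick_target(acoustic_probs, first_lm_probs, lm_weight=0.5, floor=1e-12):
    """Return the candidate with the best log-linear combined score (hypothetical rule)."""
    def score(cand):
        acoustic = math.log(max(acoustic_probs[cand], floor))
        language = math.log(max(first_lm_probs.get(cand, 0.0), floor))
        return acoustic + lm_weight * language
    return max(acoustic_probs, key=score)


# Made-up probabilities for three candidates following the prefix "play the".
acoustic = {"song": 0.5, "son": 0.4, "sung": 0.1}
initial_lm = {"song": 0.3, "son": 0.6, "sung": 0.1}
extensible = {"song": 1.0, "son": 0.0, "sung": 1.0}  # constraint sub-model output

first_lm = adjust_language_probs(initial_lm, extensible)
target = pick_target(acoustic, first_lm)
```

In this toy run the candidate with the highest initial language probability is suppressed because the constraint marks it as non-extensible, so the decision falls to the remaining candidates.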
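  2. The method according to claim 1, wherein processing the first text fragment by using the first language sub-model in the language model, to obtain the respective initial language probabilities of the plurality of candidate text fragments, comprises:
    performing embedding processing on the first text fragment to obtain a text embedding feature;
    determining a first identification feature of a vertical category to which the first text fragment belongs; and
    processing, by using the first language sub-model, a feature obtained by fusing the text embedding feature and the first identification feature, to obtain the respective initial language probabilities of the plurality of candidate text fragments.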
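  3. The method according to claim 2, wherein the language model further comprises a second language sub-model arranged in parallel with the first language sub-model; the method further comprises:
    inputting the text embedding feature into the second language sub-model to obtain a first implicit representation of the first text fragment;
    wherein the first language sub-model comprises a first feature extraction network and a first prediction network; processing, by using the first language sub-model, the feature obtained by fusing the text embedding feature and the first identification feature, to obtain the respective initial language probabilities of the plurality of candidate text fragments, comprises:
    inputting the feature obtained by fusing the text embedding feature and the first identification feature into the first feature extraction network to obtain a second implicit representation; and
    inputting a feature obtained by fusing the first implicit representation and the second implicit representation into the first prediction network, to obtain the respective initial language probabilities of the plurality of candidate text fragments,
    wherein the second language sub-model is trained using sample texts of a plurality of predetermined vertical categories.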
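  4. The method according to claim 3, wherein the second language sub-model comprises a second feature extraction network and a second prediction network; wherein:
    inputting the text embedding feature into the second language sub-model to obtain the first implicit representation of the first text fragment comprises: inputting the text embedding feature into the second feature extraction network to obtain the second implicit representation;
    the method further comprises:
    inputting the second implicit representation into the second prediction network, to obtain respective second language probabilities of the plurality of candidate text fragments; and
    determining the target text fragment according to the second language probabilities, the first language probabilities, and the acoustic probabilities.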
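  5. The method according to claim 3, wherein the language model further comprises a third language sub-model arranged in parallel with the first language sub-model; the method further comprises:
    determining a second identification feature that represents a source of the speech data to be recognized;
    processing, by using the third language sub-model, a feature obtained by fusing the text embedding feature and the second identification feature, to obtain respective third language probabilities of the plurality of candidate text fragments; and
    determining the target text fragment according to the third language probabilities, the first language probabilities, and the acoustic probabilities.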
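  6. The method according to claim 5, wherein the third language sub-model comprises a third feature extraction network and a third prediction network; processing, by using the third language sub-model, the feature obtained by fusing the text embedding feature and the second identification feature, to obtain the respective third language probabilities of the plurality of candidate text fragments, comprises:
    inputting the feature obtained by fusing the text embedding feature and the second identification feature into the third feature extraction network to obtain a third implicit representation; and
    inputting a feature obtained by fusing the first implicit representation and the third implicit representation into the third prediction network, to obtain the respective third language probabilities of the plurality of candidate text fragments.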
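  7. The method according to claim 2, wherein, in a case where the first text fragment is a start identifier of a text, the vertical category to which the first text fragment belongs comprises a plurality of predetermined vertical categories; processing, by using the first language sub-model, the feature obtained by fusing the text embedding feature and the first identification feature, to obtain the respective initial language probabilities of the plurality of candidate text fragments, comprises:
    for each predetermined vertical category, fusing the text embedding feature and an identification feature of that predetermined vertical category to obtain a first fusion feature; and
    processing the first fusion feature by using the first language sub-model, to obtain the respective initial language probabilities of the plurality of candidate text fragments.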
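  8. The method according to claim 1, wherein the plurality of candidate text fragments comprise a plurality of first candidate fragments indicating candidate characters; wherein determining the target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities, to obtain the text sequence for the speech data to be recognized, comprises:
    querying a predetermined text list according to the first text fragment to determine a first designated fragment among the plurality of first candidate fragments, wherein a text formed by concatenating the first text fragment and the first designated fragment belongs to the predetermined text list; and
    determining the target text fragment among the plurality of first candidate fragments according to the first language probability of the first designated fragment and the acoustic probability.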
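  9. The method according to claim 8, wherein the predetermined text list comprises a plurality of texts and a recognition weight of each of the plurality of texts, the recognition weight indicating how difficult the text is to recognize; determining the target text fragment among the plurality of first candidate fragments according to the first language probability of the first designated fragment and the acoustic probability comprises:
    determining, in the predetermined text list, a first target text to which the text formed by concatenating the first text fragment and the first designated fragment belongs; and
    determining the target text fragment among the plurality of candidate text fragments according to the recognition weight of the first target text, the first language probability of the first designated fragment, and the acoustic probability.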
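The list lookup of claims 8 and 9 can be sketched as a prefix query over a weighted text list. This is a minimal sketch under stated assumptions: treating "belongs to the predetermined text list" as a prefix match, taking the largest weight when several texts match, and adding the log of the recognition weight to the score are illustrative choices, and `PREDEFINED_TEXTS`, `designated_candidates`, and `pick_target` are hypothetical names.

```python
import math

# Hypothetical predetermined text list: text -> recognition weight (> 0).
PREDEFINED_TEXTS = {"play jazz": 2.0, "play jam": 1.0}


def designated_candidates(prefix, candidates, text_list):
    """Keep candidates whose concatenation with `prefix` starts a listed text.

    Returns candidate -> weight of the matching text (largest weight if
    several texts match, which is an assumption of this sketch).
    """
    result = {}
    for cand in candidates:
        extended = prefix + cand
        weights = [w for text, w in text_list.items() if text.startswith(extended)]
        if weights:
            result[cand] = max(weights)
    return result


def pick_target(prefix, acoustic, first_lm, text_list):
    """Score only the designated candidates; the additive log-weight is hypothetical."""
    allowed = designated_candidates(prefix, acoustic, text_list)

    def score(cand):
        return (math.log(acoustic[cand]) + math.log(first_lm[cand])
                + math.log(allowed[cand]))

    return max(allowed, key=score) if allowed else None


# Made-up character-level candidates following the prefix "play ja".
acoustic = {"z": 0.3, "m": 0.4, "r": 0.3}
first_lm = {"z": 0.3, "m": 0.3, "r": 0.4}
allowed = designated_candidates("play ja", acoustic, PREDEFINED_TEXTS)
target = pick_target("play ja", acoustic, first_lm, PREDEFINED_TEXTS)
```

Here "r" is dropped because "play jar" starts no listed text, and the larger recognition weight lets "z" win even though "m" has the higher acoustic probability.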
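  10. The method according to claim 8, wherein the plurality of candidate text fragments further comprise a plurality of second candidate fragments indicating candidate slots; the candidate slots comprise slot-entry slots; wherein determining the target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities, to obtain the text sequence for the speech data to be recognized, further comprises:
    determining a target slot, among the slot-entry slots, that belongs to the predetermined text list;
    processing, by using the language model, a feature obtained from a third identification feature of the target slot and a start identifier of a text, to obtain respective fourth language probabilities of the plurality of first candidate fragments with respect to the target slot; and
    determining the target text fragment among the plurality of first candidate fragments according to the fourth language probabilities, the first language probabilities, and the acoustic probabilities.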
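  11. The method according to claim 10, wherein the candidate slots further comprise slot-exit slots; wherein determining the target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities, to obtain the text sequence for the speech data to be recognized, further comprises:
    fusing the first identification feature of the vertical category to which the first text fragment belongs with a fourth identification feature of a slot corresponding to the text fragment at the last position in the first text fragment, to obtain a second fusion feature;
    processing the second fusion feature by using the language model, to obtain respective fifth language probabilities of the plurality of first candidate fragments with respect to the slot-exit slots; and
    determining the target text fragment among the plurality of first candidate fragments according to the fifth language probabilities, the fourth language probabilities, the first language probabilities, and the acoustic probabilities.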
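  12. The method according to claim 10, wherein determining the target slot, among the slot-entry slots, that belongs to the predetermined text list comprises:
    determining slots, among the slot-entry slots, that belong to the predetermined text list, to obtain initial slots; and
    determining the target slot among the initial slots according to differences between the first language probabilities of the second candidate fragments indicating the initial slots and the first language probabilities of the plurality of first candidate fragments,
    wherein the first language probability of the second candidate fragment indicating the target slot is greater than the first language probabilities of the second candidate fragments indicating the slots other than the target slot among the initial slots.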
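  13. The method according to claim 10, wherein determining the target text fragment among the plurality of first candidate fragments according to the fourth language probabilities, the first language probabilities, and the acoustic probabilities comprises:
    querying the predetermined text list according to the first text fragment to obtain a second target text and a second designated fragment among the plurality of first candidate fragments, wherein a text formed by concatenating the first text fragment and a target candidate fragment indicating the target slot corresponding to the second designated fragment belongs to the second target text;
    obtaining an initial probability of the target candidate fragment according to the recognition weight of the second target text and the first language probability of the target candidate fragment; and
    determining the target text fragment among the second designated fragments according to the initial probability and the fourth language probability of the second designated fragment.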
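  14. The method according to claim 10, further comprising:
    in response to the text sequence including a slot text fragment belonging to the candidate slot, querying a text fragment table for the candidate slot according to the slot text fragment;
    in response to the slot text fragment not belonging to the text fragment table, determining the text fragment in the text fragment table that has the greatest similarity to the slot text fragment as a substitute fragment; and
    replacing the slot text fragment in the text sequence with the substitute fragment, to obtain a recognition result for the speech data to be recognized.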
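The post-processing of claim 14 can be sketched in a few lines. The claim does not fix a similarity measure; this sketch assumes the character-level ratio from Python's standard `difflib.SequenceMatcher`, and the function name `repair_slot` and the sample table are hypothetical.

```python
from difflib import SequenceMatcher


def repair_slot(text_sequence, slot_fragment, fragment_table):
    """Replace an out-of-table slot fragment with its most similar table entry.

    Similarity is difflib's ratio (an assumption of this sketch). The sequence
    is returned unchanged when the fragment is absent or already in the table.
    """
    if slot_fragment not in text_sequence or slot_fragment in fragment_table:
        return text_sequence
    substitute = max(
        fragment_table,
        key=lambda entry: SequenceMatcher(None, slot_fragment, entry).ratio(),
    )
    # Replace only the first occurrence of the slot fragment.
    return text_sequence.replace(slot_fragment, substitute, 1)


# Hypothetical text fragment table for a "composer" slot.
composers = ["Beethoven", "Bach", "Brahms"]
fixed = repair_slot("play Beethovan", "Beethovan", composers)
unchanged = repair_slot("play Bach", "Bach", composers)
```

A production system would more likely compare pronunciations than spellings for speech output; string similarity is used here only to keep the sketch self-contained.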
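  15. The method according to any one of claims 10 to 13, further comprising:
    in response to the number of times the language model has processed a first target feature reaching a predetermined number, storing the language probability obtained by the language model processing the first target feature in a cache;
    in response to a need to process a second target feature by using the language model, querying the cache according to the second target feature; and
    in response to the cache storing a language probability for the second target feature, reading the language probability for the second target feature from the cache, thereby completing the processing of the second target feature by the language model,
    wherein the first target feature and the second target feature each comprise any one of the following features: the text embedding feature of the first text fragment; a feature obtained by fusing the text embedding feature and an identification feature of a vertical category; a feature obtained by fusing the text embedding feature and an identification feature of a data source; and a feature obtained by fusing the text embedding feature and an identification feature of a slot.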
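The caching rule of claim 15 can be sketched as a wrapper that counts how often each feature is processed and starts caching once a threshold is reached. This is a minimal sketch: the class name `ThresholdCache`, the use of a hashable tuple as the feature key, and the stand-in model function are assumptions for illustration.

```python
class ThresholdCache:
    """Cache a model's output for a feature once it has been processed
    `threshold` times (hypothetical helper, not the claimed implementation)."""

    def __init__(self, model_fn, threshold=2):
        self.model_fn = model_fn      # stand-in for the language model
        self.threshold = threshold    # the "predetermined number"
        self.counts = {}              # feature -> times processed by the model
        self.cache = {}               # feature -> cached language probabilities
        self.model_calls = 0          # bookkeeping for the example below

    def __call__(self, feature):
        # Cache hit: reading the stored result completes the processing.
        if feature in self.cache:
            return self.cache[feature]
        probs = self.model_fn(feature)
        self.model_calls += 1
        self.counts[feature] = self.counts.get(feature, 0) + 1
        # Store the result once the feature has been processed often enough.
        if self.counts[feature] >= self.threshold:
            self.cache[feature] = probs
        return probs


# Features must be hashable here, so a tuple stands in for a feature vector.
lm = ThresholdCache(lambda feature: {"next": sum(feature)}, threshold=2)
outputs = [lm((1, 2)) for _ in range(4)]
```

With a threshold of 2, the first two calls reach the model and the remaining two are served from the cache, so rarely seen features never occupy cache space.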
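  16. The method according to any one of claims 10 to 13, wherein:
    the operation of determining the target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities is performed by a graphics processing unit provided on an electronic device.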
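  17. The method according to claim 1, wherein samples for training the constraint sub-model comprise positive samples and negative samples, wherein the positive samples comprise the texts in the predetermined text set, and the negative samples are obtained by:
    adjusting the predetermined text according to a second text fragment, among the plurality of candidate text fragments, that is inconsistent with the text fragment at a target position in the predetermined text, to obtain the negative sample.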
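  18. The method according to claim 17, wherein adjusting the predetermined text according to the text fragment, among the plurality of candidate text fragments, that is inconsistent with the text fragment at the target position in the predetermined text, to obtain the negative sample, comprises:
    determining a replacement fragment among the second text fragments according to a confusion relationship between the second text fragments and the text fragment at the target position in the predetermined text; and
    replacing the text fragment at the target position in the predetermined text with the replacement fragment, to obtain the negative sample.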
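  19. The method according to claim 17, wherein adjusting the predetermined text according to the text fragment, among the plurality of candidate text fragments, that is inconsistent with the text fragment at the target position in the predetermined text, to obtain the negative sample, comprises:
    replacing the text fragment at the target position in the predetermined text with the second text fragment, to obtain candidate samples;
    processing each of the candidate samples by using the first language sub-model, to obtain a sixth language probability of each sample; and
    filtering the candidate samples according to the sixth language probabilities, to obtain the negative sample.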
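  20. The method according to claim 17, wherein adjusting the predetermined text according to the second text fragment, among the plurality of candidate text fragments, that is inconsistent with the text fragment at the target position in the predetermined text, to obtain the negative sample, comprises:
    replacing the text fragment at the target position in the predetermined text with the second text fragment, to obtain an initial text; and
    removing the text fragments after the target position from the initial text, to obtain the negative sample.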
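The negative-sample construction of claims 17, 18, and 20 can be sketched as follows. Claims 18 and 20 are separate dependent claims; combining the confusion filter with truncation in one function is an assumption made only to keep the example short, and `make_negative_samples` and the confusion table are hypothetical.

```python
def make_negative_samples(text, position, candidates, confusable=None):
    """Build negative samples from one predetermined text (a list of fragments).

    For each candidate that differs from the fragment at `position`, substitute
    it and drop everything after `position` (the truncation of claim 20). When
    a `confusable` table is given, keep only candidates confusable with the
    original fragment (in the spirit of claim 18).
    """
    original = text[position]
    negatives = []
    for cand in candidates:
        if cand == original:
            continue  # identical fragment would reproduce a positive sample
        if confusable is not None and cand not in confusable.get(original, ()):
            continue
        negatives.append(text[:position] + [cand])
    return negatives


# Hypothetical predetermined text and confusion table.
positive = ["turn", "on", "the", "light"]
confusion = {"on": {"own"}, "light": {"night", "right"}}

mid_negatives = make_negative_samples(positive, 1, ["on", "own", "in"], confusion)
end_negatives = make_negative_samples(
    positive, 3, ["light", "night", "right", "lamp"], confusion
)
```

Truncating after the substituted position gives the constraint sub-model prefixes that should be rejected exactly at the point where they leave the predetermined text.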
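  21. A speech recognition apparatus, comprising:
    an acoustic probability obtaining module, configured to process speech data to be recognized and an already recognized first text fragment by using an acoustic model, to obtain respective acoustic probabilities of a plurality of candidate text fragments;
    an initial probability obtaining module, configured to process the first text fragment by using a first language sub-model in a language model, to obtain respective initial language probabilities of the plurality of candidate text fragments;
    an extensible relationship obtaining module, configured to process the first text fragment by using a constraint sub-model in the language model, to obtain respective extensible relationships of the plurality of candidate text fragments with respect to the first text fragment;
    a probability adjustment module, configured to adjust the initial language probabilities of the candidate text fragments according to the extensible relationships, to obtain respective first language probabilities of the plurality of candidate text fragments; and
    a text determination module, configured to determine a target text fragment among the plurality of candidate text fragments according to the first language probabilities and the acoustic probabilities, to obtain a text sequence for the speech data to be recognized,
    wherein the constraint sub-model is trained on texts in a predetermined text set.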
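  22. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1 to 20.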
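  23. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 20.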
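  24. A computer program product, comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 20.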
PCT/CN2023/072417 2022-09-01 2023-01-16 Speech recognition method, apparatus, device, and medium WO2024045475A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020247014438A KR20240067971A (ko) 2022-09-01 2023-01-16 Speech recognition method, speech recognition apparatus, electronic device, storage medium, and computer program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211064891.8A CN115132209B (zh) 2022-09-01 2022-09-01 Speech recognition method, apparatus, device, and medium
CN202211064891.8 2022-09-01

Publications (1)

Publication Number Publication Date
WO2024045475A1 (zh) 2024-03-07

Family

ID=83387371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/072417 WO2024045475A1 (zh) 2022-09-01 2023-01-16 Speech recognition method, apparatus, device, and medium

Country Status (3)

Country Link
KR (1) KR20240067971A (zh)
CN (1) CN115132209B (zh)
WO (1) WO2024045475A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118095209A (zh) * 2024-04-12 2024-05-28 清华大学 Dynamic speculative decoding method, apparatus, device, and medium for large language models

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
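CN115132209B (zh) * 2022-09-01 2022-11-08 北京百度网讯科技有限公司 Speech recognition method, apparatus, device, and medium
CN115662397B (zh) * 2022-12-29 2023-04-18 北京百度网讯科技有限公司 Speech signal processing method and apparatus, electronic device, and storage medium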

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
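CN109313896A (zh) * 2016-06-08 2019-02-05 谷歌有限责任公司 Scalable dynamic class language modeling
CN110263158A (zh) * 2019-05-24 2019-09-20 阿里巴巴集团控股有限公司 Data processing method, apparatus, and device
CN110291582A (zh) * 2017-02-14 2019-09-27 谷歌有限责任公司 Language model biasing system
CN112767921A (zh) * 2021-01-07 2021-05-07 国网浙江省电力有限公司 Speech recognition adaptation method and system based on a cached language model
CN114218945A (zh) * 2021-11-22 2022-03-22 深圳价值在线信息科技股份有限公司 Entity recognition method and apparatus, server, and storage medium
CN115132209A (zh) * 2022-09-01 2022-09-30 北京百度网讯科技有限公司 Speech recognition method, apparatus, device, and medium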

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5268990A (en) * 1991-01-31 1993-12-07 Sri International Method for recognizing speech using linguistically-motivated hidden Markov models
US10789539B2 (en) * 2015-12-31 2020-09-29 Nuance Communications, Inc. Probabilistic ranking for natural language understanding
US10056083B2 (en) * 2016-10-18 2018-08-21 Yen4Ken, Inc. Method and system for processing multimedia content to dynamically generate text transcript
CN108492820B (zh) * 2018-03-20 2021-08-10 华南理工大学 Chinese speech recognition method based on a recurrent neural network language model and a deep neural network acoustic model
EP3979121A1 (en) * 2020-10-01 2022-04-06 Naver Corporation Method and system for controlling distributions of attributes in language models for text generation
CN113129870B (zh) * 2021-03-23 2022-03-25 北京百度网讯科技有限公司 Training method, apparatus, device, and storage medium for a speech recognition model
CN114187914A (zh) * 2021-12-17 2022-03-15 广东电网有限责任公司 Speech recognition method and system

Also Published As

Publication number Publication date
CN115132209A (zh) 2022-09-30
CN115132209B (zh) 2022-11-08
KR20240067971A (ko) 2024-05-17

Similar Documents

Publication Publication Date Title
EP3648099B1 (en) Voice recognition method, device, apparatus, and storage medium
CN107301170B (zh) 基于人工智能的切分语句的方法和装置
WO2024045475A1 (zh) 语音识别方法、装置、设备和介质
JP7301922B2 (ja) 意味検索方法、装置、電子機器、記憶媒体およびコンピュータプログラム
CN113470619B (zh) 语音识别方法、装置、介质及设备
WO2022121251A1 (zh) 文本处理模型训练方法、装置、计算机设备和存储介质
CN112395385B (zh) 基于人工智能的文本生成方法、装置、计算机设备及介质
JP4930379B2 (ja) 類似文検索方法、類似文検索システム及び類似文検索用プログラム
WO2023024975A1 (zh) 文本处理方法、装置和电子设备
CN112784581B (zh) 文本纠错方法、装置、介质及电子设备
CN112017643B (zh) 语音识别模型训练方法、语音识别方法及相关装置
CN110263218B (zh) 视频描述文本生成方法、装置、设备和介质
CN114154487A (zh) 文本自动纠错方法、装置、电子设备及存储介质
JP2022120024A (ja) オーディオ信号処理方法、モデルトレーニング方法、並びにそれらの装置、電子機器、記憶媒体及びコンピュータプログラム
US11996084B2 (en) Speech synthesis method and apparatus, device and computer storage medium
CN112100339A (zh) 用于智能语音机器人的用户意图识别方法、装置和电子设备
US20210233520A1 (en) Contextual multi-channel speech to text
CN111046217A (zh) 组合歌曲生成方法、装置、设备以及存储介质
CN113343692A (zh) 搜索意图的识别方法、模型训练方法、装置、介质及设备
CN112925912A (zh) 文本处理方法、同义文本召回方法及装置
WO2023207690A1 (zh) 一种文本生成方法、装置、电子设备及介质
CN114880520B (zh) 视频标题生成方法、装置、电子设备和介质
CN113191140B (zh) 文本处理方法、装置、电子设备及存储介质
CN115309994A (zh) 地点检索方法、电子设备以及存储介质
CN112925889A (zh) 自然语言处理方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 23858518; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase
Ref document number: 20247014438; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase
Ref document number: 2023858518; Country of ref document: EP; Effective date: 20240426