WO2023273578A1 - Speech recognition method and apparatus, and medium and device - Google Patents

Speech recognition method and apparatus, and medium and device

Info

Publication number
WO2023273578A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
training
feature vector
words
context
Prior art date
Application number
PCT/CN2022/089595
Other languages
French (fr)
Chinese (zh)
Inventor
董林昊
韩明伦
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2023273578A1 publication Critical patent/WO2023273578A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a speech recognition method, device, medium and equipment.
  • the present disclosure provides a speech recognition method, the method comprising:
  • according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized is obtained; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained;
  • a target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • the obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector includes:
  • for each of the predicted characters, in the attention module, determine the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and the text feature vector corresponding to each of the hot words.
  • the determining the contextual feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, the fusion feature vector and the text feature vector corresponding to each of the hot words includes:
  • the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words is determined as the initial weight corresponding to the hot word;
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • the determining the contextual feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, the fusion feature vector and the text feature vector corresponding to each of the hot words further includes:
  • the weighting of the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector includes:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • the obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes:
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • determine the phonetic sequence, text sequence and training label of the training word in the following manner:
  • for each of the training words, determine the phonetic sequence of the training word according to the language of the training word, and determine the text sequence from the training marked text;
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • the present disclosure provides a speech recognition device, the device comprising:
  • a receiving module configured to receive voice data to be recognized
  • a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in the first aspect are implemented.
  • an electronic device including:
  • a processing device configured to execute the computer program in the storage device to implement the steps of any one of the methods in the first aspect.
  • FIG. 1 is a flow chart of a speech recognition method provided according to an embodiment of the present disclosure
  • FIG. 2 is a schematic structural diagram of a speech recognition model provided according to an embodiment of the present disclosure
  • FIG. 4 is a block diagram of a speech recognition device provided according to an embodiment of the present disclosure.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • step 11 voice data to be recognized is received.
  • step 12 according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized is obtained.
  • the hot word information includes text sequences and phonetic symbol sequences corresponding to multiple hot words.
  • the hot word information may be a hot word corresponding to a specific application context, so as to provide prior context knowledge for the recognition process of the speech data to be recognized.
  • the phonetic symbol sequence is used to indicate the pronunciation of the hot word. If the hot word is in Chinese, its phonetic symbol sequence contains tonal pinyin; if the hot word is in English, its phonetic symbol sequence contains British or American phonetic symbols, etc.
  • the text sequence of the hot word may be the hot word text itself.
  • the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words. Therefore, in the training process of the context recognition sub-model, the pronunciation features and text features corresponding to the training words can be used for training at the same time, so that the training features of the context recognition sub-model are more comprehensive and rich, which improves the accuracy of subsequent hot word determination and provides comprehensive data support for it.
  • for example, the hot word information contains the hot words "Zhang Zhiguo" and "Zhang Zhiguo", two names with similar pronunciation.
  • the real text corresponding to the voice data to be recognized should be “Zhang Zhiguo said that she is getting married recently”. If the hot word information contains hot words with similar pronunciation, speech recognition may produce the recognition text "Zhang Zhiguo said that she is going to get married recently" with the wrong hot word, that is, the hot words are confused.
  • in the present disclosure, the phonetic sequence corresponding to one "Zhang Zhiguo" is "zhang1zhi4guo2" and the phonetic sequence corresponding to the other "Zhang Zhiguo" is "zhang1zhi1guo3", so hot words with similar pronunciation can be distinguished and the recognition result "Zhang Zhiguo said that she is getting married recently" can be obtained, improving the accuracy of speech recognition.
  • the speech recognition model used to recognize the speech data to be recognized may include a speech recognition sub-model and a context recognition sub-model, then in the process of speech recognition, speech recognition may be performed based on the speech recognition sub-model.
  • the context recognition sub-model can be combined to improve the accuracy of hot word recognition in the speech data to be recognized, thereby improving the accuracy of speech recognition.
  • when the context recognition sub-model is trained, it is trained in combination with the pronunciation features and text features of the training data. Based on the pronunciation features, hot words with similar spelling or pronunciation can be accurately distinguished. Therefore, when identifying hot words, the multiple features can be combined to accurately pick out the right one among multiple hot words, avoiding confusion between hot words with similar spelling or pronunciation, further improving the accuracy of speech recognition and improving user experience.
  • the speech recognition model 10 may include a speech recognition submodel 100 and a context recognition submodel 200, and the context recognition submodel 200 includes a pronunciation feature encoder 201 , a textual feature encoder 202 , an attention module 203 and a contextual feature decoder 204 .
  • an exemplary implementation of obtaining the target text corresponding to the speech data to be recognized is as follows; as shown in FIG. 3, this step may include:
  • step 31 the phonetic symbol sequence of the hot word is encoded according to the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and the text sequence of the hot word is encoded according to the text feature encoder to obtain the text feature vector of the hot word.
  • step 32 according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained.
  • the speech recognition sub-model may further include an encoder 101, a prediction sub-model 102 and a decoder 103, wherein the prediction sub-model may be a CIF (Continuous Integrate-and-Fire) model.
  • the voice data per second can be divided into multiple audio frames, so as to perform data processing based on the audio frames.
  • the voice data per second can be divided into 100 audio frames for processing.
  • the obtained acoustic vector sequence H can be expressed as:
  • $H = \{H_1, H_2, \ldots, H_U\}$
  • U is used to represent the number of audio frames in the input speech data to be recognized, that is, the length of the acoustic vector sequence.
  • the acoustic vector can be input into the predicting sub-model, and the predicting sub-model can predict the amount of information on the acoustic vector to obtain the amount of information corresponding to the audio frame.
  • the acoustic vectors of the audio frames may be combined according to the amount of information of multiple audio frames to obtain character acoustic vectors.
  • the amount of information corresponding to each predicted character is the same, so the amounts of information corresponding to the audio frames can be accumulated from left to right.
  • when the accumulated amount of information reaches the preset threshold, the audio frames covered by the accumulation are considered to form one predicted character;
  • one predicted character corresponds to one or more audio frames.
  • the preset threshold may be set according to actual application scenarios and experience, for example, the preset threshold may be set to 1, which is not limited in the present disclosure.
  • the acoustic vectors of the audio frames may be combined according to the amount of information of multiple audio frames in the following manner:
  • if the accumulated amount of information exceeds the preset threshold at the second audio frame, the amount of information of the second audio frame can be divided into two parts, that is, one part belongs to the current predicted character and the remaining part belongs to the next predicted character.
  • the remaining part and the amount of information W3 of the third audio frame are then accumulated until the preset threshold is reached, and the audio frames corresponding to the next predicted character are obtained.
  • the amount of information of the subsequent audio frames can be deduced by analogy, and combined in the above manner to obtain each predicted character corresponding to the plurality of audio frames.
  • the weighted sum of the acoustic vectors of the audio frames corresponding to a predicted character can be determined as the character acoustic vector of that predicted character.
  • the weight of the acoustic vector of each audio frame corresponding to the predicted character is the amount of information that the audio frame contributes to the predicted character. If the audio frame belongs entirely to the predicted character, the weight of its acoustic vector is the amount of information of the audio frame; if only part of the audio frame belongs to the predicted character, the weight is the amount of information of that part.
  • the character acoustic vector C1 corresponding to the predicted character can be expressed as:
  • the character acoustic vector C2 corresponding to the predicted character can be expressed as:
  • the character acoustic vector of each predicted character can be decoded based on the decoder, so as to obtain the text probability distribution of the predicted character.
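The accumulate-and-fire procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the disclosure: the function name `cif_integrate`, the list-based vectors and the toy inputs are hypothetical, and the threshold of 1 follows the example value mentioned above.

```python
from typing import List


def cif_integrate(frames: List[List[float]],
                  weights: List[float],
                  threshold: float = 1.0) -> List[List[float]]:
    """Accumulate per-frame information amounts from left to right and emit
    one character acoustic vector each time the running sum reaches the
    threshold. A frame that crosses the boundary is split into two parts."""
    dim = len(frames[0])
    chars: List[List[float]] = []
    acc = 0.0                 # information accumulated for the current character
    vec = [0.0] * dim         # weighted sum of frame vectors for the current character
    for h, w in zip(frames, weights):
        remaining = w
        while acc + remaining >= threshold:
            used = threshold - acc          # part belonging to the current character
            vec = [v + used * x for v, x in zip(vec, h)]
            chars.append(vec)
            remaining -= used               # rest belongs to the next character
            acc = 0.0
            vec = [0.0] * dim
        acc += remaining
        vec = [v + remaining * x for v, x in zip(vec, h)]
    return chars


# Toy input: three 2-dimensional frame vectors with information amounts W1..W3.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.75, 0.75]
chars = cif_integrate(frames, weights)
```

With these toy values the second frame is split: 0.5 of its amount completes the first character and the remaining 0.25 starts the second one, so two character acoustic vectors are produced.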
  • step 33 according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, the context feature vector of each predicted character is obtained.
  • in this step, when determining the context feature vector of the predicted character, the pronunciation features and text features of each hot word are combined so that multiple features of each hot word are considered comprehensively, which improves the richness and accuracy of the features in the context feature vector.
  • the pronunciation feature vector, text feature vector and character acoustic vector are combined in the attention module to ensure the matching between each hot word and the voice data when performing attention calculation. Its specific calculation method is described below.
  • step 34 the context probability distribution of each predicted character is obtained according to the context feature decoder and the context feature vector.
  • the context feature vector of each predicted character can be decoded based on the context feature decoder, so that the context probability distribution of the predicted character can be obtained.
  • step 35 the target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • each predicted character its text probability distribution and context probability distribution may be weighted and summed, so as to obtain a comprehensive probability distribution corresponding to the predicted character. Then, based on the comprehensive probability distribution, the recognition character corresponding to each predicted character can be determined through a Greedy Search algorithm or a Beam Search algorithm, so as to obtain the target text.
  • the above-mentioned search algorithm is a common method in the art, and will not be repeated here.
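The weighted summation and greedy selection in step 35 can be illustrated as follows. This is only a sketch: the interpolation weight `ctx_weight`, the function names and the toy vocabulary are assumptions, since the text does not specify how the two distributions are weighted.

```python
from typing import List, Sequence


def combine(text_probs: Sequence[float],
            ctx_probs: Sequence[float],
            ctx_weight: float) -> List[float]:
    """Weighted sum of the text and context probability distributions."""
    return [(1.0 - ctx_weight) * t + ctx_weight * c
            for t, c in zip(text_probs, ctx_probs)]


def greedy_decode(text_dists: Sequence[Sequence[float]],
                  ctx_dists: Sequence[Sequence[float]],
                  vocab: Sequence[str],
                  ctx_weight: float = 0.3) -> List[str]:
    """For each predicted character, pick the entry with the highest combined probability."""
    out = []
    for t, c in zip(text_dists, ctx_dists):
        p = combine(t, c, ctx_weight)
        best = max(range(len(p)), key=p.__getitem__)
        out.append(vocab[best])
    return out


vocab = ["a", "b"]
text_dists = [[0.6, 0.4], [0.45, 0.55]]
ctx_dists = [[0.1, 0.9], [0.5, 0.5]]
with_context = greedy_decode(text_dists, ctx_dists, vocab, ctx_weight=0.5)
without_context = greedy_decode(text_dists, ctx_dists, vocab, ctx_weight=0.0)
```

In this toy case the context distribution changes the first character from "a" to "b", which is the effect the hot word information is meant to have. A beam search would keep several candidate sequences instead of a single best entry per position.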
  • the pronunciation feature vector, the text feature vector and the character acoustic vector an example of the context feature vector of each predicted character is obtained.
  • This step may include:
  • For each hot word according to the pronunciation feature vector and the text feature vector of the hot word, determine the fusion feature vector corresponding to the hot word.
  • the fusion feature vector can be obtained by concatenating the pronunciation feature vector and the text feature vector.
  • for each of the predicted characters, in the attention module, determine the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and the text feature vector corresponding to each of the hot words.
  • the attention module can use the character acoustic vector and each fused feature vector to determine the degree of attention of the current predicted character to each hot word, so as to provide data support for subsequent recognition and judgment of hot words.
  • an exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and the text feature vector corresponding to each of the hot words is as follows, and this step may include:
  • the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word is determined as the initial weight corresponding to the hot word.
  • the dot-product attention between the character acoustic vector Ci and the fusion feature vector can be calculated based on multi-head attention, and the average of the resulting dot-product attention values is then used as the comprehensive weight corresponding to the fusion feature vector, that is, the initial weight of the hot word corresponding to the fusion feature vector.
  • the initial weight can be used to represent the degree of attention to each hot word in the character acoustic vector of the predicted character.
  • the initial weight corresponding to each of the hot words is normalized to obtain the target weight corresponding to each of the hot words.
  • the initial weights Q1-Qn corresponding to the hot words can be normalized, for example by applying softmax to Q1-Qn, so that the weights of all hot words are mapped onto a common scale. This makes the target weights of the hot words easy to compare, so that the hot word most likely to correspond to the predicted character can be determined.
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • the products of the target weight of each hot word and its text feature vector can be accumulated to obtain the context feature vector, so that a text feature vector with a higher target weight is represented more prominently in the context feature vector.
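The three steps above (dot product, normalization, weighted sum) can be sketched as a single-head attention in plain Python. The sketch makes assumptions that the text does not state: the character acoustic vector is taken to have the same length as the fusion feature vector so that the dot product is defined, a single head replaces the averaged multi-head computation, and all names and toy values are hypothetical.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_feature_vector(char_vec: List[float],
                           pron_vecs: List[List[float]],
                           text_vecs: List[List[float]]
                           ) -> Tuple[List[float], List[float]]:
    """Return (target weights, context feature vector) for one predicted character."""
    # Fusion feature vector: concatenation of pronunciation and text feature vectors.
    fused = [p + t for p, t in zip(pron_vecs, text_vecs)]
    # Initial weight: dot product of the character acoustic vector with each fused vector.
    initial = [sum(c * f for c, f in zip(char_vec, fv)) for fv in fused]
    # Target weight: normalized initial weights.
    target = softmax(initial)
    # Context feature vector: target-weighted sum of the text feature vectors.
    dim = len(text_vecs[0])
    ctx = [sum(w * tv[d] for w, tv in zip(target, text_vecs)) for d in range(dim)]
    return target, ctx


# Two hot words with 2-dimensional pronunciation and text feature vectors.
char_vec = [1.0, 0.0, 0.0, 0.0]
pron_vecs = [[2.0, 0.0], [0.0, 0.0]]
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
target, ctx = context_feature_vector(char_vec, pron_vecs, text_vecs)
```

Here the first hot word matches the character acoustic vector through its pronunciation features, so it receives the larger target weight and its text feature vector dominates the context feature vector.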
  • the exemplary implementation is as follows, on the basis of the previous embodiment, it also includes:
  • the target weights are sorted from large to small, and the target weights ranked after the M-th position are updated to zero, where M is a positive integer.
  • M can be set according to actual usage scenarios, for example, M can be set to 20.
  • the target weight is used to represent the attention degree of the current predicted character to each hot word, so when the target weight corresponding to a hot word is small and ranked late, the possibility that the predicted character corresponds to that hot word is low. The target weight of that hot word can therefore be set directly to 0, and the hot word judgment can focus on hot words with higher probability.
  • the target weights are sorted in descending order, the first M target weights are retained, and the target weights ranked after the M-th position are set to zero.
  • accordingly, an exemplary implementation of weighting the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector may include:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • in this way, hot words with low probability can be excluded directly based on the target weight, which reduces the amount of computation to a certain extent and improves the efficiency of hot-word-enhanced recognition while preserving its accuracy.
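A minimal sketch of the top-M update of the target weights follows. The function name `keep_top_m` is hypothetical, and the sketch does not renormalize the retained weights because the text does not mention doing so.

```python
from typing import List


def keep_top_m(target_weights: List[float], m: int) -> List[float]:
    """Keep the M largest target weights and set all others to zero."""
    order = sorted(range(len(target_weights)),
                   key=lambda i: target_weights[i],
                   reverse=True)
    keep = set(order[:m])
    return [w if i in keep else 0.0 for i, w in enumerate(target_weights)]


updated = keep_top_m([0.1, 0.5, 0.3, 0.1], m=2)
```

The updated weights can then be used in place of the original target weights when the text feature vectors are summed.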
  • an exemplary implementation of obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector is as follows, and this step may include:
  • a target feature vector of the predicted character is obtained according to the character acoustic vector of the predicted character and the context feature vector.
  • the character acoustic vector of the predicted character and the context feature vector may be concatenated to obtain the target feature vector.
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • when the context feature decoder performs decoding, it decodes the target feature vector, which contains both the audio features of the input speech data and the relevant features of each hot word. This improves the match between the context probability distribution and both the speech data to be recognized and the hot words, provides accurate and comprehensive data support for the subsequent determination of the target text, and further improves the diversity of features in the speech recognition process, thereby improving the accuracy of the speech recognition result.
  • the phonetic sequence, text sequence and training label of the training word are determined in the following manner:
  • an N-gram word may be randomly extracted from the text, and this word may be used as a candidate word. Part of the words can then be randomly selected from the candidate words as the training words.
  • candidate texts may be determined from the training labeled text of the training sample, and then for each candidate text, an N-gram word may be randomly extracted from the candidate text as a training word. This ensures the diversity and randomness of the training words.
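The random N-gram extraction can be sketched as below. The word-level tokenization, the function name and the fixed seed are assumptions made only so the example is reproducible.

```python
import random
from typing import List


def sample_training_word(tokens: List[str], n: int, rng: random.Random) -> List[str]:
    """Randomly extract one contiguous N-gram from a tokenized training labeled text."""
    if n > len(tokens):
        raise ValueError("n is larger than the number of tokens")
    start = rng.randrange(len(tokens) - n + 1)
    return tokens[start:start + n]


tokens = "convex optimization theory is an important course".split()
word = sample_training_word(tokens, 3, random.Random(0))
```

Using a dedicated `random.Random` instance keeps the sampling independent of any global random state.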
  • the phonetic sequence of the training word is determined according to the language of the training word, and the text sequence is determined from the training marked text.
  • the training word itself extracted from the training marked text can be used directly as the text sequence, and the phonetic symbol sequence type corresponding to each language can be preset, for example setting the phonetic symbol sequence for Chinese to a pinyin sequence and the phonetic symbol sequence for English to a sequence of British phonetic symbols.
  • the corresponding phonetic sequence can be determined by querying based on an electronic dictionary.
  • for a Chinese word, a Chinese dictionary can be queried for each character in the word to obtain the tonal pinyin of each character, and the tonal pinyin of the characters can then be concatenated to obtain the phonetic sequence of the word; alternatively, the dictionary can be queried with the whole word to obtain its phonetic sequence directly. For example, for the training word "convex optimization theory", the corresponding phonetic sequence is expressed as "tu1you1hua4li3lun4".
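The per-character dictionary lookup can be illustrated with a toy lexicon. Everything here is hypothetical: `TOY_PINYIN` stands in for a real electronic dictionary, and the Chinese characters are an assumed original of the translated example word, chosen because their tonal pinyin matches the sequence given above.

```python
from typing import Dict

# Toy stand-in for an electronic dictionary: character -> tonal pinyin.
TOY_PINYIN: Dict[str, str] = {
    "凸": "tu1",
    "优": "you1",
    "化": "hua4",
    "理": "li3",
    "论": "lun4",
}


def phonetic_sequence(word: str, lexicon: Dict[str, str] = TOY_PINYIN) -> str:
    """Look up each character and concatenate the tonal pinyin.
    Raises KeyError for characters missing from the lexicon."""
    return "".join(lexicon[ch] for ch in word)


seq = phonetic_sequence("凸优化理论")
```

A real lexicon would also need a policy for characters with several readings, which the word-level query mentioned above can help resolve.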
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • for example, if the training label text is "convex optimization theory is an important course" and the training word extracted from it is "convex optimization theory", the text other than the training word, that is, "is an important course", is replaced with the preset label to obtain the training label "Convex Optimization Theory *****".
  • the preset label has no actual meaning, and it is used to represent that the corresponding text here has no corresponding prior contextual knowledge.
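Generating a training label by masking everything outside the training word can be sketched as follows. The token granularity is an assumption: the sketch masks one preset label per word-level token, whereas the translated example above shows five labels, which suggests the original operates on characters. The function name and the "*" label are hypothetical.

```python
from typing import List


def make_training_label(tokens: List[str],
                        word_tokens: List[str],
                        preset: str = "*") -> List[str]:
    """Replace every token outside the first occurrence of the training word
    with the preset label."""
    n, k = len(tokens), len(word_tokens)
    for start in range(n - k + 1):
        if tokens[start:start + k] == word_tokens:
            return [t if start <= i < start + k else preset
                    for i, t in enumerate(tokens)]
    raise ValueError("training word not found in the training text")


text = "convex optimization theory is an important course".split()
label = make_training_label(text, ["convex", "optimization", "theory"])
```

The label keeps the same length as the training text, so each predicted position has either a training word token or the preset label as its target.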
  • in this way, the training samples can be processed automatically to obtain training data for the context recognition sub-model. When the training data is obtained, the text features and pronunciation features of the training words can be extracted at the same time, so that each word can be identified more accurately and words with similar spelling or pronunciation can be distinguished. This provides more comprehensive and reliable feature information for training the context recognition sub-model and improves, to a certain extent, the accuracy of the trained context recognition sub-model.
  • the speech recognition model can be trained in the following manner:
  • the phonetic sequence, text sequence and training labels of the training words in the training samples are determined.
  • the speech recognition sub-model and the context recognition sub-model may be trained separately.
  • the speech recognition sub-model can be trained by using the training speech data as input and the corresponding training marked text as the target output.
  • the training may be performed based on commonly used training methods in this field, which will not be repeated here.
  • the context recognition sub-model can be further trained based on the trained speech recognition sub-model.
  • the context recognition sub-model is used to determine the prior context knowledge corresponding to the hot words. Therefore, in the embodiment of the present disclosure, the training words can be obtained directly from the training samples and training can be performed based on the prior context knowledge of the training words, which ensures the diversity of training words and can improve the stability and generalization of the model.
  • the phonetic sequence of the training word is used to represent the pronunciation of the training word. If the training sample is Chinese, its corresponding phonetic sequence contains tonal pinyin; when the training sample is English, its corresponding phonetic sequence contains British Phonetic symbols or American phonetic symbols, etc.
  • the text sequence of the training word may be the recognition text corresponding to the training word.
  • the training label of the training word is used to represent the target output corresponding to the training word.
  • according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, the context feature vector of each predicted character is obtained;
  • the target loss of the context recognition sub-model can be determined according to the output text and the training labels corresponding to the training words.
  • the cross-entropy loss can be calculated according to the output text and the training label, and the cross-entropy loss can be determined as the target loss.
  • the update condition may be that the target loss is greater than a preset loss threshold, which means that the recognition accuracy of the context recognition sub-model is insufficient.
  • the update condition may be that the number of iterations is less than a preset number threshold, and at this time, it is considered that the number of iterations of the context recognition sub-model is relatively small, and its recognition accuracy is insufficient.
  • the model parameters of the context recognition sub-model can be updated according to the target loss.
  • the method of updating the model parameters based on the determined loss may adopt a commonly used updating method in the field, such as the gradient descent method, which will not be repeated here.
  • the training process can then be stopped to obtain the trained context recognition sub-model, and thereby the trained speech recognition model.
  • the model parameters of the trained speech recognition sub-model are kept unchanged.
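The update condition and stopping rule can be illustrated with a toy gradient descent loop. This is a stand-in only: a single scalar parameter and a quadratic loss replace the context recognition sub-model and its cross-entropy loss, and combining the loss threshold with the iteration limit in one condition is an assumption, since the text presents them as alternative update conditions.

```python
from typing import Tuple


def train_until_converged(w: float = 0.0,
                          lr: float = 0.1,
                          loss_threshold: float = 1e-4,
                          max_iters: int = 1000) -> Tuple[float, float, int]:
    """Toy training loop: keep updating while the loss exceeds the threshold
    and the iteration count is below the limit."""
    target = 3.0                      # toy optimum standing in for the training label
    iters = 0
    loss = (w - target) ** 2
    while loss > loss_threshold and iters < max_iters:
        grad = 2.0 * (w - target)     # gradient of the toy loss
        w -= lr * grad                # gradient descent update
        loss = (w - target) ** 2
        iters += 1
    return w, loss, iters


w, loss, iters = train_until_converged()
```

In the setting described above, only the parameters of the context recognition sub-model would be updated in such a loop, with the trained speech recognition sub-model held fixed.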
  • in this way, the context recognition sub-model can be added on top of the trained speech recognition sub-model to form the speech recognition model, which improves the scalability and application range of the training method; at the same time, while the accuracy of the speech recognition sub-model is preserved, more accurate prior context knowledge is provided to improve the recognition accuracy of the speech recognition model.
  • the present disclosure also provides a speech recognition device, as shown in FIG. 4 , the device 40 includes:
  • the receiving module 41 is used to receive voice data to be recognized
  • the processing module 42 is used to obtain the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • the processing module includes:
  • the first processing submodule is used to obtain the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized according to the speech recognition submodel and the speech data to be recognized;
  • the first determining submodule is configured to determine the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
  • the second processing submodule includes:
  • the fourth determining submodule is used to determine the dot product of the character acoustic vector and the fused feature vector corresponding to each of the hot words as the initial weight corresponding to the hot word;
  • the third processing submodule is used to normalize the initial weight corresponding to each of the hot words, and obtain the target weight corresponding to each of the hot words;
  • the calculation sub-module is used for:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • the first decoding submodule includes:
  • the fourth processing submodule is used to obtain the target feature vector of the predicted character according to the acoustic character vector of the predicted character and the context feature vector for each predicted character;
  • the second decoding submodule is configured to decode the target feature vector according to the context feature decoder to obtain a context probability distribution corresponding to each of the predicted characters.
  • determine the phonetic sequence, text sequence and training label of the training word in the following manner:
  • For each of the training words determine the phonetic sequence of the training words according to the language of the training words, and determine the text sequence from the training marked text;
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • Referring to FIG. 5, a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure is shown.
  • the terminal equipment in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (e.g., car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 are also stored.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 608 including, for example, a magnetic tape, hard disk, etc.; and a communication device 609.
  • the communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 5 shows electronic device 600 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the electronic device, the electronic device: receives the speech data to be recognized; and obtains, according to the speech data to be recognized, hot word information and a speech recognition model, the target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on the training words and on the phonetic symbol sequences, text sequences and training labels of the training words.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation on the module itself under certain circumstances, for example, the receiving module can also be described as "a module that receives speech data to be recognized".
  • FPGAs (Field Programmable Gate Arrays)
  • ASICs (Application Specific Integrated Circuits)
  • ASSPs (Application Specific Standard Products)
  • SOCs (Systems on Chip)
  • CPLDs (Complex Programmable Logic Devices)
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a speech recognition method, the method comprising:
  • obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, the target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 2 provides the method of Example 1, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • the speech recognition sub-model and the speech data to be recognized obtain the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized;
  • a target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • Example 4 provides the method of Example 3, wherein determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words includes:
  • the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words is determined as the initial weight corresponding to the hot word;
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • Example 5 provides the method of Example 4, wherein determining the context feature vector corresponding to the predicted character further includes:
  • the text feature vector of the hot word is weighted and calculated according to the target weight corresponding to each of the hot words, and the context feature vector is obtained, including:
  • the text feature vectors of the hot words are weighted and calculated according to the updated target weights corresponding to each of the hot words to obtain the context feature vector.
  • Example 6 provides the method of Example 2, wherein obtaining the context probability distribution of each of the predicted characters according to the context feature decoder and the context feature vector includes:
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • Example 7 provides the method of any one of Examples 1-6, which determines the phonetic sequence, text sequence and training label of the training word in the following manner:
  • texts other than the training words in the training labeled text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • Example 8 provides a speech recognition device, the device comprising:
  • a receiving module configured to receive voice data to be recognized
  • a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein, the hot word information includes a plurality of text sequences corresponding to hot words and phonetic symbol sequence; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 9 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the method described in any one of Examples 1-7 are implemented.
  • Example 10 provides an electronic device, comprising:
  • a processing device configured to execute the computer program in the storage device, so as to implement the steps of the method in any one of examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method and apparatus, and a computer-readable medium and a device. The method comprises: receiving speech data to be recognized (11); and according to said speech data, hot word information and a speech recognition model, obtaining target text corresponding to said speech data (12), wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words, the speech recognition model comprises a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained on the basis of a training word and a phonetic symbol sequence, a text sequence and a training label of the training word. Therefore, when a context recognition sub-model is trained, training is performed in view of a pronunciation feature and a text feature of training data, such that hot words which are similar in spelling or pronunciation can be accurately distinguished on the basis of the pronunciation feature, thereby avoiding confused recognition of the hot words when the hot words are recognized, and further improving the accuracy of speech recognition.

Description

Speech recognition method and apparatus, and medium and device
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202110735672.7, titled "Speech recognition method, apparatus, medium and device" and filed on June 30, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a speech recognition method, apparatus, medium and device.
Background
With the rise of deep learning, end-to-end modeling methods that rely entirely on neural networks have emerged and gradually become the mainstream of automatic speech recognition (ASR) technology. Automatic speech recognition converts raw speech data directly into the corresponding text. In the related art, prior contextual knowledge based on hot words is commonly used to improve recognition accuracy. However, when hot-word prior contextual knowledge is used in the related art, hot words with similar spelling or pronunciation are easily confused with one another, so the accuracy of speech recognition is insufficient.
Summary
This Summary is provided to introduce, in simplified form, concepts that are described in detail later in the Detailed Description. It is not intended to identify key features or essential features of the claimed technical solution, nor is it intended to be used to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a speech recognition method, the method comprising:
receiving speech data to be recognized;
obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences and training labels of the training words.
Optionally, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
the obtaining of the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model includes:
encoding the phonetic symbol sequence of each hot word with the pronunciation feature encoder to obtain a pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain a text feature vector of the hot word;
obtaining, according to the speech recognition sub-model and the speech data to be recognized, a character acoustic vector and a text probability distribution of each predicted character corresponding to the speech data to be recognized;
obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector;
obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
determining the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
Optionally, the obtaining of the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector includes:
for each hot word, determining a fusion feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word.
Optionally, the determining of the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word includes:
determining the dot product of the character acoustic vector and the fusion feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
normalizing the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word;
computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.
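The three steps above (dot product, normalization, weighted sum) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `context_vector` is hypothetical, vectors are plain Python lists of equal dimension, and softmax is assumed as the normalization because the text only says the initial weights are "normalized".

```python
import math


def context_vector(char_acoustic, fused_vectors, text_vectors):
    """Return (target_weights, context_feature_vector) for one predicted character."""
    # Initial weight per hot word: dot product of the character acoustic
    # vector with that hot word's fusion feature vector.
    scores = [sum(a * f for a, f in zip(char_acoustic, fused))
              for fused in fused_vectors]
    # Normalization (softmax is an assumption; shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context feature vector: weighted sum of the hot words' text feature vectors.
    dim = len(text_vectors[0])
    context = [sum(w * t[i] for w, t in zip(weights, text_vectors))
               for i in range(dim)]
    return weights, context
```

Note that the weights are computed from the fusion feature vectors (pronunciation plus text), while the weighted sum is taken over the text feature vectors only, as the steps describe.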
Optionally, the determining of the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word further includes:
sorting the target weights in descending order and updating to zero every target weight ranked after the M-th position, M being a positive integer;
the computing of a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector includes:
computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.
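The top-M pruning step can be illustrated as follows. The helper name `keep_top_m` is hypothetical, and the sketch does not renormalize the surviving weights, since the text does not mention renormalization.

```python
def keep_top_m(target_weights, m):
    """Zero every target weight ranked after the m-th largest."""
    # Indices of the hot words, ordered by target weight from large to small.
    order = sorted(range(len(target_weights)),
                   key=lambda i: target_weights[i], reverse=True)
    keep = set(order[:m])
    # Weights outside the top m are updated to zero; the rest are unchanged.
    return [w if i in keep else 0.0 for i, w in enumerate(target_weights)]
```

For example, `keep_top_m([0.1, 0.5, 0.3, 0.1], 2)` keeps only the weights 0.5 and 0.3, so hot words with low attention weight contribute nothing to the context feature vector.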
Optionally, the obtaining of the context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes:
for each predicted character, obtaining a target feature vector of the predicted character according to the acoustic character vector of the predicted character and the context feature vector;
decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
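The passage above does not state how the two vectors are combined or what the decoder looks like, so the following is only an illustrative sketch under loud assumptions: the target feature vector is formed by concatenation, and the context feature decoder is stood in for by a single linear layer followed by softmax. The function name and the `weight_rows` and `bias` parameters are hypothetical.

```python
import math


def decode_context_distribution(char_acoustic, context_vec, weight_rows, bias):
    """Map one predicted character's vectors to a context probability distribution."""
    # Target feature vector (assumed here: concatenation of the two vectors).
    target = list(char_acoustic) + list(context_vec)
    # Stand-in decoder (assumed): one linear layer, one output row per token.
    logits = [sum(w * x for w, x in zip(row, target)) + b
              for row, b in zip(weight_rows, bias)]
    # Softmax turns the logits into a probability distribution.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

A real decoder would typically be deeper; the point of the sketch is only the data flow from the per-character vectors to a per-character distribution.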
Optionally, the phonetic symbol sequence, text sequence and training label of the training words are determined as follows:
determining the training words from the training annotated text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotated text;
for each training word, replacing the text other than the training word in the training annotated text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.
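The label-generation step can be sketched as below. The function name, the token-level granularity and the placeholder string `<no-bias>` are assumptions made for illustration; the text only says that everything outside the training word is replaced with a preset label.

```python
def make_training_label(tokens, start, length, placeholder="<no-bias>"):
    """Keep the training word's tokens; replace all other tokens with a preset label.

    tokens: the training annotated text, already split into tokens (assumed).
    start, length: position and size of the training word inside tokens.
    """
    end = start + length
    return [tok if start <= i < end else placeholder
            for i, tok in enumerate(tokens)]
```

For the annotated text `["play", "hot", "word", "now"]` with the training word spanning positions 1 and 2, the resulting label keeps "hot" and "word" and replaces the other two tokens with the placeholder.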
In a second aspect, the present disclosure provides a speech recognition apparatus, the apparatus comprising:
a receiving module, configured to receive speech data to be recognized;
a processing module, configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences and training labels of the training words.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in the first aspect are implemented.
In a fourth aspect, an electronic device is provided, including:
a storage device on which a computer program is stored;
a processing device configured to execute the computer program in the storage device to implement the steps of any one of the methods described in the first aspect.
Thus, in the above technical solution, the speech recognition model used to recognize the speech data to be recognized may include a speech recognition sub-model and a context recognition sub-model. Speech recognition can be performed based on the speech recognition sub-model, while the context recognition sub-model improves the accuracy with which hot words in the speech data are recognized, thereby improving the accuracy of speech recognition as a whole. Moreover, the context recognition sub-model is trained on both the pronunciation features and the text features of the training data, so hot words with similar spelling or pronunciation can be accurately distinguished on the basis of the pronunciation features. When hot words are recognized, these multiple features can therefore be combined to pick the correct one out of multiple hot words, avoiding confusion between hot words with similar spelling or pronunciation, further improving the accuracy of speech recognition and improving the user experience.
Other features and advantages of the present disclosure will be described in detail in the Detailed Description that follows.
Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a speech recognition model provided according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of an exemplary implementation of obtaining the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model;
FIG. 4 is a block diagram of a speech recognition apparatus provided according to an embodiment of the present disclosure;
FIG. 5 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.
具体实施方式detailed description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method implementations of the present disclosure may be performed in a different order and/or in parallel. In addition, the method implementations may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms are given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or the interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "a" / "one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the implementations of the present disclosure are used for illustrative purposes only and are not intended to limit the scope of these messages or information.
FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
In step 11, speech data to be recognized is received.
In step 12, target text corresponding to the speech data to be recognized is obtained according to the speech data to be recognized, hot word information, and a speech recognition model.
The hot word information contains text sequences and phonetic symbol sequences corresponding to a plurality of hot words. The hot word information may consist of hot words corresponding to a specific application context, so as to provide prior contextual knowledge for the recognition of the speech data to be recognized. The phonetic symbol sequence represents the pronunciation of a hot word: when the hot word is Chinese, its phonetic symbol sequence contains tonal pinyin; when the hot word is English, its phonetic symbol sequence contains British or American phonetic symbols, and so on. The text sequence of a hot word may be the text of the hot word itself.
The speech recognition model includes a speech recognition sub-model and a context recognition sub-model. The speech recognition sub-model is used to perform speech information recognition on the speech data to be recognized, and the context recognition sub-model is used to perform context information recognition on the speech data to be recognized, that is, to identify hot word features contained in the speech data to be recognized.
Specifically, the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences, and training labels of the training words. The training of the context recognition sub-model can therefore combine the pronunciation features and the text features of the training words, which makes the training features of the context recognition sub-model more comprehensive and richer and provides accurate and comprehensive data support for the subsequent hot word determination.
For example, suppose the hot word information contains the hot words "***" and "章芝果", and the real text corresponding to the speech data to be recognized is "章芝果说她最近要结婚了" ("章芝果 said she is getting married soon"). In the related art, because the hot word information contains hot words with similar pronunciations, speech recognition may output the text "***说她最近要结婚了", that is, the hot words are confused. With the above technical solution, when speech recognition is performed based on hot words, hot word recognition can be enhanced using the pronunciation features of the hot words. For example, the phonetic symbol sequence corresponding to "***" is "zhang1zhi4guo2" and the phonetic symbol sequence corresponding to "章芝果" is "zhang1zhi1guo3", so hot words with similar pronunciations can be distinguished and the recognition result "章芝果说她最近要结婚了" is obtained, which improves the accuracy of speech recognition.
Thus, in the above technical solution, the speech recognition model used to recognize the speech data to be recognized may include a speech recognition sub-model and a context recognition sub-model. During speech recognition, recognition can be performed based on the speech recognition sub-model, while the context recognition sub-model improves the accuracy of recognizing hot words in the speech data to be recognized, thereby improving the accuracy of speech recognition. Moreover, the context recognition sub-model is trained on both the pronunciation features and the text features of the training data, so hot words with similar spelling or pronunciation can be accurately distinguished based on the pronunciation features. When hot words are recognized, these multiple features can be combined to identify the correct hot word among many, which avoids confusing hot words with similar spelling or pronunciation, further improves the accuracy of speech recognition, and improves the user experience.
In a possible embodiment, as shown in FIG. 2, the speech recognition model 10 may include a speech recognition sub-model 100 and a context recognition sub-model 200, and the context recognition sub-model 200 includes a pronunciation feature encoder 201, a text feature encoder 202, an attention module 203, and a context feature decoder 204. Correspondingly, an exemplary implementation of step 12, obtaining the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information, and the speech recognition model, is as follows. As shown in FIG. 3, this step may include:
In step 31, the phonetic symbol sequence of each hot word is encoded by the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and the text sequence of the hot word is encoded by the text feature encoder to obtain the text feature vector of the hot word.
In step 32, the character acoustic vector and the text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained according to the speech recognition sub-model and the speech data to be recognized.
For example, as shown in FIG. 2, the speech recognition sub-model may further include an encoder 101, a prediction sub-model 102, and a decoder 103, where the prediction sub-model may be a CIF (Continuous Integrate-and-Fire) model.
In general, each second of speech data can be divided into multiple audio frames so that data processing is performed frame by frame; for example, each second of speech data may be divided into 100 audio frames. Correspondingly, the encoder encodes the audio frames of the speech data to be recognized, and the resulting acoustic vector sequence H can be expressed as:
H:{H1,H2,…,HU}, where U denotes the number of audio frames in the input speech data to be recognized, that is, the length of the acoustic vector sequence.
Afterwards, the character acoustic vectors corresponding to the speech data to be recognized can be obtained according to the acoustic vectors and the prediction sub-model.
For example, the acoustic vectors may be input into the prediction sub-model, which predicts an information amount for each acoustic vector and thereby obtains the information amount corresponding to each audio frame. The acoustic vectors of the audio frames can then be merged according to the information amounts of multiple audio frames to obtain the character acoustic vectors.
In the embodiments of the present disclosure, each predicted character is assumed by default to carry the same amount of information. The information amounts of the audio frames can therefore be accumulated from left to right; when the accumulated amount reaches a preset threshold, the audio frames covered by the accumulation are considered to form one predicted character, and one predicted character corresponds to one or more audio frames. The preset threshold may be set according to the actual application scenario and experience; for example, it may be set to 1, which is not limited in the present disclosure.
In a possible embodiment, the acoustic vectors of the audio frames may be merged according to the information amounts of multiple audio frames in the following manner:
In the order of the information amounts, the information amount Wi of an audio frame i is obtained in turn;
If Wi is less than the preset threshold β, the next audio frame is taken as the current audio frame, that is, i=i+1, and the information amounts of the traversed audio frames are accumulated. If the accumulated sum is greater than the preset threshold, a character boundary is considered to have occurred, that is, part of the currently traversed audio frame belongs to the current predicted character and the other part belongs to the next predicted character.
For example, if W1+W2 is greater than β, a character boundary is considered to have occurred: the first audio frame and part of the second audio frame correspond to one predicted character, and the boundary of that predicted character lies within the second audio frame. The information amount of the second audio frame can then be split into two parts, one part belonging to the current predicted character and the remaining part belonging to the next predicted character.
Correspondingly, within the information amount W2 of the second audio frame, the information amount W21 belonging to the current predicted character can be expressed as W21=β-W1, and the information amount W22 belonging to the next predicted character can be expressed as W22=W2-W21.
The traversal of the information amounts then continues, with accumulation resuming from the remaining part of the second audio frame: the information amount W22 of the second audio frame and the information amount W3 of the third audio frame are accumulated, and so on until the preset threshold β is reached, which yields the audio frames corresponding to the next predicted character. The information amounts of subsequent audio frames are merged in the same manner to obtain each predicted character corresponding to the multiple audio frames.
On this basis, after the correspondence between predicted characters and audio frames in the speech data has been determined, for each predicted character, the weighted sum of the acoustic vectors of the audio frames corresponding to that predicted character can be determined as the character acoustic vector of the predicted character. The weight of the acoustic vector of each audio frame is the information amount that the audio frame contributes to the predicted character. If the audio frame belongs entirely to the predicted character, the weight of its acoustic vector is the information amount of the audio frame; if only part of the audio frame belongs to the predicted character, the weight of its acoustic vector is the information amount of that part.
Continuing the example above, the first predicted character covers the first audio frame and part of the second audio frame, so its character acoustic vector C1 can be expressed as:
C1=W1*H1+W21*H2;
Likewise, the second predicted character covers the remaining part of the second audio frame and the third audio frame, so its character acoustic vector C2 can be expressed as:
C2=W22*H2+W3*H3.
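By way of illustration only, the accumulate-and-split merging described above can be sketched as follows. The function name `cif_integrate`, the use of plain Python lists, the choice to fire when the accumulated amount reaches β (rather than strictly exceeds it), and the discarding of any trailing amount below β are assumptions of this sketch and are not specified by the present disclosure.

```python
def cif_integrate(weights, frames, beta=1.0):
    """Merge per-frame acoustic vectors into per-character acoustic vectors.

    weights: information amount W_i predicted for each audio frame
    frames:  acoustic vector H_i of each audio frame (lists of equal length)
    beta:    preset threshold (1 in the example of the description)
    """
    dim = len(frames[0])
    chars = []                 # character acoustic vectors C_1, C_2, ...
    acc_w = 0.0                # information amount accumulated so far
    acc_vec = [0.0] * dim      # weighted sum accumulated so far
    for w, h in zip(weights, frames):
        remaining = w
        # A frame crosses a character boundary when the accumulated
        # amount reaches beta; split the frame's weight at the boundary.
        while acc_w + remaining >= beta:
            used = beta - acc_w            # e.g. W21 = beta - W1
            chars.append([a + used * x for a, x in zip(acc_vec, h)])
            remaining -= used              # e.g. W22 = W2 - W21
            acc_w = 0.0
            acc_vec = [0.0] * dim
        # The rest of the frame's weight goes to the next character.
        acc_w += remaining
        acc_vec = [a + remaining * x for a, x in zip(acc_vec, h)]
    # Assumption: a trailing amount below beta does not emit a character.
    return chars


if __name__ == "__main__":
    W = [0.5, 0.75, 0.75]
    H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # C1 = 0.5*H1 + 0.5*H2, C2 = 0.25*H2 + 0.75*H3
    print(cif_integrate(W, H))
```

With β = 1, W1 = 0.5 and W2 = 0.75, the second frame is split into W21 = 0.5 and W22 = 0.25, matching the expressions for C1 and C2 above.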
Afterwards, the character acoustic vector of each predicted character can be decoded by the decoder to obtain the text probability distribution of the predicted character.
Returning to FIG. 3, in step 33, the context feature vector of each predicted character is obtained according to the attention module, the pronunciation feature vectors, the text feature vectors, and the character acoustic vectors.
In this step, when the context feature vector of a predicted character is determined, the pronunciation features and text features of each hot word are combined, so that multiple features of each hot word are considered together, which improves the richness and accuracy of the features in the context feature vector. In addition, combining the pronunciation feature vectors, the text feature vectors, and the character acoustic vectors in the attention module ensures that the attention computation reflects how well each hot word matches the speech data. The specific calculation is described below.
In step 34, the context probability distribution of each predicted character is obtained according to the context feature decoder and the context feature vector.
That is, the context feature vector of each predicted character can be decoded by the context feature decoder to obtain the context probability distribution of the predicted character.
In step 35, the target text corresponding to the speech data to be recognized is determined according to the text probability distribution and the context probability distribution.
As an example, for each predicted character, its text probability distribution and its context probability distribution may be combined by weighted summation to obtain a combined probability distribution for the predicted character. Based on the combined probability distribution, the recognized character corresponding to each predicted character can then be determined by a greedy search algorithm or a beam search algorithm to obtain the target text. These search algorithms are commonly used in the art and are not described in detail here.
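The weighted summation followed by greedy search can be sketched as below. The interpolation weight `lam`, the convex form `(1 - lam) * text + lam * context`, and the assumption that both distributions are defined over the same vocabulary are illustrative choices of this sketch; the present disclosure does not specify them.

```python
def combine_and_greedy(text_probs, context_probs, vocab, lam=0.5):
    """Combine two per-character distributions and decode greedily.

    text_probs, context_probs: one probability list per predicted character,
                               each indexed by position in `vocab`
    lam: hypothetical weight given to the context probability distribution
    """
    chars = []
    for p_text, p_ctx in zip(text_probs, context_probs):
        # Weighted sum of the text and context probability distributions.
        combined = [(1.0 - lam) * a + lam * b for a, b in zip(p_text, p_ctx)]
        # Greedy search: take the most probable entry at each position.
        best = max(range(len(combined)), key=combined.__getitem__)
        chars.append(vocab[best])
    return "".join(chars)


if __name__ == "__main__":
    vocab = ["a", "b", "c"]
    text = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
    ctx = [[0.0, 1.0, 0.0], [0.25, 0.5, 0.25]]
    print(combine_and_greedy(text, ctx, vocab, lam=0.5))
    print(combine_and_greedy(text, ctx, vocab, lam=0.0))
```

In the toy input, the context distribution changes the first character from "a" to "b", which mirrors how a hot word can override an acoustically plausible but incorrect character.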
Thus, through the above technical solution, hot word enhanced recognition can be performed during the recognition of each predicted character in the speech data to be recognized, which improves the granularity and accuracy of speech recognition and can also improve, to a certain extent, the real-time performance of speech recognition and the user experience.
In a possible embodiment, an exemplary implementation of obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vectors, the text feature vectors, and the character acoustic vectors is as follows. This step may include:
For each hot word, determining a fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word.
The fused feature vector may be obtained by concatenating the pronunciation feature vector and the text feature vector.
For each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.
In the attention module, the character acoustic vector and the fused feature vectors can be used to determine the degree of attention that the current predicted character pays to each hot word, which provides data support for the subsequent recognition and determination of hot words.
In a possible embodiment, an exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word is as follows. This step may include:
Determining the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as the initial weight corresponding to that hot word.
As an example, for a character acoustic vector Ci, the dot product of Ci with each of the fused feature vectors T1 to Tn of the n hot words may be computed: the dot product Q1 of Ci and T1 is determined as the initial weight of T1, the dot product Q2 of Ci and T2 is determined as the initial weight of T2, and so on, so as to determine the initial weight of each hot word. Specifically, when the initial weight of each hot word is computed, the dot-product attention between Ci and the fused feature vector may be computed with multi-head attention, and the average of the resulting dot-product attention values is used as the overall weight of the fused feature vector, that is, the initial weight of the hot word corresponding to that fused feature vector. Correspondingly, the initial weight can be used to characterize the degree of attention paid to each hot word in the character acoustic vector of the predicted character.
Normalizing the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word.
For example, to measure the degree of attention paid to each hot word more accurately, the initial weights Q1 to Qn of the hot words may be normalized, for instance by applying softmax to Q1 to Qn. This maps the weight of every hot word onto a common scale, which makes the target weights of the hot words easy to compare and allows the hot word that the predicted character most likely corresponds to be determined.
Afterwards, computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.
In this embodiment, the products of each hot word's target weight and its text feature vector can be accumulated to obtain the context feature vector, so that text feature vectors with higher target weights are represented more distinctly in the context feature vector.
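The three operations above (dot product, softmax normalization, weighted sum of text feature vectors) can be sketched as follows. This is a simplified single-head version without learned projections; it assumes, purely for illustration, that the character acoustic vector and the fused feature vectors already have the same dimension. The function name `context_vector` is hypothetical.

```python
import math


def context_vector(char_vec, fused_vecs, text_vecs):
    """Return (target_weights, context_feature_vector) for one predicted character.

    char_vec:   character acoustic vector C_i
    fused_vecs: fused feature vectors T_1..T_n, one per hot word
    text_vecs:  text feature vectors, one per hot word
    """
    # Initial weights Q_1..Q_n: dot product of C_i with each fused vector.
    scores = [sum(c * t for c, t in zip(char_vec, fused)) for fused in fused_vecs]
    # Target weights: softmax over the initial weights
    # (the maximum is subtracted for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context feature vector: weighted sum of the text feature vectors.
    dim = len(text_vecs[0])
    ctx = [sum(w * tv[d] for w, tv in zip(weights, text_vecs)) for d in range(dim)]
    return weights, ctx


if __name__ == "__main__":
    w, ctx = context_vector(
        [1.0, 0.0],
        [[2.0, 0.0], [2.0, 5.0]],
        [[1.0, 0.0], [0.0, 1.0]],
    )
    print(w, ctx)
```

Note that the attention scores are computed against the fused (pronunciation plus text) vectors, while the weighted sum is taken over the text feature vectors only, as the description states.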
Thus, through the above technical solution, the target weight of each hot word can be determined based on a fused feature vector containing both the pronunciation feature vector and the text feature vector, which enriches the features provided and makes the determined target weights more accurate. This improves the accuracy of the context feature vector, improves to a certain extent the accuracy with which the context recognition sub-model recognizes the input hot words, makes it possible to distinguish hot words with similar spelling or pronunciation, and ensures the accuracy of speech recognition.
In a possible embodiment, another exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word further includes, on the basis of the previous embodiment:
Updating to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, where M is a positive integer. M can be set according to the actual usage scenario; for example, M may be set to 20.
The target weight represents the degree of attention that the current predicted character pays to each hot word. When the target weight of a hot word is small and ranks low, the predicted character is unlikely to correspond to that hot word, so the target weight of that hot word can be set directly to 0 and the hot word determination can focus on the more likely hot words.
Specifically, when the target weights are updated, they are sorted in descending order, the top M target weights are retained, and the target weights ranked after M are set to zero.
Correspondingly, an exemplary implementation of computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector may include:
Computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.
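The top-M update of the target weights can be sketched as follows. The function name `keep_top_m` is hypothetical, and how ties at the M-th position are resolved is an assumption of this sketch (earlier hot words win). The retained weights are not renormalized here, since the description does not mention renormalization.

```python
def keep_top_m(weights, m):
    """Keep the m largest target weights and set all others to zero."""
    # Indices of the hot words sorted by target weight, largest first.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(order[:m])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


if __name__ == "__main__":
    print(keep_top_m([0.1, 0.5, 0.3, 0.1], m=2))
```

Hot words whose updated target weight is zero contribute nothing to the weighted sum, so their text feature vectors can be skipped when the context feature vector is computed.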
Thus, through the above technical solution, hot words with low likelihood can be excluded directly based on the target weights, which reduces the amount of computation to a certain extent and improves the efficiency of hot word enhanced recognition while preserving its accuracy.
In a possible embodiment, an exemplary implementation of obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector is as follows. This step may include:
For each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector.
As an example, the character acoustic vector of the predicted character and the context feature vector may be concatenated to obtain the target feature vector.
Decoding the target feature vector by the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
Thus, through the above technical solution, the context feature decoder decodes a target feature vector that contains both the audio features of the input speech data and the relevant features of the hot words. This improves how well the context probability distribution matches the speech data to be recognized and the hot words, provides accurate and comprehensive data support for the subsequent determination of the target text, and further increases the diversity of features used in speech recognition, thereby improving the accuracy of the speech recognition result.
In a possible embodiment, the phonetic symbol sequence, text sequence, and training label of each training word are determined in the following manner:
Determining the training words from the training annotation text of each training sample.
As an example, for the training annotation text of each training sample, one N-gram word may be randomly extracted from the text and used as a candidate word; some of the candidate words may then be randomly selected as the training words. As another example, candidate texts may be determined from the training annotation texts of the training samples, and then, for each candidate text, one N-gram word may be randomly extracted from the candidate text as a training word. This ensures the diversity and randomness of the training words.
For each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotation text.
The training word extracted from the training annotation text may itself be used directly as the text sequence. The type of phonetic symbol sequence for each language may be set in advance, for example a pinyin sequence for Chinese and a British phonetic symbol sequence for English. As an example, the phonetic symbol sequence may be determined by looking it up in an electronic dictionary. For a Chinese word, each character of the word may be looked up in a Chinese dictionary to obtain its tonal pinyin, and the tonal pinyin of the characters is then concatenated to obtain the phonetic symbol sequence of the word. Alternatively, the word itself may be looked up to obtain its phonetic symbol sequence directly; for example, the phonetic symbol sequence of the training word "凸优化理论" ("convex optimization theory") is "tu1you1hua4li3lun4".
For each training word, the text other than the training word in the training labeled text corresponding to the training word is replaced with a preset label, so as to generate the training label corresponding to the training word.
For example, suppose the training labeled text is "凸优化理论是一门重要课程" (convex optimization theory is an important course) and the training word extracted from it is "凸优化理论" (convex optimization theory). When determining the training label corresponding to this training word, the text other than the training word is replaced with the preset label; that is, "是一门重要课程" (is an important course) is replaced, giving the training label "凸优化理论*******". The preset label has no actual meaning; it indicates that the text at that position has no corresponding prior contextual knowledge.
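The label construction in this example can be sketched as below. The sketch assumes one placeholder per replaced character, which matches the seven asterisks in the example, and the function name `make_training_label` is illustrative.

```python
def make_training_label(label_text, training_word, placeholder="*"):
    """Keep the training word and replace every other character with the placeholder."""
    start = label_text.find(training_word)
    if start < 0:
        raise ValueError("training word does not occur in the labeled text")
    end = start + len(training_word)
    return (placeholder * start
            + training_word
            + placeholder * (len(label_text) - end))


print(make_training_label("凸优化理论是一门重要课程", "凸优化理论"))  # 凸优化理论*******
```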
Thus, with the above technical solution, the training samples can be processed automatically to obtain the training data used to train the context recognition sub-model. When this training data is obtained, the text features and the pronunciation features of the training words can be extracted at the same time, so that each word is identified more precisely and words with similar spelling or pronunciation can be distinguished. This provides more comprehensive and reliable feature information for training the context recognition sub-model and improves, to a certain extent, the accuracy of the trained context recognition sub-model.
In a possible embodiment, the speech recognition model may be trained in the following manner:
Once training of the speech recognition sub-model is complete, the phonetic symbol sequence, the text sequence, and the training label of each training word in the training samples are determined according to the training samples.
In the embodiments of the present disclosure, the speech recognition sub-model and the context recognition sub-model may be trained separately. The speech recognition sub-model may be trained with the training speech data as input and the corresponding training labeled text as the target output. The training may use methods commonly used in the art, which are not repeated here.
Once training of the speech recognition sub-model is complete, the context recognition sub-model can be trained on the basis of the trained speech recognition sub-model. The context recognition sub-model is used to determine the prior contextual knowledge corresponding to hot words. In the embodiments of the present disclosure, the training words can therefore be obtained directly from the training samples and training can be based on the prior contextual knowledge of those training words, which ensures the diversity of the training words and improves the stability and generalization of the model. The phonetic symbol sequence of a training word represents its pronunciation: when the training sample is in Chinese, the phonetic symbol sequence contains tonal pinyin, and when the training sample is in English, the phonetic symbol sequence contains British or American phonetic symbols. The text sequence of a training word may be the recognition text corresponding to the training word. The training label of a training word represents the target output corresponding to the training word.
Afterwards, the phonetic symbol sequence of the training word may be encoded by the pronunciation feature encoder to obtain the pronunciation feature vector of the training word, and the text sequence of the training word may be encoded by the text feature encoder to obtain the text feature vector of the training word;
the character acoustic vector of each predicted character corresponding to the training speech data in the training sample is obtained according to the speech recognition sub-model;
the context feature vector of each predicted character is obtained according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
the context feature vector is decoded by the context feature decoder to obtain the probability distribution of each predicted character, and the output text of the context recognition sub-model is then obtained according to the probability distribution.
These steps are implemented in a manner similar to the processing of hot words and speech data to be recognized described above, and are not repeated here.
Afterwards, the target loss of the context recognition sub-model may be determined according to the output text and the training label corresponding to the training word.
A cross-entropy loss may be calculated from the output text and the training label, and this cross-entropy loss may be determined as the target loss.
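As a rough illustration of the cross-entropy computation, the sketch below averages the negative log-probability that the decoder assigns to each target symbol of the training label. Representing the label as a list of vocabulary indices, and averaging rather than summing, are assumptions made for the example.

```python
import math


def cross_entropy(prob_dists, target_ids, eps=1e-12):
    """Mean negative log-likelihood of the target ids under the predicted distributions."""
    if len(prob_dists) != len(target_ids) or not target_ids:
        raise ValueError("need one non-empty distribution per target id")
    total = 0.0
    for dist, target in zip(prob_dists, target_ids):
        # eps guards against log(0) when the model assigns zero probability.
        total -= math.log(max(dist[target], eps))
    return total / len(target_ids)


# Two predicted characters over a toy three-symbol vocabulary.
loss = cross_entropy([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], [0, 2])
print(round(loss, 4))
```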
When an update condition is satisfied, the model parameters of the context recognition sub-model are updated according to the target loss.
As an example, the update condition may be that the target loss is greater than a preset loss threshold, which indicates that the recognition accuracy of the context recognition sub-model is insufficient. As another example, the update condition may be that the number of iterations is less than a preset iteration threshold, in which case the context recognition sub-model is considered to have been iterated too few times for its recognition accuracy to be sufficient. Accordingly, when the update condition is satisfied, the model parameters of the context recognition sub-model may be updated according to the target loss. The model parameters may be updated from the determined loss using an update method commonly used in the art, such as gradient descent, which is not repeated here.
When the update condition is not satisfied, the recognition accuracy of the context recognition sub-model can be considered to meet the training requirement. The training process can then be stopped to obtain the trained context recognition sub-model, and thereby the trained speech recognition model.
It should be noted that, while the model parameters of the context recognition sub-model are being updated, the model parameters of the trained speech recognition sub-model are kept unchanged. Thus, with the above technical solution, the context recognition sub-model can be added on top of a trained speech recognition sub-model to form the speech recognition model, which improves the extensibility and the range of application of the training method. At the same time, more accurate prior contextual knowledge is provided without compromising the accuracy of the speech recognition sub-model, which improves the recognition accuracy of the speech recognition model.
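The update procedure can be sketched with a toy gradient-descent loop. Everything here is an assumption for illustration: the scalar "model", the learning rate, and the choice to combine both example update conditions (loss above a threshold and iteration count below a limit). The constant `FROZEN_FEATURE` stands in for the output of the trained speech recognition sub-model, which is never modified.

```python
FROZEN_FEATURE = 2.0  # stands in for the frozen speech recognition sub-model output
TARGET = 6.0


def loss_and_grad(params):
    """Squared error of a one-parameter toy context model, with its gradient."""
    error = params[0] * FROZEN_FEATURE - TARGET
    return error ** 2, [2.0 * error * FROZEN_FEATURE]


def train_context_submodel(params, lr=0.1, loss_threshold=1e-3, max_iters=1000):
    """Update only the context parameters while the update condition holds."""
    params = list(params)
    loss = float("inf")
    for _ in range(max_iters):          # condition: iteration count below the limit
        loss, grad = loss_and_grad(params)
        if loss <= loss_threshold:      # condition: loss still above the threshold
            break
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, loss


params, final_loss = train_context_submodel([0.0])
print(params, final_loss)
```

Only `params` changes during the loop; the frozen constant plays no part in the update, mirroring the requirement that the speech recognition sub-model parameters stay fixed.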
The present disclosure further provides a speech recognition apparatus. As shown in FIG. 4, the apparatus 40 includes:
a receiving module 41, configured to receive speech data to be recognized;
a processing module 42, configured to obtain a target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences, and training labels of the training words.
Optionally, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module, and a context feature decoder;
the processing module includes:
an encoding submodule, configured to encode the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and to encode the text sequence of the hot word with the text feature encoder to obtain the text feature vector of the hot word;
a first processing submodule, configured to obtain, according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and the text probability distribution of each predicted character corresponding to the speech data to be recognized;
a second processing submodule, configured to obtain the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
a first decoding submodule, configured to obtain the context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
a first determining submodule, configured to determine the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
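This passage does not state how the text probability distribution and the context probability distribution are combined. Purely as an illustration, the sketch below uses linear interpolation with a hypothetical weight `lam` and picks the most probable character; other fusion rules are equally compatible with the description.

```python
def combine_distributions(text_dist, context_dist, lam=0.5):
    """Interpolate two distributions over the same vocabulary and pick the argmax."""
    if len(text_dist) != len(context_dist):
        raise ValueError("distributions must cover the same vocabulary")
    mixed = [(1.0 - lam) * p + lam * q for p, q in zip(text_dist, context_dist)]
    best = max(range(len(mixed)), key=mixed.__getitem__)
    return best, mixed


# The acoustic model prefers symbol 0, the context model strongly prefers symbol 1.
best, mixed = combine_distributions([0.6, 0.4], [0.1, 0.9], lam=0.5)
print(best, mixed)
```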
Optionally, the second processing submodule includes:
a second determining submodule, configured to determine, for each hot word, the fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
a third determining submodule, configured to determine, for each predicted character and in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.
Optionally, the third determining submodule includes:
a fourth determining submodule, configured to determine the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
a third processing submodule, configured to normalize the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word;
a calculation submodule, configured to compute a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word, to obtain the context feature vector.
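The three steps above (dot product, normalization, weighted sum) can be sketched in plain Python. Two points are assumptions: softmax is used as the normalization, since the text only says the weights are normalized, and the fused feature vectors are taken as given because the fusion operation is not specified in this passage.

```python
import math


def context_vector(acoustic, fused_vectors, text_vectors):
    """Attend over hot words: score by dot product, normalize, sum the text vectors."""
    # Initial weight of each hot word: dot product with its fused feature vector.
    scores = [sum(a * f for a, f in zip(acoustic, fused)) for fused in fused_vectors]
    # Softmax normalization (an assumed choice), shifted by the max for stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context feature vector: weighted sum of the hot words' text feature vectors.
    dim = len(text_vectors[0])
    context = [sum(w * tv[d] for w, tv in zip(weights, text_vectors)) for d in range(dim)]
    return weights, context


acoustic = [1.0, 0.0]
fused = [[1.0, 0.0], [1.0, 5.0]]   # two hot words, equal dot products with `acoustic`
texts = [[2.0], [4.0]]
weights, context = context_vector(acoustic, fused, texts)
print(weights, context)
```

Note that the fused vectors act as attention keys while the text feature vectors act as values, which is how the description separates the two roles.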
Optionally, the third determining submodule further includes:
an updating submodule, configured to sort the target weights in descending order and update every target weight ranked after the M-th to zero, where M is a positive integer;
the calculation submodule is configured to:
compute a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word, to obtain the context feature vector.
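The weight update can be sketched as a top-M filter. The function name `keep_top_m` is illustrative, and the sketch does not renormalize the surviving weights because the text does not call for it.

```python
def keep_top_m(weights, m):
    """Zero every weight that is not among the M largest."""
    if m <= 0:
        raise ValueError("M must be a positive integer")
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    keep = set(ranked[:m])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


print(keep_top_m([0.1, 0.6, 0.3], 2))  # [0.0, 0.6, 0.3]
```

Zeroing the low-ranked weights keeps weakly matching hot words from contributing to the context feature vector when the hot word list is large.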
Optionally, the first decoding submodule includes:
a fourth processing submodule, configured to obtain, for each predicted character, the target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
a second decoding submodule, configured to decode the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
Optionally, the phonetic symbol sequence, the text sequence, and the training label of the training word are determined in the following manner:
determining the training words from the training labeled text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training labeled text;
for each training word, replacing the text other than the training word in the training labeled text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.
Reference is now made to FIG. 5, which shows a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), and a vehicle-mounted terminal (for example, a vehicle-mounted navigation terminal), as well as fixed terminals such as a digital TV and a desktop computer. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 600 may include a processing device 601 (for example, a central processing unit or a graphics processing unit), which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the electronic device 600. The processing device 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, or a gyroscope; an output device 607 including, for example, a liquid crystal display (LCD), a speaker, or a vibrator; a storage device 608 including, for example, a magnetic tape or a hard disk; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 5 shows the electronic device 600 with various devices, it should be understood that not all of the illustrated devices need to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 609, installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted over any appropriate medium, including but not limited to a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or it may exist separately without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: receive speech data to be recognized; and obtain a target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences, and training labels of the training words.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not limit the module itself; for example, the receiving module may also be described as "a module that receives speech data to be recognized".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, Example 1 provides a speech recognition method, the method including:
receiving speech data to be recognized;
obtaining a target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and on the phonetic symbol sequences, text sequences, and training labels of the training words.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module, and a context feature decoder;
the obtaining a target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model includes:
encoding the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain the text feature vector of the hot word;
obtaining, according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and the text probability distribution of each predicted character corresponding to the speech data to be recognized;
obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
determining the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector includes:
for each hot word, determining a fusion feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word includes:
determining the dot product of the character acoustic vector and the fusion feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
normalizing the initial weight corresponding to each hot word to obtain a target weight corresponding to each hot word;
computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word, to obtain the context feature vector.
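The three steps of Example 4 (dot product, normalization, weighted sum) can be sketched as follows. The text says only that the initial weights are normalized; softmax is assumed here as one common choice, and all function and variable names are illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Assumed normalization; the disclosure only says 'normalization'."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(char_acoustic, fused_vectors, text_vectors):
    """Return (target_weights, context_feature_vector) for one predicted character."""
    # Step 1: initial weight per hot word = acoustic vector . fusion feature vector.
    initial = [dot(char_acoustic, f) for f in fused_vectors]
    # Step 2: normalize the initial weights into target weights.
    target = softmax(initial)
    # Step 3: weighted sum of the hot words' text feature vectors.
    dim = len(text_vectors[0])
    ctx = [sum(w * t[i] for w, t in zip(target, text_vectors)) for i in range(dim)]
    return target, ctx


# Two toy hot words; the first fusion vector aligns with the acoustic vector.
weights, ctx = context_vector(
    char_acoustic=[1.0, 0.0],
    fused_vectors=[[2.0, 0.0], [0.0, 2.0]],
    text_vectors=[[1.0, 0.0], [0.0, 1.0]],
)
print(weights, ctx)
```

Note that the keys of this attention are the fusion feature vectors while the values are the text feature vectors, so pronunciation information influences which hot word is attended to but not the content of the resulting context vector.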
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, wherein determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word further includes:
updating to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, where M is a positive integer;
and computing the weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word, to obtain the context feature vector, includes:
computing the weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word, to obtain the context feature vector.
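The top-M truncation of Example 5 can be illustrated with a short sketch. The helper name `keep_top_m` is hypothetical, and the sketch does not renormalize the surviving weights because the text does not mention renormalization.

```python
def keep_top_m(target_weights, m):
    """Zero every target weight ranked after the m-th largest.

    Ties are broken by original position (Python's sort is stable).
    """
    if m <= 0:
        raise ValueError("M must be a positive integer")
    ranked = sorted(range(len(target_weights)),
                    key=lambda i: target_weights[i], reverse=True)
    keep = set(ranked[:m])
    return [w if i in keep else 0.0 for i, w in enumerate(target_weights)]


print(keep_top_m([0.1, 0.5, 0.15, 0.25], 2))  # [0.0, 0.5, 0.0, 0.25]
```

The updated weights would then replace the original target weights in the weighted sum over the text feature vectors, so that only the M most relevant hot words contribute to the context feature vector.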
According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 2, wherein obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes:
for each predicted character, obtaining a target feature vector of the predicted character according to the acoustic character vector of the predicted character and the context feature vector;
decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
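Example 6 does not say how the target feature vector is formed or what the decoder looks like. The sketch below assumes, purely for illustration, that the two vectors are concatenated and that the decoder is a single linear layer followed by softmax; `weight_rows` and `bias` are hypothetical parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_feature(char_acoustic, context_vec):
    """Assumed combination: concatenate acoustic and context feature vectors."""
    return list(char_acoustic) + list(context_vec)


def decode_context(target_vec, weight_rows, bias):
    """Assumed decoder: one linear layer (one weight row per output symbol) + softmax."""
    logits = [sum(w * x for w, x in zip(row, target_vec)) + b
              for row, b in zip(weight_rows, bias)]
    return softmax(logits)


vec = target_feature([1.0, 2.0], [3.0])
probs = decode_context(vec, [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]], [0.0, 0.0])
print(vec, probs)
```

A real context feature decoder would likely be deeper and trained jointly with the encoders and the attention module; the single layer here only shows the data flow from target feature vector to context probability distribution.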
According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1-6, wherein the phonetic symbol sequence, the text sequence, and the training label of the training word are determined in the following manner:
determining the training words from the training annotation text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotation text;
for each training word, replacing the text other than the training word in the training annotation text corresponding to the training word with a preset label, to generate the training label corresponding to the training word.
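The label construction in the last step can be sketched on a tokenized annotation. The sketch assumes one preset label per replaced token; the text does not say whether replacement is per token or per span, and the label string `<nb>` and the function name are hypothetical.

```python
def make_training_label(tokens, word_tokens, preset_label="<nb>"):
    """Keep the training word's tokens; replace every other token with the preset label."""
    tokens, word_tokens = list(tokens), list(word_tokens)
    n, k = len(tokens), len(word_tokens)
    if k == 0:
        raise ValueError("training word must not be empty")
    labels = [preset_label] * n
    i = 0
    while i <= n - k:
        if tokens[i:i + k] == word_tokens:
            # Copy the training word through unchanged.
            labels[i:i + k] = word_tokens
            i += k
        else:
            i += 1
    return labels


annotation = "please call Zhang Wei now".split()
print(make_training_label(annotation, ["Zhang", "Wei"]))
# ['<nb>', '<nb>', 'Zhang', 'Wei', '<nb>']
```

For character-level languages such as Chinese, the same function can be called with the annotation and the training word passed as strings, since each string is converted to a list of characters.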
According to one or more embodiments of the present disclosure, Example 8 provides a speech recognition apparatus, the apparatus including:
a receiving module, configured to receive speech data to be recognized;
a processing module, configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.
According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing device, implements the steps of the method of any one of Examples 1-7.
According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device to implement the steps of the method of any one of Examples 1-7.
The above description merely presents preferred embodiments of the present disclosure and explains the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.

Claims (10)

  1. A speech recognition method, characterized in that the method comprises:
    receiving speech data to be recognized;
    obtaining target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information comprises text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model comprises a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.
  2. The method according to claim 1, characterized in that the context recognition sub-model comprises a pronunciation feature encoder, a text feature encoder, an attention module, and a context feature decoder;
    obtaining the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information, and the speech recognition model comprises:
    encoding the phonetic symbol sequence of each hot word with the pronunciation feature encoder to obtain a pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain a text feature vector of the hot word;
    obtaining, according to the speech recognition sub-model and the speech data to be recognized, a character acoustic vector and a text probability distribution of each predicted character corresponding to the speech data to be recognized;
    obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
    obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
    determining the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
  3. The method according to claim 2, characterized in that obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector comprises:
    for each hot word, determining a fusion feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
    for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word.
  4. The method according to claim 3, characterized in that determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word comprises:
    determining the dot product of the character acoustic vector and the fusion feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
    normalizing the initial weight corresponding to each hot word to obtain a target weight corresponding to each hot word;
    computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word, to obtain the context feature vector.
  5. The method according to claim 4, characterized in that determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word further comprises:
    updating to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, where M is a positive integer;
    and computing the weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word, to obtain the context feature vector, comprises:
    computing the weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word, to obtain the context feature vector.
  6. The method according to claim 2, characterized in that obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector comprises:
    for each predicted character, obtaining a target feature vector of the predicted character according to the acoustic character vector of the predicted character and the context feature vector;
    decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
  7. The method according to any one of claims 1-6, characterized in that the phonetic symbol sequence, the text sequence, and the training label of the training word are determined in the following manner:
    determining the training words from the training annotation text of each training sample;
    for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotation text;
    for each training word, replacing the text other than the training word in the training annotation text corresponding to the training word with a preset label, to generate the training label corresponding to the training word.
  8. A speech recognition apparatus, characterized in that the apparatus comprises:
    a receiving module, configured to receive speech data to be recognized;
    a processing module, configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information comprises text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model comprises a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.
  9. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processing device, implements the steps of the method according to any one of claims 1-7.
  10. An electronic device, characterized by comprising:
    a storage device on which a computer program is stored;
    a processing device, configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1-7.
PCT/CN2022/089595 2021-06-30 2022-04-27 Speech recognition method and apparatus, and medium and device WO2023273578A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110735672.7 2021-06-30
CN202110735672.7A CN113470619B (en) 2021-06-30 2021-06-30 Speech recognition method, device, medium and equipment

Publications (1)

Publication Number Publication Date
WO2023273578A1 true WO2023273578A1 (en) 2023-01-05

Family

ID=77876448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089595 WO2023273578A1 (en) 2021-06-30 2022-04-27 Speech recognition method and apparatus, and medium and device

Country Status (2)

Country Link
CN (1) CN113470619B (en)
WO (1) WO2023273578A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN117437909A (en) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Speech recognition model construction method based on hotword feature vector self-attention mechanism

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470619B (en) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, device, medium and equipment
CN114036959A (en) * 2021-11-25 2022-02-11 北京房江湖科技有限公司 Method, apparatus, computer program product and storage medium for determining a context of a conversation
CN115713939B (en) * 2023-01-06 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Voice recognition method and device and electronic equipment
CN116110378B (en) * 2023-04-12 2023-07-18 中国科学院自动化研究所 Model training method, voice recognition device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160023424A (en) * 2014-08-22 2016-03-03 현대자동차주식회사 Apparatus of voice recognition, vehicle and having the same, method of controlling the vehicle
CN110706690A (en) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and device
CN111583909A (en) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 Voice recognition method, device, equipment and storage medium
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment
CN112489646A (en) * 2020-11-18 2021-03-12 北京华宇信息技术有限公司 Speech recognition method and device
CN113470619A (en) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Speech recognition method, apparatus, medium, and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202465B2 (en) * 2011-03-25 2015-12-01 General Motors Llc Speech recognition dependent on text message content
US9305554B2 (en) * 2013-07-17 2016-04-05 Samsung Electronics Co., Ltd. Multi-level speech recognition
CN105719649B (en) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 Audio recognition method and device
IL252071A0 (en) * 2017-05-03 2017-07-31 Google Inc Contextual language translation
CN110689881B (en) * 2018-06-20 2022-07-12 深圳市北科瑞声科技股份有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN110896664B (en) * 2018-06-25 2023-12-26 谷歌有限责任公司 Hotword aware speech synthesis
CN109815322B (en) * 2018-12-27 2021-03-12 东软集团股份有限公司 Response method and device, storage medium and electronic equipment
CN110517692A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word audio recognition method and device
CN110544477A (en) * 2019-09-29 2019-12-06 北京声智科技有限公司 Voice recognition method, device, equipment and medium
CN110879839A (en) * 2019-11-27 2020-03-13 北京声智科技有限公司 Hot word recognition method, device and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705058A (en) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN116705058B (en) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method of multimode voice task, electronic equipment and readable storage medium
CN117437909A (en) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Speech recognition model construction method based on hotword feature vector self-attention mechanism
CN117437909B (en) * 2023-12-20 2024-03-05 慧言科技(天津)有限公司 Speech recognition model construction method based on hotword feature vector self-attention mechanism

Also Published As

Publication number Publication date
CN113470619A (en) 2021-10-01
CN113470619B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
WO2023273578A1 (en) Speech recognition method and apparatus, and medium and device
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
CN111583903B (en) Speech synthesis method, vocoder training method, device, medium, and electronic device
WO2023273611A1 (en) Speech recognition model training method and apparatus, speech recognition method and apparatus, medium, and device
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
WO2023273612A1 (en) Training method and apparatus for speech recognition model, speech recognition method and apparatus, medium, and device
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
WO2022247562A1 (en) Multi-modal data retrieval method and apparatus, and medium and electronic device
WO2020182123A1 (en) Method and device for pushing statement
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN112883967B (en) Image character recognition method, device, medium and electronic equipment
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
WO2023273610A1 (en) Speech recognition method and apparatus, medium, and electronic device
WO2023143016A1 (en) Feature extraction model generation method and apparatus, and image feature extraction method and apparatus
WO2023005763A1 (en) Information processing method and apparatus, and electronic device
CN111368560A (en) Text translation method and device, electronic equipment and storage medium
CN115270717A (en) Method, device, equipment and medium for detecting vertical position
CN114765025A (en) Method for generating and recognizing speech recognition model, device, medium and equipment
CN111090993A (en) Attribute alignment model training method and device
CN113140012B (en) Image processing method, device, medium and electronic equipment
CN113033707A (en) Video classification method and device, readable medium and electronic equipment
CN112069786A (en) Text information processing method and device, electronic equipment and medium
CN114625876B (en) Method for generating author characteristic model, method and device for processing author information
CN116244431A (en) Text classification method, device, medium and electronic equipment
CN113947060A (en) Text conversion method, device, medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22831406; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 22831406; Country of ref document: EP; Kind code of ref document: A1)