WO2023273578A1 - Speech recognition method, device, medium and equipment - Google Patents

Speech recognition method, device, medium and equipment

Info

Publication number
WO2023273578A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
training
feature vector
words
context
Prior art date
Application number
PCT/CN2022/089595
Other languages
English (en)
French (fr)
Inventor
董林昊
韩明伦
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273578A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a speech recognition method, device, medium and equipment.
  • the present disclosure provides a speech recognition method, the method comprising:
  • according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized is obtained; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained;
  • a target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector includes:
  • for each of the predicted characters, in the attention module, determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, and the fusion feature vector and the text feature vector corresponding to each of the hot words.
  • determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, and the fusion feature vector and the text feature vector corresponding to each of the hot words, includes:
  • the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words is determined as the initial weight corresponding to the hot word;
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, and the fusion feature vector and the text feature vector corresponding to each of the hot words, further includes:
  • weighting the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector includes:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes:
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • determine the phonetic sequence, text sequence and training label of the training word in the following manner:
  • for each of the training words, determine the phonetic sequence of the training word according to the language of the training word, and determine the text sequence from the training marked text;
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • the present disclosure provides a speech recognition device, the device comprising:
  • a receiving module configured to receive voice data to be recognized
  • a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in the first aspect are implemented.
  • an electronic device including:
  • a processing device configured to execute the computer program in the storage device to implement the steps of any one of the methods in the first aspect.
  • FIG. 1 is a flow chart of a speech recognition method provided according to an embodiment of the present disclosure
  • FIG. 2 is a schematic structural diagram of a speech recognition model provided according to an embodiment of the present disclosure
  • FIG. 4 is a block diagram of a speech recognition device provided according to an embodiment of the present disclosure.
  • the term “comprise” and its variations are open-ended, i.e. “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • step 11 voice data to be recognized is received.
  • step 12 according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized is obtained.
  • the hot word information includes text sequences and phonetic symbol sequences corresponding to multiple hot words.
  • the hot word information may be a hot word corresponding to a specific application context, so as to provide prior context knowledge for the recognition process of the speech data to be recognized.
  • the phonetic symbol sequence is used to indicate the pronunciation of the hot word. If the hot word is in Chinese, its phonetic symbol sequence contains toned pinyin; if the hot word is in English, its phonetic symbol sequence includes British or American phonetic symbols, etc.
  • the text sequence of the hot word may be the hot word text itself.
  • the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words. Therefore, in the training process of the context recognition sub-model, the pronunciation features and text features corresponding to the training words can be combined for training at the same time, so that the training features of the context recognition sub-model are more comprehensive and rich, which improves the accuracy of subsequent hot word determination and provides comprehensive data support for it.
  • for example, the hot word information contains two different hot words with similar pronunciation, both of which are romanized as "Zhang Zhiguo".
  • the real text corresponding to the voice data to be recognized should be “Zhang Zhiguo said that she is getting married recently”. If the hot word information contains hot words with similar pronunciation and only text features are used, speech recognition may output this sentence with the wrong one of the two hot words, that is, confusing recognition of hot words will appear.
  • when phonetic symbol sequences are used, the phonetic sequence corresponding to one "Zhang Zhiguo" is "zhang1zhi4guo2" and the phonetic sequence corresponding to the other "Zhang Zhiguo" is "zhang1zhi1guo3", so the hot words with similar pronunciation can be distinguished, and the correct recognition result "Zhang Zhiguo said that she is getting married recently" can be obtained, improving the accuracy of speech recognition.
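As a rough illustration of the example above, the sketch below stores a text sequence and a phonetic symbol sequence for each hot word. The `HotWord` class, the `strip_tones` helper and the bracketed text placeholders are hypothetical names chosen for this sketch; only the two toned pinyin strings are taken from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotWord:
    text_sequence: str      # the hot word text itself
    phonetic_sequence: str  # e.g. toned pinyin for a Chinese hot word


# Placeholder texts: the two hot words are different names that are
# romanized identically; only the pinyin strings come from the example.
hot_word_info = [
    HotWord("Zhang Zhiguo (first hot word)", "zhang1zhi4guo2"),
    HotWord("Zhang Zhiguo (second hot word)", "zhang1zhi1guo3"),
]


def strip_tones(phonetic_sequence: str) -> str:
    """Drop the tone digits, leaving the toneless pinyin."""
    return "".join(ch for ch in phonetic_sequence if not ch.isdigit())


first, second = hot_word_info
# Without tones the two hot words are indistinguishable ...
same_without_tones = strip_tones(first.phonetic_sequence) == strip_tones(second.phonetic_sequence)
# ... but the toned phonetic symbol sequences differ.
differ_with_tones = first.phonetic_sequence != second.phonetic_sequence
```

Here `same_without_tones` and `differ_with_tones` are both true, which is the property the phonetic symbol sequence is meant to exploit.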
  • the speech recognition model used to recognize the speech data to be recognized may include a speech recognition sub-model and a context recognition sub-model, then in the process of speech recognition, speech recognition may be performed based on the speech recognition sub-model.
  • the context recognition sub-model can be combined to improve the accuracy of hot word recognition in the speech data to be recognized, thereby improving the accuracy of speech recognition.
  • when the context recognition sub-model is trained, it is trained in combination with the pronunciation features and text features of the training data. Based on the pronunciation features, hot words with similar spelling or pronunciation can be accurately distinguished. Therefore, when identifying hot words, the multiple features can be combined to accurately recognize the correct one among multiple hot words, which avoids confusing recognition of hot words with similar spelling or pronunciation, further improves the accuracy of speech recognition, and improves user experience.
  • the speech recognition model 10 may include a speech recognition submodel 100 and a context recognition submodel 200, and the context recognition submodel 200 includes a pronunciation feature encoder 201 , a textual feature encoder 202 , an attention module 203 and a contextual feature decoder 204 .
  • the exemplary implementation of obtaining the target text corresponding to the speech data to be recognized is as follows, as shown in Figure 3, this step may include:
  • step 31 the phonetic symbol sequence of the hot word is encoded according to the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and the text sequence of the hot word is encoded according to the text feature encoder to obtain the text feature vector of the hot word.
  • step 32 according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained.
  • the speech recognition sub-model may further include an encoder 101, a prediction sub-model 102 and a decoder 103, wherein the prediction sub-model may be a CIF (Continuous Integrate-and-Fire) model.
  • the voice data per second can be divided into multiple audio frames, so as to perform data processing based on the audio frames.
  • the voice data per second can be divided into 100 audio frames for processing.
  • the obtained acoustic vector sequence H can be expressed as:
  • $H = \{H_1, H_2, \ldots, H_U\}$
  • U is used to represent the number of audio frames in the input speech data to be recognized, that is, the length of the acoustic vector sequence.
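The framing step can be sketched as follows. This is a minimal illustration that assumes non-overlapping frames and a hypothetical helper name `split_into_frames`; the source only states that one second of speech can be divided into 100 audio frames, and practical front ends often use overlapping windows instead.

```python
def split_into_frames(samples, sample_rate, frames_per_second=100):
    """Cut a sample sequence into equal, non-overlapping audio frames.

    With the default of 100 frames per second, each frame covers 10 ms.
    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_length = sample_rate // frames_per_second
    last_start = len(samples) - frame_length
    return [
        samples[start:start + frame_length]
        for start in range(0, last_start + 1, frame_length)
    ]


# One second of dummy audio at 16 kHz gives U = 100 frames of 160 samples.
frames = split_into_frames(list(range(16000)), sample_rate=16000)
```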
  • the acoustic vector can be input into the predicting sub-model, and the predicting sub-model can predict the amount of information on the acoustic vector to obtain the amount of information corresponding to the audio frame.
  • the acoustic vectors of the audio frames may be combined according to the amount of information of multiple audio frames to obtain character acoustic vectors.
  • the amount of information corresponding to each predicted character is the same, so the amounts of information corresponding to the audio frames can be accumulated from left to right. When the accumulated amount of information reaches the preset threshold, the audio frames corresponding to the accumulated amount of information are considered to form one predicted character, and one predicted character corresponds to one or more audio frames.
  • the preset threshold may be set according to actual application scenarios and experience, for example, the preset threshold may be set to 1, which is not limited in the present disclosure.
  • the acoustic vectors of the audio frames may be combined according to the amount of information of multiple audio frames in the following manner:
  • for example, if adding the information amount of the second audio frame makes the accumulated amount exceed the preset threshold, the information amount of the second audio frame can be divided into two parts, that is, one part of the information amount belongs to the current predicted character, and the remaining part belongs to the next predicted character.
  • the remaining part is then accumulated with the information amount W3 of the following audio frame, and so on, until the preset threshold is reached, and the audio frames corresponding to the next predicted character are obtained.
  • the amount of information of the subsequent audio frames can be deduced by analogy, and combined in the above manner to obtain each predicted character corresponding to the plurality of audio frames.
  • the weighted sum of the acoustic vectors of the audio frames corresponding to the predicted character can be determined as the character acoustic vector of that predicted character.
  • the weight of the acoustic vector of each audio frame corresponding to the predicted character is the amount of information that the audio frame contributes to the predicted character. If the audio frame belongs entirely to the predicted character, the weight of its acoustic vector is the information amount of the audio frame; if only part of the audio frame belongs to the predicted character, the weight of its acoustic vector is the information amount of that part.
  • the character acoustic vector C1 corresponding to the predicted character can be expressed as:
  • the character acoustic vector C2 corresponding to the predicted character can be expressed as:
  • the character acoustic vector of each predicted character can be decoded based on the decoder, so as to obtain the text probability distribution of the predicted character.
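The accumulation and weighted-sum steps above can be sketched in plain Python. This is an illustrative reading of the description, not the patented implementation: the function name `integrate_and_fire`, the toy vectors and weights, and the choice to discard any leftover information below the threshold are all assumptions.

```python
def integrate_and_fire(frame_vectors, frame_weights, threshold=1.0):
    """Merge per-frame acoustic vectors into character acoustic vectors.

    Information amounts are accumulated left to right. Each time the
    accumulated amount reaches the threshold, one character vector is
    emitted as the weighted sum of the contributing frame vectors. A frame
    that crosses the boundary is split between two neighbouring characters.
    """
    dim = len(frame_vectors[0])
    character_vectors = []
    accumulated = 0.0
    current = [0.0] * dim
    for vector, weight in zip(frame_vectors, frame_weights):
        remaining = weight
        while accumulated + remaining >= threshold:
            used = threshold - accumulated  # part belonging to this character
            current = [c + used * v for c, v in zip(current, vector)]
            character_vectors.append(current)
            remaining -= used               # part carried to the next character
            accumulated = 0.0
            current = [0.0] * dim
        accumulated += remaining
        current = [c + remaining * v for c, v in zip(current, vector)]
    return character_vectors  # leftover below the threshold is discarded here


# Toy values: W1 = 0.5, W2 = 0.75, W3 = 0.75 with a threshold of 1.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [0.5, 0.75, 0.75]
C = integrate_and_fire(H, W)
# C[0] = 0.5 * H1 + 0.5 * H2 and C[1] = 0.25 * H2 + 0.75 * H3
```

With these toy values the second frame is split: 0.5 of its information completes the first character and the remaining 0.25 starts the second.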
  • step 33 according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, the context feature vector of each predicted character is obtained.
  • in this step, when determining the context feature vector of the predicted character, multiple features of each hot word can be considered comprehensively by combining the pronunciation features and text features corresponding to each hot word, so as to improve the richness and accuracy of the features in the context feature vector.
  • the pronunciation feature vector, text feature vector and character acoustic vector are combined in the attention module to ensure the matching between each hot word and the voice data when performing attention calculation. Its specific calculation method is described below.
  • step 34 the context probability distribution of each predicted character is obtained according to the context feature decoder and the context feature vector.
  • the context feature vector of each predicted character can be decoded based on the context feature decoder, so that the context probability distribution of the predicted character can be obtained.
  • step 35 the target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • each predicted character its text probability distribution and context probability distribution may be weighted and summed, so as to obtain a comprehensive probability distribution corresponding to the predicted character. Then, based on the comprehensive probability distribution, the recognition character corresponding to each predicted character can be determined through a Greedy Search algorithm or a Beam Search algorithm, so as to obtain the target text.
  • the above-mentioned search algorithm is a common method in the art, and will not be repeated here.
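A minimal sketch of step 35 with greedy search is given below. The interpolation weight `context_weight`, the function names and the toy vocabulary are assumptions made for illustration; the source only states that the two distributions are weighted and summed.

```python
def combine_distributions(text_probs, context_probs, context_weight=0.5):
    """Weighted sum of the text and context probability distributions."""
    return [
        (1.0 - context_weight) * t + context_weight * c
        for t, c in zip(text_probs, context_probs)
    ]


def greedy_decode(text_dists, context_dists, vocabulary, context_weight=0.5):
    """Pick the most probable entry per predicted character (greedy search)."""
    characters = []
    for text_probs, context_probs in zip(text_dists, context_dists):
        combined = combine_distributions(text_probs, context_probs, context_weight)
        best_index = max(range(len(combined)), key=combined.__getitem__)
        characters.append(vocabulary[best_index])
    return "".join(characters)


vocabulary = ["a", "b", "c"]
text_dists = [[0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]
context_dists = [[0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
target_text = greedy_decode(text_dists, context_dists, vocabulary)
```

In this toy case the speech recognition sub-model alone would pick "a" for the first character, while the combined distribution picks "b" because the context distribution strongly favours it.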
  • an exemplary implementation of obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector is as follows.
  • This step may include:
  • for each hot word, determining the fusion feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word.
  • the fusion feature vector can be obtained by concatenating the pronunciation feature vector and the text feature vector.
  • for each of the predicted characters, in the attention module, determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, and the fusion feature vector and the text feature vector corresponding to each of the hot words.
  • the attention module can use the character acoustic vector and each fused feature vector to determine the degree of attention of the current predicted character to each hot word, so as to provide data support for subsequent recognition and judgment of hot words.
  • an exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character, and the fusion feature vector and the text feature vector corresponding to each of the hot words, is as follows, and this step may include:
  • the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word is determined as the initial weight corresponding to the hot word.
  • the dot product attention of the character acoustic vector Ci and the fusion feature vector can be calculated based on multi-head attention, and the average of the resulting dot product attention values is used as the comprehensive weight corresponding to the fusion feature vector, that is, the initial weight of the hot word corresponding to the fusion feature vector.
  • the initial weight can be used to represent the degree of attention to each hot word in the character acoustic vector of the predicted character.
  • the initial weight corresponding to each of the hot words is normalized to obtain the target weight corresponding to each of the hot words.
  • the initial weights Q1-Qn corresponding to the hot words can be normalized, for example by performing a softmax calculation on Q1-Qn, so that the weights of the hot words are mapped to a unified standard for measurement. This makes it convenient to compare the target weights corresponding to the hot words, so that the hot word that is more likely to correspond to the predicted character can be determined.
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • the products of the target weight corresponding to each hot word and its text feature vector can be accumulated to obtain the context feature vector, so that a text feature vector with a higher target weight has a more explicit feature representation in the context feature vector.
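The attention computation described above (fusion by concatenation, dot-product initial weights, softmax normalization, weighted sum of text feature vectors) can be sketched as follows. This is a single-head simplification under the assumption that the character acoustic vector has the same dimension as the fusion feature vector; the function names and toy vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def context_feature_vector(char_acoustic, pronunciation_vecs, text_vecs):
    """Return (context feature vector, target weights) for one predicted character."""
    # Fusion feature vector: concatenation of pronunciation and text features.
    fused = [p + t for p, t in zip(pronunciation_vecs, text_vecs)]
    # Initial weight Q_k: dot product of the character vector and each fused vector.
    initial = [sum(c * f for c, f in zip(char_acoustic, fv)) for fv in fused]
    # Target weights: softmax-normalized initial weights.
    target = softmax(initial)
    # Context feature vector: target-weighted sum of the text feature vectors.
    dim = len(text_vecs[0])
    context = [sum(w * tv[i] for w, tv in zip(target, text_vecs)) for i in range(dim)]
    return context, target


pronunciation_vecs = [[1.0, 0.0], [0.0, 1.0]]  # one vector per hot word
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
char_acoustic = [2.0, 0.0, 2.0, 0.0]           # matches the first hot word
context, target = context_feature_vector(char_acoustic, pronunciation_vecs, text_vecs)
```

Because the character vector aligns with the first hot word's fused vector, that hot word receives most of the weight and dominates the context feature vector.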
  • in an exemplary implementation, on the basis of the previous embodiment, this step further includes:
  • the target weights ranked after the M-th position, when the target weights are sorted from large to small, are updated to zero, where M is a positive integer.
  • M can be set according to actual usage scenarios, for example, M can be set to 20.
  • the target weight is used to represent the degree of attention of the current predicted character to each hot word. When the target weight corresponding to a hot word is small and ranked low, the possibility that the predicted character corresponds to that hot word is low, so the target weight of the hot word can be set directly to 0, and the judgment can focus on hot words with higher probability.
  • that is, the target weights are sorted in descending order, the first M target weights are retained, and the target weights ranked after the M-th position are set to zero.
  • correspondingly, an exemplary implementation of weighting the text feature vectors of the hot words to obtain the context feature vector may include:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • in this way, hot words with low probability can be excluded directly based on the target weight, which reduces the amount of computation to a certain extent and improves the efficiency of hot-word-enhanced recognition while ensuring its accuracy.
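The top-M pruning can be sketched as below. The helper name `keep_top_m` is hypothetical, and the sketch does not renormalize the surviving weights, since the source does not mention renormalization.

```python
def keep_top_m(target_weights, m):
    """Keep the M largest target weights and set all others to zero."""
    ranked = sorted(
        range(len(target_weights)),
        key=lambda index: target_weights[index],
        reverse=True,
    )
    kept = set(ranked[:m])
    return [w if i in kept else 0.0 for i, w in enumerate(target_weights)]


updated = keep_top_m([0.1, 0.5, 0.05, 0.35], m=2)
```

The updated weights can then be used in place of the original target weights when summing the text feature vectors.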
  • an example implementation manner of obtaining the context probability distribution of each predicted character is as follows, and this step may include:
  • a target feature vector of the predicted character is obtained according to the character acoustic vector of the predicted character and the context feature vector.
  • the character acoustic vector of the predicted character and the context feature vector may be concatenated to obtain the target feature vector.
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • when the context feature decoder performs decoding, it can decode based on the target feature vector, which contains both the audio features of the input speech data and the relevant features of each hot word. This improves the degree of matching between the context probability distribution and both the speech data to be recognized and the hot words, provides accurate and comprehensive data support for the subsequent determination of the target text, further improves the diversity of features in the speech recognition process, and thus improves the accuracy of the speech recognition results.
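A minimal sketch of this decoding step follows. The source does not specify the structure of the context feature decoder, so the single linear layer followed by softmax, the function names and the toy parameters are assumptions used only to make the data flow concrete.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_context(char_acoustic, context_vec, weights, bias):
    """Concatenate the two vectors and map them to a distribution over the vocabulary."""
    target = list(char_acoustic) + list(context_vec)  # target feature vector
    logits = [
        sum(w * x for w, x in zip(row, target)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical decoder parameters for a three-entry vocabulary.
weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0, 0.0],
]
bias = [0.0, 0.0, 0.0]
context_probs = decode_context([1.0, 0.0], [0.0, 1.0], weights, bias)
```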
  • the phonetic sequence, text sequence and training label of the training word are determined in the following manner:
  • an N-gram word may be randomly extracted from the text, and this word may be used as a candidate word. Part of the words can then be randomly selected from the candidate words as the training words.
  • candidate texts may be determined from the training labeled text of the training sample, and then for each candidate text, an N-gram word may be randomly extracted from the candidate text as a training word. This ensures the diversity and randomness of the training words.
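The random extraction of training words can be sketched as follows. This follows the first variant (extract one N-gram per text as a candidate, then randomly keep part of the candidates). Whitespace tokenization, the `keep_ratio` parameter and the function name are assumptions; for Chinese text the N-gram units would more plausibly be characters or segmented words.

```python
import random


def extract_training_words(labeled_texts, n=2, keep_ratio=0.5, seed=0):
    """Return (labeled text, training word) pairs drawn at random."""
    rng = random.Random(seed)
    candidates = []
    for text in labeled_texts:
        tokens = text.split()
        if len(tokens) < n:
            continue  # too short to contain an N-gram
        start = rng.randrange(len(tokens) - n + 1)
        candidates.append((text, " ".join(tokens[start:start + n])))
    if not candidates:
        return []
    keep = max(1, round(len(candidates) * keep_ratio))
    return rng.sample(candidates, keep)


labeled_texts = [
    "convex optimization theory is an important course",
    "speech recognition converts audio into text",
    "hot words provide prior context knowledge",
    "the model is trained on labeled samples",
]
pairs = extract_training_words(labeled_texts, n=2, keep_ratio=0.5, seed=0)
```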
  • the phonetic sequence of the training word is determined according to the language of the training word, and the text sequence is determined from the training marked text.
  • the training word itself, extracted from the training marked text, can be used directly as the text sequence, and the type of phonetic symbol sequence corresponding to each language can be preset, for example setting the phonetic symbol sequence corresponding to Chinese to a pinyin sequence and the phonetic symbol sequence corresponding to English to a sequence of British phonetic symbols.
  • the corresponding phonetic sequence can be determined by querying based on an electronic dictionary.
  • for a Chinese training word, a Chinese dictionary can be queried for each character in the word to obtain the toned pinyin corresponding to each character, and the toned pinyin of the characters can then be concatenated to obtain the phonetic sequence of the word; alternatively, the dictionary can be queried directly with the whole word to obtain its phonetic sequence. For example, for the training word "convex optimization theory", the corresponding phonetic sequence is expressed as "tu1you1hua4li3lun4".
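The character-by-character dictionary lookup can be sketched as below. The tiny `TONED_PINYIN` table stands in for a real electronic dictionary, and the Chinese characters are an assumed back-translation of "convex optimization theory" chosen to match the phonetic sequence in the example; neither appears in the source.

```python
# Hypothetical miniature dictionary: character -> toned pinyin.
TONED_PINYIN = {
    "凸": "tu1",
    "优": "you1",
    "化": "hua4",
    "理": "li3",
    "论": "lun4",
}


def phonetic_sequence(word, dictionary=TONED_PINYIN):
    """Look up each character and concatenate the toned pinyin."""
    missing = [ch for ch in word if ch not in dictionary]
    if missing:
        raise KeyError(f"no pinyin entry for: {missing}")
    return "".join(dictionary[ch] for ch in word)


sequence = phonetic_sequence("凸优化理论")
```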
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • for example, if the training marked text is "convex optimization theory is an important course" and the training word extracted from it is "convex optimization theory", the text other than the training word, that is, "is an important course", is replaced with the preset label to obtain the training label "convex optimization theory *****".
  • the preset label has no actual meaning, and it is used to represent that the corresponding text here has no corresponding prior contextual knowledge.
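Training label generation can be sketched as follows. The source does not state how many preset labels replace a given stretch of text, so this sketch assumes one preset label per whitespace-separated token; the function name and the `"*"` default are likewise assumptions, and the output therefore differs in length from the "*****" shown in the example.

```python
def make_training_label(labeled_text, training_word, preset_label="*"):
    """Replace everything except the training word with preset labels."""
    start = labeled_text.find(training_word)
    if start < 0:
        raise ValueError("training word not found in labeled text")
    end = start + len(training_word)
    before = [preset_label] * len(labeled_text[:start].split())
    after = [preset_label] * len(labeled_text[end:].split())
    return " ".join(before + [training_word] + after)


label = make_training_label(
    "convex optimization theory is an important course",
    "convex optimization theory",
)
```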
  • in this way, the training samples can be processed automatically to obtain training data for training the context recognition sub-model. When the training data is obtained, the text features and pronunciation features corresponding to the training words can be extracted at the same time, so that each word can be identified more accurately and words with similar spelling or pronunciation can be distinguished. This provides more comprehensive and reliable feature information for the training of the context recognition sub-model and improves, to a certain extent, the accuracy of the resulting context recognition sub-model.
  • the speech recognition model can be trained in the following manner:
  • the phonetic sequence, text sequence and training labels of the training words in the training samples are determined.
  • the speech recognition sub-model and the context recognition sub-model may be trained separately.
  • the speech recognition sub-model can be trained by using the training speech data as input and the corresponding training marked text as the target output.
  • the training may be performed based on commonly used training methods in this field, which will not be repeated here.
  • the context recognition sub-model can be further trained based on the trained speech recognition sub-model.
  • the context recognition sub-model is used to determine the prior context knowledge corresponding to the hot words. Therefore, in the embodiment of the present disclosure, the corresponding training words can be obtained directly from the training samples, so that training is performed based on the prior context knowledge of the training words. This ensures the diversity of training words and can improve the stability and generalization of the model.
  • the phonetic sequence of the training word is used to represent the pronunciation of the training word. If the training sample is in Chinese, its phonetic sequence contains toned pinyin; if the training sample is in English, its phonetic sequence contains British or American phonetic symbols, etc.
  • the text sequence of the training word may be the recognition text corresponding to the training word.
  • the training label of the training word is used to represent the target output corresponding to the training word.
  • according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, the context feature vector of each predicted character is obtained;
  • the target loss of the context recognition sub-model can be determined according to the output text and the training labels corresponding to the training words.
  • the cross-entropy loss can be calculated according to the output text and the training label, and the cross-entropy loss can be determined as the target loss.
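As a small worked illustration of the target loss, the sketch below computes the mean cross-entropy between per-character predicted distributions and the indices of the training label entries. The function name, the averaging over characters and the toy numbers are assumptions; the source only states that a cross-entropy loss is used.

```python
import math


def cross_entropy(predicted_dists, target_indices, floor=1e-12):
    """Mean negative log-probability assigned to the training label entries."""
    total = 0.0
    for dist, index in zip(predicted_dists, target_indices):
        total -= math.log(max(dist[index], floor))  # floor avoids log(0)
    return total / len(target_indices)


predicted = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
target_loss = cross_entropy(predicted, labels)
```

For these toy values the loss is (ln 2 + ln(4/3)) / 2, roughly 0.49; a perfect prediction would give a loss of 0.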
  • the update condition may be that the target loss is greater than a preset loss threshold, which means that the recognition accuracy of the context recognition sub-model is insufficient.
  • the update condition may be that the number of iterations is less than a preset number threshold, and at this time, it is considered that the number of iterations of the context recognition sub-model is relatively small, and its recognition accuracy is insufficient.
  • the model parameters of the context recognition sub-model can be updated according to the target loss.
  • the method of updating the model parameters based on the determined loss may adopt a commonly used updating method in the field, such as the gradient descent method, which will not be repeated here.
  • when the update condition is no longer satisfied, the training process can be stopped to obtain the trained context recognition sub-model, and thereby the trained speech recognition model.
  • the model parameters of the trained speech recognition sub-model are kept unchanged.
  • in this way, the context recognition sub-model can be added on top of an already trained speech recognition sub-model to realize the speech recognition model, which improves the scalability and application range of the training method. At the same time, on the basis of ensuring the accuracy of the speech recognition sub-model, more accurate prior context knowledge is provided to improve the recognition accuracy of the speech recognition model.
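The training procedure above (check an update condition, update the context recognition sub-model by gradient descent, keep the speech recognition sub-model fixed) can be sketched with a toy example. Everything concrete here is an assumption: the model is reduced to two scalar parameters, the quadratic loss stands in for the cross-entropy target loss, and the names `should_update` and `train_context_submodel` are hypothetical.

```python
def should_update(target_loss, iteration, loss_threshold=None, iteration_threshold=None):
    """Evaluate one of the two update conditions described in the text."""
    if loss_threshold is not None:
        return target_loss > loss_threshold      # accuracy still insufficient
    if iteration_threshold is not None:
        return iteration < iteration_threshold   # too few iterations so far
    raise ValueError("supply a loss threshold or an iteration threshold")


def train_context_submodel(params, learning_rate=0.1, loss_threshold=1e-4, safety_cap=1000):
    """Toy gradient descent that only touches the 'context' parameter."""
    speech = params["speech"]      # trained speech sub-model: kept unchanged
    context = params["context"]    # context sub-model: the only value updated
    iteration = 0
    loss = (context - 3.0) ** 2    # stand-in for the cross-entropy target loss
    while should_update(loss, iteration, loss_threshold=loss_threshold) and iteration < safety_cap:
        gradient = 2.0 * (context - 3.0)
        context -= learning_rate * gradient
        loss = (context - 3.0) ** 2
        iteration += 1
    return {"speech": speech, "context": context}, loss, iteration


trained, final_loss, steps = train_context_submodel({"speech": 1.5, "context": 0.0})
```

Freezing the speech parameter mirrors the statement that the trained speech recognition sub-model's parameters are kept unchanged while the context recognition sub-model is trained.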
  • the present disclosure also provides a speech recognition device, as shown in FIG. 4 , the device 40 includes:
  • the receiving module 41 is used to receive voice data to be recognized
  • the processing module 42 is used to obtain the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • the processing module includes:
  • the first processing submodule is used to obtain the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized according to the speech recognition submodel and the speech data to be recognized;
  • the first determining submodule is configured to determine the target text corresponding to the data to be recognized according to the text probability distribution and the context probability distribution.
  • the second processing submodule includes:
  • the fourth determining submodule is used to determine the dot product of the character acoustic vector and the fused feature vector corresponding to each of the hot words as the initial weight corresponding to the hot word;
  • the third processing submodule is used to normalize the initial weight corresponding to each of the hot words, and obtain the target weight corresponding to each of the hot words;
  • the calculation sub-module is used for:
  • the text feature vector of the hot word is weighted and calculated according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • the first decoding submodule includes:
  • the fourth processing submodule is used to obtain, for each predicted character, the target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
  • the second decoding submodule is configured to decode the target feature vector according to the context feature decoder to obtain a context probability distribution corresponding to each of the predicted characters.
  • determine the phonetic sequence, text sequence and training label of the training word in the following manner:
  • For each of the training words determine the phonetic sequence of the training words according to the language of the training words, and determine the text sequence from the training marked text;
  • texts other than the training words in the training tagged text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure.
  • the terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 5 is only an example, and should not limit the functions and scope of use of the embodiments of the present disclosure.
  • an electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600.
  • the processing device 601, ROM 602, and RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604 .
  • the following devices can be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609.
  • the communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 5 shows electronic device 600 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs.
  • when the one or more programs are executed by the electronic device, the electronic device is caused to: receive the speech data to be recognized; and obtain, according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on the training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of the module does not constitute a limitation on the module itself under certain circumstances, for example, the receiving module can also be described as "a module that receives speech data to be recognized".
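  • exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.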
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • Example 1 provides a speech recognition method, the method comprising:
  • obtaining, according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 2 provides the method of Example 1, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • the speech recognition sub-model and the speech data to be recognized obtain the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized;
  • a target text corresponding to the data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • Example 4 provides the method of Example 3, wherein determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each of the hot words includes:
  • the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words is determined as the initial weight corresponding to the hot word;
  • the text feature vectors of the hot words are weighted and calculated according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • Example 5 provides the method of Example 4, wherein determining the context feature vector corresponding to the predicted character further includes:
  • the text feature vector of the hot word is weighted and calculated according to the target weight corresponding to each of the hot words, and the context feature vector is obtained, including:
  • the text feature vectors of the hot words are weighted and calculated according to the updated target weights corresponding to each of the hot words to obtain the context feature vector.
  • Example 6 provides the method of Example 2, wherein obtaining the context probability distribution of each of the predicted characters according to the context feature decoder and the context feature vector includes:
  • the target feature vector is decoded according to the context feature decoder to obtain a context probability distribution corresponding to each predicted character.
  • Example 7 provides the method of any one of Examples 1-6, which determines the phonetic sequence, text sequence and training label of the training word in the following manner:
  • texts other than the training words in the training labeled text corresponding to the training words are replaced with preset labels, so as to generate training labels corresponding to the training words.
  • Example 8 provides a speech recognition device, the device comprising:
  • a receiving module configured to receive voice data to be recognized
  • a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 9 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in Examples 1-7 are implemented .
  • Example 10 provides an electronic device, comprising:
  • a processing device configured to execute the computer program in the storage device, so as to implement the steps of the method in any one of examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

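A speech recognition method, device, computer-readable medium and equipment, wherein the method comprises: receiving speech data to be recognized (11); and obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized (12); wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words. The context recognition sub-model is thus trained on both the pronunciation features and the text features of the training data, so hot words with similar spelling or pronunciation can be accurately distinguished based on the pronunciation features. Confusion between hot words is therefore avoided during recognition, which further improves the accuracy of speech recognition.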

Description

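Speech recognition method, device, medium and equipment
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202110735672.7, filed on June 30, 2021 and entitled "Speech recognition method, device, medium and equipment", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a speech recognition method, device, medium and equipment.
Background
With the rise of deep learning, methods that rely entirely on neural networks for end-to-end modeling have emerged and gradually become the mainstream in automatic speech recognition (ASR). ASR converts raw speech data directly into the corresponding text. In the related art, prior context knowledge based on hot words is commonly used to improve recognition accuracy. However, when such hot-word prior context knowledge is used, hot words with similar spelling or pronunciation are easily confused, so the accuracy of speech recognition is insufficient.
Summary
This Summary is provided to introduce concepts in a simplified form that are described in detail in the Detailed Description below. It is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.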
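In a first aspect, the present disclosure provides a speech recognition method, the method comprising:
receiving speech data to be recognized;
obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
Optionally, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
the obtaining, according to the speech data to be recognized, the hot word information and the speech recognition model, of the target text corresponding to the speech data to be recognized includes:
encoding the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain a pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain a text feature vector of the hot word;
obtaining, according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized;
obtaining, according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, a context feature vector of each predicted character;
obtaining, according to the context feature decoder and the context feature vector, a context probability distribution of each predicted character;
determining, according to the text probability distribution and the context probability distribution, the target text corresponding to the speech data to be recognized.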
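Optionally, the obtaining, according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector, of the context feature vector of each predicted character includes:
for each hot word, determining a fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.
Optionally, the determining of the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word includes:
determining the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
normalizing the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word;
computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.
Optionally, the determining of the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word further includes:
updating to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, M being a positive integer;
the computing of the weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector includes:
computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.
Optionally, the obtaining, according to the context feature decoder and the context feature vector, of the context probability distribution of each predicted character includes:
for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
Optionally, the phonetic symbol sequence, text sequence and training label of the training word are determined as follows:
determining the training word from the training annotated text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotated text;
for each training word, replacing the text other than the training word in the training annotated text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.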
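In a second aspect, the present disclosure provides a speech recognition device, the device comprising:
a receiving module, configured to receive speech data to be recognized;
a processing module, configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
In a third aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any method of the first aspect are implemented.
In a fourth aspect, an electronic device is provided, comprising:
a storage device on which a computer program is stored;
a processing device, configured to execute the computer program in the storage device so as to implement the steps of any method of the first aspect.
Thus, in the above technical solution, the speech recognition model used to recognize the speech data may contain a speech recognition sub-model and a context recognition sub-model. Speech recognition is performed by the speech recognition sub-model, while the context recognition sub-model improves the accuracy with which hot words in the speech data are recognized, and hence the overall recognition accuracy. Moreover, the context recognition sub-model is trained on both the pronunciation features and the text features of the training data, so hot words with similar spelling or pronunciation can be accurately distinguished based on the pronunciation features. When recognizing hot words, these multiple features can be combined to pick the right hot word among many, avoiding confusion between hot words with similar spelling or pronunciation, further improving recognition accuracy and the user experience.
Other features and advantages of the present disclosure are described in detail in the Detailed Description below.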
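Brief Description of the Drawings
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a speech recognition model provided according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of an exemplary implementation of obtaining the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model;
FIG. 4 is a block diagram of a speech recognition device provided according to an embodiment of the present disclosure;
FIG. 5 shows a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.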
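Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. The drawings and embodiments of the present disclosure are for illustration only and are not intended to limit its scope of protection.
It should be understood that the steps described in the method embodiments of the present disclosure may be executed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
As used herein, the term "include" and its variants are open-ended, i.e. "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of, or the interdependence between, the functions performed by these devices, modules or units.
Note that the modifiers "a/an" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art will understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.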
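FIG. 1 is a flowchart of a speech recognition method provided according to an embodiment of the present disclosure. As shown in FIG. 1, the method may include:
In step 11, speech data to be recognized is received.
In step 12, target text corresponding to the speech data to be recognized is obtained according to the speech data to be recognized, hot word information and a speech recognition model.
The hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words. The hot words may be those associated with a specific application context, so as to provide prior context knowledge for recognizing the speech data. The phonetic symbol sequence represents the pronunciation of a hot word: when the hot word is Chinese, its phonetic symbol sequence contains toned pinyin; when the hot word is English, its phonetic symbol sequence contains British or American phonetic symbols. The text sequence of a hot word can be the text of the hot word itself.
The speech recognition model includes a speech recognition sub-model and a context recognition sub-model. The speech recognition sub-model recognizes the speech information in the speech data to be recognized, and the context recognition sub-model recognizes its context information, i.e. it identifies the hot-word features contained in the speech data.
Specifically, the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words. Its training therefore combines the pronunciation features and the text features of the training words, which makes the training features more comprehensive and richer and provides accurate and comprehensive data support for subsequent hot-word decisions.
For example, suppose the hot word information contains the hot words "***" and "章芝果", and the true text of the speech data to be recognized is "章芝果说她最近要结婚了" ("章芝果 said she is getting married soon"). In the related art, because the hot word information contains hot words with similar pronunciation, recognition may output "***说她最近要结婚了", i.e. the hot words are confused. With the above technical solution, hot-word recognition can be enhanced with the pronunciation features of the hot words: the phonetic symbol sequence of "***" is "zhang1zhi4guo2" and that of "章芝果" is "zhang1zhi1guo3", so the similarly pronounced hot words can be told apart and the result "章芝果说她最近要结婚了" is obtained, improving recognition accuracy.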
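The hot word information in this example can be pictured as a list of (text sequence, phonetic symbol sequence) pairs. The sketch below is only an illustration: the `HotWord` class and the `strip_tones` helper are hypothetical names that do not appear in the disclosure. It shows that the two phonetic symbol sequences quoted above share the same base syllables and differ only in their tone digits, which is the information the pronunciation features add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HotWord:
    """One hot word entry (hypothetical container, not from the disclosure)."""
    text: str      # text sequence: the hot word itself
    phonetic: str  # phonetic symbol sequence, e.g. toned pinyin for Chinese


def strip_tones(phonetic: str) -> str:
    """Drop the tone digits from a toned-pinyin string."""
    return "".join(ch for ch in phonetic if not ch.isdigit())


# The hot word whose text is given in the example above.
hot_word = HotWord(text="章芝果", phonetic="zhang1zhi1guo3")

# Phonetic symbol sequence of the other, similarly pronounced hot word.
other_phonetic = "zhang1zhi4guo2"

# Same syllables once tones are removed, but different toned sequences.
same_syllables = strip_tones(hot_word.phonetic) == strip_tones(other_phonetic)
differ_with_tones = hot_word.phonetic != other_phonetic
```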
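Thus, in the above technical solution, the speech recognition model used to recognize the speech data may contain a speech recognition sub-model and a context recognition sub-model. Speech recognition is performed by the speech recognition sub-model, while the context recognition sub-model improves the accuracy with which hot words in the speech data are recognized, and hence the overall recognition accuracy. Moreover, the context recognition sub-model is trained on both the pronunciation features and the text features of the training data, so hot words with similar spelling or pronunciation can be accurately distinguished based on the pronunciation features. When recognizing hot words, these multiple features can be combined to pick the right hot word among many, avoiding confusion between hot words with similar spelling or pronunciation, further improving recognition accuracy and the user experience.
In a possible embodiment, as shown in FIG. 2, the speech recognition model 10 may include a speech recognition sub-model 100 and a context recognition sub-model 200, and the context recognition sub-model 200 includes a pronunciation feature encoder 201, a text feature encoder 202, an attention module 203 and a context feature decoder 204. Accordingly, an exemplary implementation of step 12, obtaining the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model, is as follows. As shown in FIG. 3, this step may include:
In step 31, the phonetic symbol sequence of the hot word is encoded with the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and the text sequence of the hot word is encoded with the text feature encoder to obtain the text feature vector of the hot word.
In step 32, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained according to the speech recognition sub-model and the speech data to be recognized.
For example, as shown in FIG. 2, the speech recognition sub-model may further include an encoder 101, a prediction sub-model 102 and a decoder 103, where the prediction sub-model may be a CIF (Continuous Integrate-and-Fire) model.
Typically, each second of speech data is split into multiple audio frames so that processing is done per frame; for example, each second of speech data can be split into 100 audio frames. Accordingly, the encoder encodes the audio frames of the speech data to be recognized, and the resulting acoustic vector sequence H can be expressed as:
H: {H1, H2, …, HU}, where U denotes the number of audio frames in the input speech data to be recognized, i.e. the length of the acoustic vector sequence.
After that, the character acoustic vectors corresponding to the speech data to be recognized can be obtained according to the acoustic vectors and the prediction sub-model.
For example, the acoustic vectors can be input into the prediction sub-model, which predicts an information amount for each acoustic vector, i.e. the information amount of the corresponding audio frame. The acoustic vectors of the audio frames can then be merged according to the information amounts of multiple audio frames to obtain the character acoustic vectors.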
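In the embodiments of the present disclosure, every predicted character is assumed by default to carry the same information amount, so the information amounts of the audio frames can be accumulated from left to right. When the accumulated information amount reaches a preset threshold, the audio frames covered by the accumulation are considered to form one predicted character; one predicted character corresponds to one or more audio frames. The preset threshold can be set according to the application scenario and experience; for example, it may be set to 1, which is not limited by the present disclosure.
In a possible embodiment, the acoustic vectors of the audio frames can be merged according to the information amounts of multiple audio frames as follows:
following the order of the audio frames, obtain the information amount Wi of each audio frame i in turn;
if Wi is less than the preset threshold β, take the next audio frame as the current audio frame, i.e. i=i+1, and accumulate the information amounts of the frames traversed so far. If the accumulated sum is greater than the preset threshold, a character boundary is considered to have occurred, i.e. part of the currently traversed audio frame belongs to the current predicted character and the other part belongs to the next predicted character.
For example, if W1+W2 is greater than β, a character boundary is considered to have occurred: the first audio frame and part of the second audio frame correspond to one predicted character, whose boundary lies within the second audio frame. The information amount of the second audio frame is then split into two parts: one part belongs to the current predicted character and the remainder belongs to the next predicted character.
Accordingly, the part W21 of the second frame's information amount W2 that belongs to the current predicted character can be expressed as W21=β-W1, and the part W22 that belongs to the next predicted character can be expressed as W22=W2-W21.
Traversal of the frame information amounts then continues, with accumulation resuming from the remaining information amount of the second audio frame: W22 from the second audio frame and W3 from the third audio frame are accumulated, and so on until the preset threshold β is reached, which yields the audio frames of the next predicted character. Subsequent audio frames are merged in the same way to obtain the predicted characters corresponding to the audio frames.
On this basis, once the correspondence between predicted characters and audio frames has been determined, the character acoustic vector of each predicted character can be determined as the weighted sum of the acoustic vectors of the audio frames belonging to that predicted character. The weight of each audio frame's acoustic vector is the information amount that the frame contributes to the predicted character: if the frame belongs entirely to the predicted character, the weight is the frame's full information amount; if only part of the frame belongs to the predicted character, the weight is the information amount of that part.
In the example above, the first predicted character contains the first audio frame and part of the second audio frame, so its character acoustic vector C1 can be expressed as:
C1=W1*H1+W21*H2;
Likewise, the second predicted character contains the remaining part of the second audio frame and the third audio frame, so its character acoustic vector C2 can be expressed as:
C2=W22*H2+W3*H3.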
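The accumulate-and-split procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the CIF implementation of the disclosure: the function name `integrate_and_fire` is hypothetical, vectors are plain Python lists, a character is emitted as soon as the accumulated information amount reaches or exceeds β, and any information left over at the end of the utterance is simply discarded.

```python
def integrate_and_fire(weights, frames, beta=1.0):
    """Merge frame-level acoustic vectors into character acoustic vectors.

    weights: information amounts W_1..W_U, one non-negative float per frame.
    frames:  acoustic vectors H_1..H_U, each a list of floats.
    beta:    preset threshold (must be > 0; the text gives 1 as an example).
    """
    if not frames:
        return []
    dim = len(frames[0])
    chars = []               # finished character acoustic vectors C_1, C_2, ...
    acc_w = 0.0              # information accumulated for the current character
    acc_v = [0.0] * dim      # weighted sum of frame vectors accumulated so far
    for w, h in zip(weights, frames):
        remaining = w
        # A frame that crosses the threshold is split at the character boundary.
        while acc_w + remaining >= beta:
            used = beta - acc_w          # e.g. W21 = beta - W1
            chars.append([a + used * x for a, x in zip(acc_v, h)])
            remaining -= used            # e.g. W22 = W2 - W21
            acc_w, acc_v = 0.0, [0.0] * dim
        # Whatever is left of this frame starts or extends the next character.
        acc_w += remaining
        acc_v = [a + remaining * x for a, x in zip(acc_v, h)]
    return chars


# Three frames with W = (0.5, 0.75, 0.75) and beta = 1:
# C1 = 0.5*H1 + 0.5*H2 and C2 = 0.25*H2 + 0.75*H3.
example = integrate_and_fire(
    [0.5, 0.75, 0.75],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

With these inputs the second frame is split into W21 = 0.5 and W22 = 0.25, matching the formulas for C1 and C2 above.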
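After that, the character acoustic vector of each predicted character can be decoded by the decoder to obtain the text probability distribution of that predicted character.
Returning to FIG. 3, in step 33, the context feature vector of each predicted character is obtained according to the attention module, the pronunciation feature vectors, the text feature vectors and the character acoustic vector.
In this step, the context feature vector of a predicted character is determined by combining the pronunciation features and text features of each hot word, so that multiple features of each hot word are considered together, which improves the richness and accuracy of the features in the context feature vector. Combining the pronunciation feature vectors, text feature vectors and character acoustic vector in the attention module also ensures that the attention computation reflects how well each hot word matches the speech data. The specific computation is described below.
In step 34, the context probability distribution of each predicted character is obtained according to the context feature decoder and the context feature vector.
That is, the context feature vector of each predicted character can be decoded by the context feature decoder to obtain the context probability distribution of that predicted character.
In step 35, the target text corresponding to the speech data to be recognized is determined according to the text probability distribution and the context probability distribution.
As an example, for each predicted character, a weighted sum of its text probability distribution and its context probability distribution can be computed to obtain a combined probability distribution for that predicted character. Based on the combined probability distribution, the recognized character for each predicted character can then be determined with a greedy search or beam search algorithm to obtain the target text. These search algorithms are commonly used in the field and are not described further here.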
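The weighted combination followed by greedy search can be sketched as below. The interpolation weight `lam`, the function name, and the assumption that both distributions are defined over the same vocabulary are illustrative choices; the disclosure only states that a weighted sum is taken and does not fix the weights.

```python
def fuse_and_decode_greedy(text_probs, context_probs, vocab, lam=0.5):
    """Combine two per-character distributions and pick the best token greedily.

    text_probs, context_probs: one probability list per predicted character,
        assumed here to be aligned over the same vocabulary.
    vocab: list of tokens, indexed like the probability lists.
    lam:   hypothetical weight given to the context distribution.
    """
    output = []
    for p_text, p_ctx in zip(text_probs, context_probs):
        combined = [(1.0 - lam) * a + lam * b for a, b in zip(p_text, p_ctx)]
        # Greedy search: take the single most probable token at each position.
        best = max(range(len(combined)), key=combined.__getitem__)
        output.append(vocab[best])
    return "".join(output)


vocab = ["a", "b", "c"]
text_probs = [[0.6, 0.3, 0.1]]
context_probs = [[0.1, 0.8, 0.1]]
with_context = fuse_and_decode_greedy(text_probs, context_probs, vocab, lam=0.5)
without_context = fuse_and_decode_greedy(text_probs, context_probs, vocab, lam=0.0)
```

In this toy case the context distribution changes the decision for the single predicted character; beam search would keep several candidate sequences instead of one.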
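Thus, with the above technical solution, hot-word-enhanced recognition can be applied during the recognition of every predicted character in the speech data, which improves the granularity and accuracy of speech recognition, can also improve its real-time performance to some extent, and improves the user experience.
In a possible embodiment, an exemplary implementation of obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vectors, the text feature vectors and the character acoustic vector is as follows. This step may include:
for each hot word, determining a fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word.
The fused feature vector can be obtained by concatenating the pronunciation feature vector and the text feature vector.
For each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.
In the attention module, the character acoustic vector and the fused feature vectors are used to determine how much attention the current predicted character pays to each hot word, which provides data support for the subsequent hot-word decision.
In a possible embodiment, an exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word is as follows. This step may include:
determining the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as the initial weight corresponding to that hot word.
As an example, for a character acoustic vector Ci, dot products can be computed between Ci and the fused feature vectors T1 to Tn of the n hot words: the dot product Q1 of Ci and T1 is taken as the initial weight of T1, the dot product Q2 of Ci and T2 as the initial weight of T2, and so on, giving an initial weight for every hot word. Specifically, when computing the initial weight of a hot word, the dot-product attention between Ci and the fused feature vector can be computed with multi-head attention, and the average of the resulting dot-product attention values is used as the combined weight of that fused feature vector, i.e. the initial weight of the corresponding hot word. The initial weight thus represents the degree of attention that the character acoustic vector of the predicted character pays to each hot word.
Normalizing the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word.
For example, to measure the attention paid to each hot word more accurately, the initial weights Q1 to Qn of the hot words can be normalized, for instance with a softmax. This maps the weights of all hot words onto a common scale, which makes the target weights of the hot words easy to compare and thus reveals which hot word the predicted character most likely corresponds to.
After that, a weighted sum of the text feature vectors of the hot words is computed according to the target weight corresponding to each hot word to obtain the context feature vector.
In this embodiment, the products of each hot word's target weight and its text feature vector are summed to obtain the context feature vector, so a text feature vector with a higher target weight is represented more strongly in the context feature vector.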
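The three steps above (dot product, normalization, weighted sum) can be sketched for a single predicted character as follows. This is a single-head illustration with hypothetical function names; it assumes the character acoustic vector has already been brought to the same dimension as the fused feature vectors, and it omits the multi-head averaging mentioned in the text.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_feature_vector(char_vec, fused_vecs, text_vecs):
    """Return (target_weights, context_vector) for one predicted character.

    char_vec:   character acoustic vector C_i.
    fused_vecs: fused feature vectors T_1..T_n, one per hot word.
    text_vecs:  text feature vectors of the same n hot words.
    """
    initial = [dot(char_vec, t) for t in fused_vecs]   # Q_1..Q_n
    target = softmax(initial)                          # normalized weights
    dim = len(text_vecs[0])
    context = [
        sum(w * e[k] for w, e in zip(target, text_vecs)) for k in range(dim)
    ]
    return target, context


# Two hot words; the character vector aligns with the first fused vector.
weights, ctx = context_feature_vector(
    [1.0, 0.0],
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Note that the scores are computed against the fused vectors, while the weighted sum is taken over the text feature vectors, as described above.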
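Thus, with the above technical solution, the target weight of each hot word is determined from a fused feature vector that contains both the pronunciation feature vector and the text feature vector. The features provided are richer and the resulting target weights more accurate, which improves the accuracy of the context feature vector, improves to some extent the accuracy with which the context recognition sub-model recognizes the input hot words, makes hot words with similar spelling or pronunciation distinguishable, and safeguards recognition accuracy.
In a possible embodiment, another exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word builds on the previous embodiment and further includes:
updating to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, M being a positive integer. M can be set according to the actual usage scenario; for example, M may be set to 20.
The target weight represents the attention that the current predicted character pays to each hot word. When a hot word's target weight is small and ranks low, the predicted character is unlikely to correspond to that hot word, so its target weight can be set directly to 0 and the hot-word decision can focus on the more likely hot words.
Specifically, when updating the target weights, they are sorted in descending order, the top M target weights are kept, and the target weights ranked after M are set to zero.
Accordingly, an exemplary implementation of computing the weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector may include:
computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.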
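The top-M update can be sketched as follows. The function name is hypothetical, ties are broken by original position, and the kept weights are not renormalized, since the text does not mention renormalization.

```python
def keep_top_m(target_weights, m):
    """Zero every target weight that ranks after position m in descending order."""
    order = sorted(
        range(len(target_weights)),
        key=lambda i: target_weights[i],
        reverse=True,
    )
    kept = set(order[:m])
    return [w if i in kept else 0.0 for i, w in enumerate(target_weights)]


# With M = 2 only the two largest weights survive.
updated = keep_top_m([0.1, 0.5, 0.3, 0.1], 2)
```

The updated weights are then used in place of the original target weights when the weighted sum of text feature vectors is computed.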
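Thus, with the above technical solution, unlikely hot words can be excluded directly on the basis of the target weights, which reduces the amount of computation to some extent and improves the efficiency of hot-word-enhanced recognition while preserving its accuracy.
In a possible embodiment, an exemplary implementation of obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector is as follows. This step may include:
for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector.
As an example, the character acoustic vector of the predicted character and the context feature vector can be concatenated to obtain the target feature vector.
Decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
Thus, with the above technical solution, the context feature decoder decodes a target feature vector that contains both the audio features of the input speech data and the relevant features of the hot words. This improves how well the context probability distribution matches both the speech data to be recognized and the hot words, provides accurate and comprehensive data support for the subsequent determination of the target text, further increases the diversity of features used during recognition, and thereby improves the accuracy of the recognition result.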
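In a possible embodiment, the phonetic symbol sequence, text sequence and training label of a training word are determined as follows:
determining the training word from the training annotated text of each training sample.
As an example, for the training annotated text of each training sample, one N-gram can be randomly extracted from the text and used as a candidate word; some of the candidate words can then be randomly selected as training words. As another example, candidate texts can be selected from the training annotated texts of the training samples, and for each candidate text one N-gram can be randomly extracted from it as a training word. This ensures the diversity and randomness of the training words.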
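Random N-gram extraction can be sketched as below. The sketch assumes character-level N-grams over a non-empty annotated text and a hypothetical upper bound `max_n` on N; the disclosure does not specify the N-gram unit or the range of N.

```python
import random


def sample_training_word(annotated_text, max_n=4, rng=random):
    """Randomly draw one character-level N-gram from a non-empty annotated text.

    max_n: hypothetical upper bound on the N-gram length.
    rng:   random source; pass random.Random(seed) for reproducible draws.
    """
    n = rng.randint(1, min(max_n, len(annotated_text)))   # N-gram length
    start = rng.randint(0, len(annotated_text) - n)       # start position
    return annotated_text[start:start + n]


text = "凸优化理论是一门重要课程"
word = sample_training_word(text, max_n=5, rng=random.Random(0))
```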
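For each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotated text.
The training word extracted from the training annotated text can itself be used directly as the text sequence. The type of phonetic symbol sequence for each language can be set in advance, e.g. a pinyin sequence for Chinese and a British phonetic symbol sequence for English. As an example, the phonetic symbol sequence can be determined by looking the word up in an electronic dictionary. For a Chinese word, each character of the word can be looked up in a Chinese dictionary to obtain its toned pinyin, and the toned pinyin of the characters is then concatenated to form the phonetic symbol sequence of the word; alternatively, the whole word can be looked up to obtain its phonetic symbol sequence directly. For example, the phonetic symbol sequence of the training word "凸优化理论" is "tu1you1hua4li3lun4".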
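The per-character dictionary lookup can be sketched as follows. The tiny `TONED_PINYIN` table stands in for an electronic dictionary and covers only the characters of the example word; both the table and the function name are illustrative assumptions.

```python
# Hypothetical stand-in for an electronic dictionary: character -> toned pinyin.
TONED_PINYIN = {
    "凸": "tu1",
    "优": "you1",
    "化": "hua4",
    "理": "li3",
    "论": "lun4",
}


def phonetic_sequence(word, lexicon=TONED_PINYIN):
    """Look up each character and concatenate the toned pinyin.

    Raises KeyError if a character is missing from the lexicon.
    """
    return "".join(lexicon[ch] for ch in word)


sequence = phonetic_sequence("凸优化理论")
```

A real lexicon would also need a policy for characters with several readings, which is why looking up the whole word, as the text also suggests, can be preferable.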
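For each training word, replacing the text other than the training word in the training annotated text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.
For example, given the training annotated text "凸优化理论是一门重要课程" ("convex optimization theory is an important course") from which the training word "凸优化理论" is extracted, the training label is determined by replacing the text other than the training word, here "是一门重要课程", with the preset label, giving the training label "凸优化理论*******". The preset label has no meaning of its own; it indicates that the text at that position has no corresponding prior context knowledge.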
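Label generation can be sketched as below. Following the example label, the sketch writes one placeholder symbol per replaced character and uses `*` as that symbol; the function name, the choice of symbol, and the handling of only the first occurrence of the training word are assumptions made for illustration.

```python
def make_training_label(annotated_text, training_word, placeholder="*"):
    """Keep the training word and replace every other character with a placeholder."""
    start = annotated_text.find(training_word)   # first occurrence only
    if start < 0:
        raise ValueError("training word not found in annotated text")
    end = start + len(training_word)
    return (
        placeholder * start
        + training_word
        + placeholder * (len(annotated_text) - end)
    )


label = make_training_label("凸优化理论是一门重要课程", "凸优化理论")
```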
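Thus, with the above technical solution, training samples can be processed automatically to obtain the training data for the context recognition sub-model. When this training data is obtained, both the text features and the pronunciation features of the training words are extracted, so that each word is identified more precisely and words with similar spelling or pronunciation can be distinguished. This provides more comprehensive and reliable feature information for training the context recognition sub-model and improves, to some extent, the accuracy of the trained sub-model.
In a possible embodiment, the speech recognition model can be trained as follows:
once the speech recognition sub-model has been trained, determining, from the training samples, the phonetic symbol sequences, text sequences and training labels of the training words in the training samples.
In the embodiments of the present disclosure, the speech recognition sub-model and the context recognition sub-model can be trained separately. The speech recognition sub-model can be trained with the training speech data as input and the corresponding training annotated text as the target output, using training methods commonly used in the field, which are not described further here.
Once the speech recognition sub-model has been trained, the context recognition sub-model can be trained on top of it. The context recognition sub-model is used to determine the prior context knowledge corresponding to hot words. Therefore, in the embodiments of the present disclosure, training words can be taken directly from the training samples and training can be based on the prior context knowledge of these training words, which ensures the diversity of the training words and thus improves the stability and generalization of the model. The phonetic symbol sequence of a training word represents its pronunciation: when the training sample is Chinese, the phonetic symbol sequence contains toned pinyin; when the training sample is English, it contains British or American phonetic symbols. The text sequence of a training word can be the recognized text corresponding to that training word. The training label of a training word represents the target output corresponding to that training word.
After that, the phonetic symbol sequence of the training word can be encoded with the pronunciation feature encoder to obtain the pronunciation feature vector of the training word, and the text sequence of the training word can be encoded with the text feature encoder to obtain the text feature vector of the training word;
the character acoustic vector of each predicted character corresponding to the training speech data in the training sample is obtained with the speech recognition sub-model;
the context feature vector of each predicted character is obtained according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector;
the context feature vector is decoded by the context feature decoder to obtain the probability distribution of each predicted character, and the output text of the context recognition sub-model is then obtained from this probability distribution.
These steps are implemented in a manner similar to the processing of hot words and speech data to be recognized described above, and are not described again here.
After that, the target loss of the context recognition sub-model can be determined according to the output text and the training label corresponding to the training word.
For instance, the cross-entropy loss between the output text and the training label can be computed and used as the target loss.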
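A cross-entropy computation over per-character distributions can be sketched as follows. The sketch assumes the loss is computed from the decoder's probability distribution for each predicted character against the index of the corresponding training-label symbol and is averaged over characters; the function name, the averaging and the small `eps` guard are illustrative choices not stated in the disclosure.

```python
import math


def cross_entropy(prob_rows, label_ids, eps=1e-12):
    """Average negative log-probability assigned to the label symbols.

    prob_rows: one probability list per predicted character.
    label_ids: index of the training-label symbol for each character.
    eps:       small constant guarding against log(0).
    """
    total = 0.0
    for row, y in zip(prob_rows, label_ids):
        total -= math.log(row[y] + eps)
    return total / len(label_ids)


perfect = cross_entropy([[1.0, 0.0]], [0])
uniform = cross_entropy([[0.5, 0.5]], [1])
```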
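When an update condition is satisfied, the model parameters of the context recognition sub-model are updated according to the target loss.
As an example, the update condition may be that the target loss is greater than a preset loss threshold, which indicates that the recognition accuracy of the context recognition sub-model is insufficient. As another example, the update condition may be that the number of iterations is less than a preset threshold, in which case the context recognition sub-model is considered to have been trained for too few iterations and its recognition accuracy is insufficient. Accordingly, when the update condition is satisfied, the model parameters of the context recognition sub-model can be updated according to the target loss. The parameters can be updated from the determined loss with any update method commonly used in the field, such as gradient descent, which is not described further here.
When the update condition is not satisfied, the recognition accuracy of the context recognition sub-model is considered to meet the training requirement. Training can then be stopped to obtain the trained context recognition sub-model, and thereby the trained speech recognition model.
It should be noted that the model parameters of the trained speech recognition sub-model are kept unchanged while the model parameters of the context recognition sub-model are being updated. Thus, with the above technical solution, the context recognition sub-model can be added on top of the trained speech recognition sub-model to form the speech recognition model, which improves the scalability and application range of the training method; at the same time, while preserving the accuracy of the speech recognition sub-model, more accurate prior context knowledge is provided to improve the recognition accuracy of the speech recognition model.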
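The control flow of this training stage can be sketched as follows. The text presents the loss threshold and the iteration limit as alternative update conditions; the sketch requires both to hold, which is an assumption. `compute_loss` and `apply_update` are hypothetical callbacks: in a real system `apply_update` would perform a gradient step on the context recognition sub-model's parameters only, leaving the speech recognition sub-model frozen.

```python
def train_context_submodel(compute_loss, apply_update, loss_threshold, max_iters):
    """Update while the loss is above the threshold and iterations remain.

    compute_loss: callable returning the current target loss.
    apply_update: callable taking the loss and updating ONLY the context
        recognition sub-model's parameters (the speech recognition
        sub-model stays frozen).
    Returns the number of updates performed.
    """
    iters = 0
    while iters < max_iters:
        loss = compute_loss()
        if loss <= loss_threshold:
            break                    # update condition no longer satisfied
        apply_update(loss)
        iters += 1
    return iters


# Toy stand-in: the "loss" is a number that each update halves.
state = {"loss": 8.0}
updates = train_context_submodel(
    compute_loss=lambda: state["loss"],
    apply_update=lambda loss: state.update(loss=loss / 2.0),
    loss_threshold=1.0,
    max_iters=100,
)
```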
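The present disclosure also provides a speech recognition device. As shown in FIG. 4, the device 40 includes:
a receiving module 41, configured to receive speech data to be recognized;
a processing module 42, configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
Optionally, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
the processing module includes:
an encoding submodule, configured to encode the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and to encode the text sequence of the hot word with the text feature encoder to obtain the text feature vector of the hot word;
a first processing submodule, configured to obtain, according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized;
a second processing submodule, configured to obtain the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector;
a first decoding submodule, configured to obtain the context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
a first determining submodule, configured to determine the target text corresponding to the speech data to be recognized according to the text probability distribution and the context probability distribution.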
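Optionally, the second processing submodule includes:
a second determining submodule, configured to determine, for each hot word, the fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
a third determining submodule, configured to determine, for each predicted character and in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.
Optionally, the third determining submodule includes:
a fourth determining submodule, configured to determine the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as the initial weight corresponding to that hot word;
a third processing submodule, configured to normalize the initial weight corresponding to each hot word to obtain the target weight corresponding to each hot word;
a calculation submodule, configured to compute a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.
Optionally, the third determining submodule further includes:
an updating submodule, configured to update to zero the target weights ranked after the M-th position when the target weights are sorted in descending order, M being a positive integer;
the calculation submodule is configured to:
compute a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.
Optionally, the first decoding submodule includes:
a fourth processing submodule, configured to obtain, for each predicted character, the target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
a second decoding submodule, configured to decode the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
Optionally, the phonetic symbol sequence, text sequence and training label of the training word are determined as follows:
determining the training word from the training annotated text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotated text;
for each training word, replacing the text other than the training word in the training annotated text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.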
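Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 5, the electronic device 600 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 601, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices can be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 607 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; a storage device 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609. The communication device 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. Although FIG. 5 shows the electronic device 600 with various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602. When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.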
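It should be noted that the above-mentioned computer-readable medium of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks ("LANs"), wide area networks ("WANs"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or it may exist independently without being incorporated into the electronic device.
The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: receive speech data to be recognized; and obtain, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.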
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网 (LAN)或广域网(WAN)——连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
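The flowcharts and block diagrams in the drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.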
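The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases the name of a module does not limit the module itself; for example, the receiving module may also be described as "a module that receives speech data to be recognized".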
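The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.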
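In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.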
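According to one or more embodiments of the present disclosure, Example 1 provides a speech recognition method, the method including:
receiving speech data to be recognized;
obtaining target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.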
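By way of non-limiting illustration, the hot word information of Example 1 can be pictured as a list of records, each pairing a text sequence with a phonetic symbol sequence. The sketch below is only one possible representation; the names `HotWord` and `build_hot_word_info`, the choice of units (characters, pinyin syllables, phonemes), and the sample entries are assumptions made for the illustration and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple


@dataclass
class HotWord:
    """One hot word: its text sequence and its phonetic symbol sequence."""
    text_sequence: List[str]      # assumed units: characters or words
    phonetic_sequence: List[str]  # assumed units: pinyin syllables or phonemes


def build_hot_word_info(
    entries: Iterable[Tuple[Sequence[str], Sequence[str]]]
) -> List[HotWord]:
    """Build hot word information from (text units, phonetic units) pairs."""
    info = []
    for text_units, phonetic_units in entries:
        # Both sequences are required, since the hot word information
        # contains a text sequence and a phonetic symbol sequence per hot word.
        if not text_units or not phonetic_units:
            raise ValueError("each hot word needs a text and a phonetic sequence")
        info.append(HotWord(list(text_units), list(phonetic_units)))
    return info


# Hypothetical entries: one Chinese hot word with pinyin, one English hot word
# with an illustrative phoneme sequence.
hot_word_info = build_hot_word_info([
    (["语", "音"], ["yu3", "yin1"]),
    (["patent"], ["P", "AE", "T", "AH", "N", "T"]),
])
```

Keeping the two sequences side by side reflects that both are supplied to the speech recognition model together with the speech data to be recognized.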
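According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, wherein the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module, and a context feature decoder;
the obtaining target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model includes:
encoding the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain a pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain a text feature vector of the hot word;
obtaining, according to the speech recognition sub-model and the speech data to be recognized, a character acoustic vector and a text probability distribution of each predicted character corresponding to the speech data to be recognized;
obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
determining the target text corresponding to the speech data to be recognized according to the text probability distribution and the context probability distribution.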
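The last step of Example 2 combines two per-character distributions. The passage above does not say how they are combined, so the following minimal sketch assumes a simple linear interpolation over a shared vocabulary; the function names and the `context_weight` parameter are hypothetical and serve only to make the data flow concrete.

```python
def combine_distributions(text_probs, context_probs, context_weight=0.3):
    """Interpolate two distributions over the same vocabulary.

    Assumption for illustration: a weighted sum with weight `context_weight`
    on the context probability distribution.
    """
    if text_probs.keys() != context_probs.keys():
        raise ValueError("both distributions must cover the same vocabulary")
    combined = {
        token: (1.0 - context_weight) * text_probs[token]
        + context_weight * context_probs[token]
        for token in text_probs
    }
    best = max(combined, key=combined.get)
    return best, combined


def decode_target_text(text_dists, context_dists, context_weight=0.3):
    """Pick the best token for each predicted character and join the results."""
    chars = []
    for text_probs, context_probs in zip(text_dists, context_dists):
        best, _ = combine_distributions(text_probs, context_probs, context_weight)
        chars.append(best)
    return "".join(chars)
```

With `context_weight=0.0` the result reduces to the output of the speech recognition sub-model alone, which makes the contribution of the context recognition sub-model easy to inspect.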
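According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, wherein the obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector includes:
for each hot word, determining a fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.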
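As a rough illustration of the first step of Example 3, the sketch below fuses the two per-hot-word vectors by element-wise addition. This fusion rule is an assumption chosen so that the fused vector keeps the same dimension as its inputs; the passage above does not specify the fusion operation, and the names `fuse_features` and `fuse_all` are hypothetical.

```python
def fuse_features(pronunciation_vec, text_vec):
    """Fuse a pronunciation feature vector and a text feature vector.

    Assumption for illustration: element-wise addition of equal-length vectors.
    """
    if len(pronunciation_vec) != len(text_vec):
        raise ValueError("vectors must have the same dimension")
    return [p + t for p, t in zip(pronunciation_vec, text_vec)]


def fuse_all(hot_word_features):
    """Map each hot word to its fused feature vector.

    `hot_word_features` maps a hot word to a pair
    (pronunciation feature vector, text feature vector).
    """
    return {
        word: fuse_features(pron_vec, text_vec)
        for word, (pron_vec, text_vec) in hot_word_features.items()
    }
```

The fused feature vector then carries both how the hot word sounds and how it is written, while the text feature vector is kept separately for use in the attention module.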
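According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 3, wherein the determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word includes:
determining the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as an initial weight corresponding to the hot word;
normalizing the initial weight corresponding to each hot word to obtain a target weight corresponding to each hot word;
computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.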
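The three steps of Example 4 can be sketched in a few lines. The example below assumes that the normalization is a softmax, which is a common choice but is not stated in the passage above; the function names are hypothetical, and plain Python lists stand in for whatever tensor type an actual implementation would use.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax (assumed form of the normalization)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_feature_vector(char_acoustic_vec, fused_vecs, text_vecs):
    """Return (context feature vector, target weights) for one predicted character."""
    # Step 1: initial weight = dot product with each hot word's fused vector.
    initial_weights = [dot(char_acoustic_vec, fused) for fused in fused_vecs]
    # Step 2: normalize the initial weights into target weights.
    target_weights = softmax(initial_weights)
    # Step 3: weighted sum of the hot words' text feature vectors.
    dim = len(text_vecs[0])
    context = [
        sum(w * vec[i] for w, vec in zip(target_weights, text_vecs))
        for i in range(dim)
    ]
    return context, target_weights
```

Note that the weights are computed against the fused feature vectors, whereas the weighted sum is taken over the text feature vectors, as described in the steps above.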
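According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 4, wherein the determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word further includes:
sorting the target weights in descending order and setting the target weights ranked after the M-th to zero, M being a positive integer;
the computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector includes:
computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.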
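The weight update of Example 5 amounts to keeping the M largest target weights and zeroing the rest. A minimal sketch follows; the function name `keep_top_m` is hypothetical, and the weights are deliberately not renormalized afterwards because the passage above does not call for it.

```python
def keep_top_m(target_weights, m):
    """Set to zero every target weight ranked after the M-th in descending order."""
    if m <= 0:
        raise ValueError("M must be a positive integer")
    # Indices of the weights sorted from largest to smallest.
    ranked = sorted(
        range(len(target_weights)),
        key=lambda i: target_weights[i],
        reverse=True,
    )
    kept = set(ranked[:m])
    return [w if i in kept else 0.0 for i, w in enumerate(target_weights)]
```

Zeroing the low-ranked weights limits the weighted sum to the hot words most relevant to the current predicted character, so that a long hot word list contributes less noise to the context feature vector.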
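According to one or more embodiments of the present disclosure, Example 6 provides the method of Example 2, wherein the obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes:
for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.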
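To make the two steps of Example 6 concrete, the sketch below assumes that the target feature vector is the concatenation of the two input vectors and that the context feature decoder is a single linear layer followed by a softmax. Both choices, along with the names `target_feature_vector`, `decode_context`, `weight_rows`, and `bias`, are assumptions for illustration only; the passage above does not specify either operation.

```python
import math


def target_feature_vector(char_acoustic_vec, context_vec):
    """Assumed combination: concatenate the two vectors."""
    return list(char_acoustic_vec) + list(context_vec)


def decode_context(target_vec, weight_rows, bias):
    """Hypothetical decoder: one linear layer plus softmax.

    `weight_rows` holds one weight row per output token and `bias` one
    bias term per output token.
    """
    logits = [
        sum(w * x for w, x in zip(row, target_vec)) + b
        for row, b in zip(weight_rows, bias)
    ]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

In use, the target feature vector of each predicted character would be passed through the decoder once, yielding one context probability distribution per character.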
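According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1 to 6, wherein the phonetic symbol sequence, text sequence, and training label of the training word are determined as follows:
determining the training words from the training annotation text of each training sample;
for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotation text;
for each training word, replacing the text other than the training word in the training annotation text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.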
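The label generation step of Example 7 can be illustrated as follows. The sketch assumes that the training annotation text has already been split into tokens and that replacement is done token by token; the preset label string `<no-bias>` and the function name `make_training_label` are hypothetical placeholders, not values taken from the disclosure.

```python
def make_training_label(annotation_tokens, word_tokens, preset_label="<no-bias>"):
    """Replace every token outside the training word with the preset label."""
    n, k = len(annotation_tokens), len(word_tokens)
    if k == 0:
        raise ValueError("the training word must not be empty")
    keep = [False] * n
    # Mark every position covered by an occurrence of the training word.
    for start in range(n - k + 1):
        if annotation_tokens[start:start + k] == word_tokens:
            for i in range(start, start + k):
                keep[i] = True
    return [
        token if kept else preset_label
        for token, kept in zip(annotation_tokens, keep)
    ]
```

For instance, with the annotation tokens `["play", "songs", "by", "jay", "chou"]` and the training word `["jay", "chou"]`, the first three tokens become the preset label while the training word itself is kept, so the context recognition sub-model is supervised only on the span that belongs to the training word.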
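According to one or more embodiments of the present disclosure, Example 8 provides a speech recognition apparatus, the apparatus including:
a receiving module configured to receive speech data to be recognized;
a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.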
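According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium on which a computer program is stored, the program, when executed by a processing apparatus, implementing the steps of the method of any one of Examples 1 to 7.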
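According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including:
a storage apparatus on which a computer program is stored;
a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the method of any one of Examples 1 to 7.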
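The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.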
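In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.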
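Although the subject matter has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims. As for the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and will not be elaborated here.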

Claims (10)

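  1. A speech recognition method, characterized in that the method comprises:
    receiving speech data to be recognized;
    obtaining target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.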
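  2. The method according to claim 1, characterized in that the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module, and a context feature decoder;
    the obtaining target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model comprises:
    encoding the phonetic symbol sequence of the hot word with the pronunciation feature encoder to obtain a pronunciation feature vector of the hot word, and encoding the text sequence of the hot word with the text feature encoder to obtain a text feature vector of the hot word;
    obtaining, according to the speech recognition sub-model and the speech data to be recognized, a character acoustic vector and a text probability distribution of each predicted character corresponding to the speech data to be recognized;
    obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector;
    obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector;
    determining the target text corresponding to the speech data to be recognized according to the text probability distribution and the context probability distribution.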
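  3. The method according to claim 2, characterized in that the obtaining a context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector, and the character acoustic vector comprises:
    for each hot word, determining a fused feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word;
    for each predicted character, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word.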
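  4. The method according to claim 3, characterized in that the determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word comprises:
    determining the dot product of the character acoustic vector and the fused feature vector corresponding to each hot word as an initial weight corresponding to the hot word;
    normalizing the initial weight corresponding to each hot word to obtain a target weight corresponding to each hot word;
    computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector.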
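  5. The method according to claim 4, characterized in that the determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fused feature vector and text feature vector corresponding to each hot word further comprises:
    sorting the target weights in descending order and setting the target weights ranked after the M-th to zero, M being a positive integer;
    the computing a weighted sum of the text feature vectors of the hot words according to the target weight corresponding to each hot word to obtain the context feature vector comprises:
    computing a weighted sum of the text feature vectors of the hot words according to the updated target weight corresponding to each hot word to obtain the context feature vector.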
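  6. The method according to claim 2, characterized in that the obtaining a context probability distribution of each predicted character according to the context feature decoder and the context feature vector comprises:
    for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
    decoding the target feature vector with the context feature decoder to obtain the context probability distribution corresponding to each predicted character.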
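  7. The method according to any one of claims 1 to 6, characterized in that the phonetic symbol sequence, text sequence, and training label of the training word are determined as follows:
    determining the training words from the training annotation text of each training sample;
    for each training word, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training annotation text;
    for each training word, replacing the text other than the training word in the training annotation text corresponding to the training word with a preset label, so as to generate the training label corresponding to the training word.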
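  8. A speech recognition apparatus, characterized in that the apparatus comprises:
    a receiving module configured to receive speech data to be recognized;
    a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information, and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences, and training labels of the training words.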
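  9. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1 to 7.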
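  10. An electronic device, characterized by comprising:
    a storage apparatus on which a computer program is stored;
    a processing apparatus configured to execute the computer program in the storage apparatus to implement the steps of the method according to any one of claims 1 to 7.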
PCT/CN2022/089595 2021-06-30 2022-04-27 Speech recognition method, apparatus, medium and device WO2023273578A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110735672.7A CN113470619B (zh) 2021-06-30 2021-06-30 Speech recognition method, apparatus, medium and device
CN202110735672.7 2021-06-30

Publications (1)

Publication Number Publication Date
WO2023273578A1 (zh) 2023-01-05

Family

ID=77876448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089595 WO2023273578A1 (zh) 2021-06-30 2022-04-27 Speech recognition method, apparatus, medium and device

Country Status (2)

Country Link
CN (1) CN113470619B (zh)
WO (1) WO2023273578A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
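CN116705058A (zh) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method for multimodal speech tasks, electronic device and readable storage medium
CN117437909A (zh) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Method for constructing a speech recognition model based on a hot word feature vector self-attention mechanism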

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
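CN113470619B (zh) * 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, apparatus, medium and device
CN114036959A (zh) * 2021-11-25 2022-02-11 北京房江湖科技有限公司 Method and apparatus for determining conversation context, computer program product and storage medium
CN115713939B (zh) * 2023-01-06 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Speech recognition method and apparatus, and electronic device
CN117116264A (zh) * 2023-02-20 2023-11-24 荣耀终端有限公司 Speech recognition method, electronic device and medium
CN116110378B (zh) * 2023-04-12 2023-07-18 中国科学院自动化研究所 Model training method, speech recognition method, apparatus and electronic device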

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
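KR20160023424A (ko) * 2014-08-22 2016-03-03 현대자동차주식회사 Speech recognition apparatus, vehicle including the same, and method for controlling the vehicle
CN110706690A (zh) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and apparatus thereof
CN111583909A (zh) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 Speech recognition method, apparatus, device and storage medium
CN111816165A (zh) * 2020-07-07 2020-10-23 北京声智科技有限公司 Speech recognition method and apparatus, and electronic device
CN111933129A (zh) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method, apparatus and computer device
CN112489646A (zh) * 2020-11-18 2021-03-12 北京华宇信息技术有限公司 Speech recognition method and apparatus thereof
CN113470619A (zh) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Speech recognition method, apparatus, medium and device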

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202465B2 (en) * 2011-03-25 2015-12-01 General Motors Llc Speech recognition dependent on text message content
US9305554B2 (en) * 2013-07-17 2016-04-05 Samsung Electronics Co., Ltd. Multi-level speech recognition
CN105719649B (zh) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
IL252071A0 (en) * 2017-05-03 2017-07-31 Google Inc Contextual language translation
CN110689881B (zh) * 2018-06-20 2022-07-12 深圳市北科瑞声科技股份有限公司 Speech recognition method, apparatus, computer device and storage medium
WO2020005202A1 (en) * 2018-06-25 2020-01-02 Google Llc Hotword-aware speech synthesis
CN109815322B (zh) * 2018-12-27 2021-03-12 东软集团股份有限公司 Response method, apparatus, storage medium and electronic device
CN110517692A (zh) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word speech recognition method and apparatus
CN110544477A (zh) * 2019-09-29 2019-12-06 北京声智科技有限公司 Speech recognition method, apparatus, device and medium
CN110879839A (zh) * 2019-11-27 2020-03-13 北京声智科技有限公司 Hot word recognition method, apparatus and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
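CN116705058A (zh) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method for multimodal speech tasks, electronic device and readable storage medium
CN116705058B (zh) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method for multimodal speech tasks, electronic device and readable storage medium
CN117437909A (zh) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Method for constructing a speech recognition model based on a hot word feature vector self-attention mechanism
CN117437909B (zh) * 2023-12-20 2024-03-05 慧言科技(天津)有限公司 Method for constructing a speech recognition model based on a hot word feature vector self-attention mechanism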

Also Published As

Publication number Publication date
CN113470619B (zh) 2023-08-18
CN113470619A (zh) 2021-10-01

Similar Documents

Publication Publication Date Title
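WO2023273578A1 (zh) Speech recognition method, apparatus, medium and device
JP7112536B2 (ja) Method and apparatus for mining entity focus points in text, electronic device, computer-readable storage medium and computer program
CN111583903B (zh) Speech synthesis method, vocoder training method, apparatus, medium and electronic device
CN111368559A (zh) Speech translation method, apparatus, electronic device and storage medium
WO2023273611A1 (zh) Speech recognition model training method, speech recognition method, apparatus, medium and device
WO2023273612A1 (zh) Speech recognition model training method, speech recognition method, apparatus, medium and device
CN112509562B (zh) Method, apparatus, electronic device and medium for text post-processing
WO2022247562A1 (zh) Multimodal data retrieval method, apparatus, medium and electronic device
WO2020182123A1 (zh) Method and apparatus for pushing sentences
CN112883968B (zh) Image character recognition method, apparatus, medium and electronic device
CN112883967B (zh) Image character recognition method, apparatus, medium and electronic device
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
WO2023273610A1 (zh) Speech recognition method, apparatus, medium and electronic device
WO2023143016A1 (zh) Feature extraction model generation method, image feature extraction method and apparatus
WO2023005763A1 (zh) Information processing method, apparatus and electronic device
CN111368560A (zh) Text translation method, apparatus, electronic device and storage medium
CN115270717A (zh) Stance detection method, apparatus, device and medium
CN114765025A (zh) Speech recognition model generation method, recognition method, apparatus, medium and device
CN111090993A (zh) Attribute alignment model training method and apparatus
CN113140012B (zh) Image processing method, apparatus, medium and electronic device
CN113033707A (zh) Video classification method, apparatus, readable medium and electronic device
CN112069786A (zh) Text information processing method, apparatus, electronic device and medium
CN114625876B (zh) Author feature model generation method, author information processing method and apparatus
CN116244431A (zh) Text classification method, apparatus, medium and electronic device
CN116821327A (zh) Text data processing method, apparatus, device, readable storage medium and product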

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 22831406; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)