WO2023273578A1 - Speech recognition method and apparatus, medium and device - Google Patents

Speech recognition method and apparatus, medium and device

Info

Publication number
WO2023273578A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
training
feature vector
words
context
Prior art date
Application number
PCT/CN2022/089595
Other languages
English (en)
Chinese (zh)
Inventor
董林昊
韩明伦
马泽君
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023273578A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a speech recognition method, apparatus, medium and device.
  • In a first aspect, the present disclosure provides a speech recognition method, the method comprising:
  • receiving speech data to be recognized; and obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • In some embodiments, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • according to the speech recognition sub-model and the speech data to be recognized, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained;
  • and the target text corresponding to the speech data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • In some embodiments, obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector includes:
  • for each of the predicted characters, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words.
  • In some embodiments, determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words includes:
  • determining the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words as the initial weight corresponding to the hot word;
  • normalizing the initial weight corresponding to each of the hot words to obtain the target weight corresponding to each of the hot words;
  • and performing a weighted calculation on the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • In some embodiments, determining the context feature vector corresponding to the predicted character further includes: updating to zero the target weights ranked after the M-th position when the target weights are sorted from large to small, M being a positive integer;
  • and in that case, performing the weighted calculation on the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector includes:
  • performing the weighted calculation on the text feature vectors of the hot words according to the updated target weight corresponding to each of the hot words to obtain the context feature vector.
  • In some embodiments, obtaining the context probability distribution of each predicted character according to the context feature decoder and the context feature vector includes: for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
  • and decoding the target feature vector according to the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
  • In some embodiments, the phonetic symbol sequence, text sequence and training label of each training word are determined in the following manner:
  • for each of the training words, the phonetic symbol sequence of the training word is determined according to the language of the training word, and the text sequence is determined from the training labeled text;
  • and text other than the training word in the training labeled text corresponding to the training word is replaced with preset labels, so as to generate the training label corresponding to the training word.
  • In another aspect, the present disclosure provides a speech recognition apparatus, the apparatus comprising:
  • a receiving module configured to receive speech data to be recognized;
  • and a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • In another aspect, the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in the first aspect are implemented.
  • In another aspect, the present disclosure provides an electronic device, including: a storage device on which a computer program is stored;
  • and a processing device configured to execute the computer program in the storage device to implement the steps of any one of the methods in the first aspect.
  • FIG. 1 is a flow chart of a speech recognition method provided according to an embodiment of the present disclosure
  • FIG. 2 is a schematic structural diagram of a speech recognition model provided according to an embodiment of the present disclosure
  • FIG. 4 is a block diagram of a speech recognition apparatus provided according to an embodiment of the present disclosure.
  • The term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • The term “based on” means “based at least in part on”.
  • The term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • In step 11, speech data to be recognized is received.
  • In step 12, the target text corresponding to the speech data to be recognized is obtained according to the speech data to be recognized, the hot word information and the speech recognition model.
  • The hot word information includes text sequences and phonetic symbol sequences corresponding to multiple hot words.
  • The hot words may be words corresponding to a specific application context, so that the hot word information provides prior contextual knowledge for the recognition of the speech data to be recognized.
  • The phonetic symbol sequence indicates the pronunciation of a hot word. If the hot word is Chinese, its phonetic symbol sequence contains toned pinyin; if the hot word is English, its phonetic symbol sequence contains British or American phonetic symbols, and so on.
  • The text sequence of a hot word may be the hot word text itself.
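  • For illustration only, a minimal sketch of how such hot word information might be represented (the field names are assumptions; the disclosure does not prescribe a storage format):

```python
# Each hot word carries a text sequence and a phonetic symbol sequence:
# toned pinyin for Chinese, British/American phonetic symbols for English.
hot_words = [
    {"text": "张治国", "phonetic": "zhang1zhi4guo2"},  # Chinese hot word, toned pinyin
    {"text": "convex", "phonetic": "ˈkɒnvɛks"},        # English hot word, phonetic symbols
]
```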
  • The context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words. In its training process, the pronunciation features and text features corresponding to the training words can therefore be combined, so that the features available to the context recognition sub-model are more comprehensive and rich, providing more complete data support for subsequent hot word determination and improving its accuracy.
  • For example, suppose the hot word information contains two hot words that are written differently but whose romanizations are both “Zhang Zhiguo”, and the true text corresponding to the speech data to be recognized is “Zhang Zhiguo said that she is getting married recently”, containing the first of them.
  • If hot words with similar pronunciations are distinguished by their text alone, speech recognition may output the sentence with the wrong “Zhang Zhiguo”, that is, the hot words are confused.
  • Since the phonetic symbol sequence of the first hot word is “zhang1zhi4guo2” while that of the second is “zhang1zhi1guo3”, hot words with similar pronunciations can be told apart, the correct recognition result “Zhang Zhiguo said that she is getting married recently” can be obtained, and the accuracy of speech recognition is improved.
  • The speech recognition model used to recognize the speech data to be recognized may include a speech recognition sub-model and a context recognition sub-model; during recognition, speech recognition can be performed based on the speech recognition sub-model.
  • The context recognition sub-model can then be combined with it to improve the accuracy of hot word recognition in the speech data to be recognized, thereby improving the accuracy of speech recognition.
  • The context recognition sub-model is trained in combination with the pronunciation features and text features of the training data, and based on the pronunciation features, hot words with similar spellings or pronunciations can be accurately distinguished. Therefore, when identifying hot words, these multiple features can be combined to recognize accurately from among multiple hot words, avoiding confusion between hot words with similar spellings or pronunciations, further improving the accuracy of speech recognition and the user experience.
  • As shown in FIG. 2, the speech recognition model 10 may include a speech recognition sub-model 100 and a context recognition sub-model 200, and the context recognition sub-model 200 includes a pronunciation feature encoder 201, a text feature encoder 202, an attention module 203 and a context feature decoder 204.
  • An exemplary implementation of obtaining the target text corresponding to the speech data to be recognized is shown in FIG. 3 and may include the following steps.
  • In step 31, the phonetic symbol sequence of each hot word is encoded by the pronunciation feature encoder to obtain the pronunciation feature vector of the hot word, and the text sequence of the hot word is encoded by the text feature encoder to obtain the text feature vector of the hot word.
  • In step 32, the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained according to the speech recognition sub-model and the speech data to be recognized.
  • The speech recognition sub-model may further include an encoder 101, a prediction sub-model 102 and a decoder 103, where the prediction sub-model may be a CIF (Continuous Integrate-and-Fire) model.
  • The speech data can be divided into multiple audio frames per second, so that data processing is performed on a per-frame basis.
  • For example, each second of speech data can be divided into 100 audio frames, i.e., one frame every 10 ms.
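  • As an illustration, such framing can be sketched as follows (the 16 kHz sampling rate and non-overlapping frames are assumptions; the disclosure only specifies 100 frames per second):

```python
import numpy as np

def frame_signal(samples: np.ndarray, sr: int = 16000, hop_ms: int = 10) -> np.ndarray:
    """Split a waveform into non-overlapping 10 ms frames (100 frames per second)."""
    hop = sr * hop_ms // 1000                  # samples per frame, e.g. 160 at 16 kHz
    n = len(samples) // hop
    return samples[: n * hop].reshape(n, hop)  # shape: (num_frames, hop)
```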
  • After the speech data to be recognized is encoded by the encoder, the resulting acoustic vector sequence H can be expressed as:
  • H = {H1, H2, ..., HU},
  • where U denotes the number of audio frames in the input speech data to be recognized, that is, the length of the acoustic vector sequence.
  • The acoustic vectors can then be input into the prediction sub-model, which predicts the amount of information carried by each acoustic vector to obtain the amount of information corresponding to each audio frame.
  • The acoustic vectors of the audio frames can then be combined according to the amounts of information of the multiple audio frames to obtain character acoustic vectors.
  • It is assumed that each predicted character corresponds to the same amount of information, so the amounts of information corresponding to the audio frames can be accumulated from left to right.
  • When the accumulated amount reaches a preset threshold, it is considered that the amount of information at this point corresponds to one predicted character.
  • The audio frames whose information amounts were accumulated form that predicted character, so one predicted character corresponds to one or more audio frames.
  • The preset threshold can be set according to the actual application scenario and experience; for example, it can be set to 1, which is not limited in the present disclosure.
  • For example, the acoustic vectors of the audio frames can be combined according to the amounts of information of multiple audio frames in the following manner.
  • Suppose the accumulated information of the first and second audio frames crosses the threshold within the second frame; the information amount of the second audio frame can then be divided into two parts, one part belonging to the current predicted character and the remaining part belonging to the next predicted character.
  • The remaining part is accumulated together with the information amount W3 of the third audio frame until the preset threshold β is reached, and the audio frames corresponding to the next predicted character are obtained.
  • The information amounts of subsequent audio frames are treated by analogy and combined in the above manner to obtain each predicted character corresponding to the multiple audio frames.
  • After the audio frames corresponding to a predicted character are determined, the weighted sum of the acoustic vectors of those audio frames can be determined as the character acoustic vector of the predicted character.
  • The weight of each audio frame's acoustic vector is the information amount that the frame contributes to the predicted character: if the audio frame belongs entirely to the predicted character, the weight is the frame's full information amount; if only part of the audio frame belongs to the predicted character, the weight is the information amount of that part.
  • For example, if the first predicted character corresponds to the first audio frame and part of the second, its character acoustic vector C1 can be expressed as: C1 = W1·H1 + W21·H2, where W21 is the portion of the second frame's information amount belonging to the first predicted character.
  • Likewise, the character acoustic vector C2 corresponding to the next predicted character can be expressed as: C2 = W22·H2 + W3·H3 + ..., where W22 is the remaining portion of the second frame's information amount.
  • The character acoustic vector of each predicted character can then be decoded by the decoder to obtain the text probability distribution of the predicted character.
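  • To make the integrate-and-fire accumulation concrete, here is a minimal numpy sketch under the definitions above (H holds per-frame acoustic vectors, W the per-frame information amounts, beta the preset threshold); it is an illustration, not the disclosure's implementation:

```python
import numpy as np

def cif_aggregate(H: np.ndarray, W: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Accumulate information amounts W from left to right; each time the
    threshold beta is crossed, emit one character acoustic vector as the
    information-weighted sum of the contributing frames' acoustic vectors.
    Assumes each single frame's information amount is below beta."""
    chars, acc_vec, acc_w = [], np.zeros(H.shape[1]), 0.0
    for h, w in zip(H, W):
        if acc_w + w < beta:           # frame belongs entirely to the current character
            acc_vec, acc_w = acc_vec + w * h, acc_w + w
        else:                          # threshold crossed inside this frame: split it
            w_cur = beta - acc_w       # portion belonging to the current character
            chars.append(acc_vec + w_cur * h)            # e.g. C1 = W1*H1 + W21*H2
            acc_vec, acc_w = (w - w_cur) * h, w - w_cur  # remainder starts the next character
    return np.stack(chars) if chars else np.empty((0, H.shape[1]))
```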
  • In step 33, the context feature vector of each predicted character is obtained according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector.
  • In this step, when the context feature vector of a predicted character is determined, the pronunciation features and text features corresponding to each hot word can be considered together, which improves the richness and accuracy of the features contained in the context feature vector.
  • The pronunciation feature vector, the text feature vector and the character acoustic vector are combined in the attention module to ensure the matching between each hot word and the speech data during the attention calculation. The specific calculation is described below.
  • In step 34, the context probability distribution of each predicted character is obtained according to the context feature decoder and the context feature vector.
  • For example, the context feature vector of each predicted character can be decoded by the context feature decoder to obtain the context probability distribution of the predicted character.
  • In step 35, the target text corresponding to the speech data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • For example, for each predicted character, its text probability distribution and context probability distribution can be weighted and summed to obtain a comprehensive probability distribution corresponding to the predicted character. Based on the comprehensive probability distribution, the recognized character corresponding to each predicted character can then be determined through a greedy search or beam search algorithm, so as to obtain the target text.
  • The above search algorithms are common methods in the art and are not repeated here.
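  • A minimal sketch of this fusion followed by greedy decoding (the interpolation weight lam is an assumed hyperparameter; the disclosure only states that the two distributions are weighted and summed):

```python
import numpy as np

def greedy_decode(text_probs: np.ndarray, context_probs: np.ndarray,
                  vocab: list, lam: float = 0.5) -> str:
    """text_probs, context_probs: (num_chars, vocab_size) probability rows.
    Weighted-sum them into a comprehensive distribution, then pick the
    highest-probability character at each position (greedy search)."""
    combined = (1.0 - lam) * text_probs + lam * context_probs
    return "".join(vocab[i] for i in combined.argmax(axis=-1))
```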
  • The following is an example of obtaining the context feature vector of each predicted character according to the attention module, the pronunciation feature vector, the text feature vector and the character acoustic vector.
  • This step may include the following.
  • For each hot word, the fusion feature vector corresponding to the hot word is determined according to the pronunciation feature vector and the text feature vector of the hot word.
  • For example, the fusion feature vector can be obtained by concatenating the pronunciation feature vector and the text feature vector.
  • For each predicted character, the context feature vector corresponding to the predicted character is then determined in the attention module according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each hot word.
  • The attention module can use the character acoustic vector and each fusion feature vector to determine the degree of attention that the current predicted character pays to each hot word, providing data support for the subsequent recognition and judgment of hot words.
  • An exemplary implementation of determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words is as follows. This step may include the following.
  • The dot product of the character acoustic vector and the fusion feature vector corresponding to each hot word is determined as the initial weight corresponding to that hot word.
  • For example, denoting the character acoustic vector by Ci, the dot-product attention between Ci and each fusion feature vector can be computed based on multi-head attention, and the average of the resulting per-head dot-product attentions is taken as the comprehensive weight corresponding to that fusion feature vector, that is, the initial weight of the hot word corresponding to the fusion feature vector.
  • The initial weight can be used to represent the degree of attention that the character acoustic vector of the predicted character pays to each hot word.
  • The initial weight corresponding to each hot word is then normalized to obtain the target weight corresponding to each hot word.
  • For example, the initial weights Q1 to Qn corresponding to the hot words can be normalized by a softmax calculation, mapping the weights of all hot words onto a unified scale for measurement. This makes the target weights of the hot words easy to compare, so that the hot word more likely to correspond to the predicted character can be determined.
  • The text feature vectors of the hot words are then weighted according to the target weight corresponding to each hot word to obtain the context feature vector.
  • For example, the products of each hot word's target weight and its text feature vector can be accumulated to obtain the context feature vector; text feature vectors with higher target weights then correspond to a more explicit feature representation within the context feature vector.
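  • The attention calculation above can be sketched as follows (single-head for brevity, whereas the disclosure averages the per-head dot products of multi-head attention; all vectors are assumed to have been projected to a common dimension):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def hotword_context_vector(c_i: np.ndarray, fusion_vecs: np.ndarray,
                           text_vecs: np.ndarray) -> np.ndarray:
    """c_i: character acoustic vector (d,); fusion_vecs: (n, d) fusion feature
    vectors, one per hot word; text_vecs: (n, d) text feature vectors."""
    init_w = fusion_vecs @ c_i      # dot products -> initial weights Q1..Qn
    target_w = softmax(init_w)      # normalization -> target weights
    return target_w @ text_vecs     # weighted sum of text feature vectors
```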
  • On the basis of the previous embodiment, an exemplary implementation may further include the following.
  • The target weights ranked after the M-th position, when the target weights are sorted from large to small, are updated to zero, M being a positive integer.
  • M can be set according to the actual usage scenario; for example, M can be set to 20.
  • The target weight represents the degree of attention that the current predicted character pays to each hot word, so when the target weight corresponding to a hot word is small and ranked low, the predicted character is unlikely to correspond to that hot word. The target weight of such a hot word can therefore be set directly to 0, focusing the hot word judgment on the hot words with higher probability.
  • That is, the target weights are sorted in descending order, the first M target weights are retained, and the target weights ranked after the M-th are set to zero.
  • Accordingly, the weighted calculation over the text feature vectors of the hot words to obtain the context feature vector may include:
  • weighting the text feature vector of each hot word according to the updated target weight corresponding to the hot word to obtain the context feature vector.
  • In this way, hot words with little possibility can be excluded from recognition directly on the basis of the target weights, which reduces the amount of computation to a certain extent and improves the efficiency of hot-word-enhanced recognition while maintaining its accuracy.
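  • A minimal sketch of this top-M pruning of the target weights (M = 20 follows the example above):

```python
import numpy as np

def keep_top_m(target_w: np.ndarray, m: int = 20) -> np.ndarray:
    """Retain the M largest target weights and set the rest to zero, so that
    unlikely hot words drop out of the subsequent weighted sum."""
    pruned = np.zeros_like(target_w)
    top = np.argsort(target_w)[::-1][:m]   # indices of the M largest weights
    pruned[top] = target_w[top]
    return pruned
```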
  • An exemplary implementation of obtaining the context probability distribution of each predicted character is as follows. This step may include the following.
  • For each predicted character, a target feature vector of the predicted character is obtained according to the character acoustic vector of the predicted character and the context feature vector.
  • For example, the character acoustic vector of the predicted character and the context feature vector may be concatenated to obtain the target feature vector.
  • The target feature vector is then decoded by the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
  • When the context feature decoder performs decoding, it decodes a target feature vector that contains both the audio features of the input speech data and the relevant features of each hot word. This improves the match between the context probability distribution and the speech data and hot words to be recognized, provides accurate and comprehensive data support for the subsequent determination of the target text, and further increases the diversity of the features used in the speech recognition process, thereby improving the accuracy of the speech recognition result.
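  • For illustration, a minimal torch sketch of such a decoder (the two-layer structure and layer sizes are assumptions; the disclosure does not specify the decoder architecture):

```python
import torch
import torch.nn as nn

class ContextFeatureDecoder(nn.Module):
    """Consumes the concatenation of a character acoustic vector and its
    context feature vector (the target feature vector) and emits a context
    probability distribution over the vocabulary."""
    def __init__(self, acoustic_dim: int, context_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, vocab_size),
        )

    def forward(self, char_acoustic: torch.Tensor, context_vec: torch.Tensor) -> torch.Tensor:
        target = torch.cat([char_acoustic, context_vec], dim=-1)  # target feature vector
        return torch.softmax(self.net(target), dim=-1)            # context probability distribution
```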
  • In some embodiments, the phonetic symbol sequence, text sequence and training label of each training word are determined in the following manner.
  • First, training words are extracted from the training labeled text: an N-gram word can be randomly extracted from the text and used as a candidate word, and some of the candidate words can then be randomly selected as the training words.
  • For example, candidate texts can be determined from the training labeled texts of the training samples, and for each candidate text an N-gram word can be randomly extracted from the candidate text as a training word. This ensures the diversity and randomness of the training words.
  • For each training word, the phonetic symbol sequence of the training word is determined according to the language of the training word, and the text sequence is determined from the training labeled text.
  • For example, the training word itself, as extracted from the training labeled text, can be used directly as the text sequence, and the type of phonetic symbol sequence corresponding to each language can be preset, such as setting the phonetic symbol sequence corresponding to Chinese to a pinyin sequence and the phonetic symbol sequence corresponding to English to a sequence of British phonetic symbols.
  • The corresponding phonetic symbol sequence can then be determined by querying an electronic dictionary.
  • For a Chinese training word, a Chinese dictionary can be queried for each character in the word to obtain the toned pinyin of each character, and the toned pinyin of the characters can be concatenated to obtain the phonetic symbol sequence of the word. The query can also be made directly on the whole word to obtain its phonetic symbol sequence; for example, the training word “convex optimization theory” corresponds to the phonetic symbol sequence “tu1you1hua4li3lun4”.
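  • As an illustration, the per-character lookup can be sketched with the third-party pypinyin package standing in for the electronic dictionary (an assumption; the disclosure does not name a specific dictionary):

```python
from pypinyin import lazy_pinyin, Style

def phonetic_sequence(word: str) -> str:
    """Concatenate the toned pinyin of each character; Style.TONE3 renders
    the tone as a trailing digit, e.g. "tu1"."""
    return "".join(lazy_pinyin(word, style=Style.TONE3))

print(phonetic_sequence("凸优化理论"))  # -> tu1you1hua4li3lun4
```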
  • The text other than the training word in the training labeled text corresponding to the training word is replaced with preset labels, so as to generate the training label corresponding to the training word.
  • For example, suppose the training labeled text is “convex optimization theory is an important course” and the training word extracted from it is “convex optimization theory”.
  • The text other than the training word is then replaced with preset labels; that is, “is an important course” in “convex optimization theory is an important course” is replaced, yielding the training label “convex optimization theory *****”.
  • The preset label has no actual meaning; it indicates that the corresponding text has no corresponding prior contextual knowledge.
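  • A minimal sketch of this label construction (replacing each remaining token with one preset label “*” is an assumption; the disclosure does not fix the granularity of the replacement):

```python
def make_training_label(labeled_text: str, training_word: str, pad: str = "*") -> str:
    """Keep the training word and replace every other token of the training
    labeled text with the preset label."""
    tokens = labeled_text.split()
    word_tokens = training_word.split()
    out, i = [], 0
    while i < len(tokens):
        if tokens[i : i + len(word_tokens)] == word_tokens:  # training word found
            out.extend(word_tokens)
            i += len(word_tokens)
        else:
            out.append(pad)                                  # preset label
            i += 1
    return " ".join(out)

print(make_training_label("convex optimization theory is an important course",
                          "convex optimization theory"))
# -> convex optimization theory * * * *
```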
  • In this way, the training samples can be processed automatically to obtain the training data for training the context recognition sub-model, and when the training data is obtained, the text features and pronunciation features corresponding to the training words can be extracted at the same time. Each word can thus be identified more accurately and words with similar spellings or pronunciations can be distinguished, which provides more comprehensive and reliable feature information for the training of the context recognition sub-model and improves, to a certain extent, the accuracy of the resulting context recognition sub-model.
  • In some embodiments, the speech recognition model can be trained in the following manner.
  • The training speech data of the training samples and the phonetic symbol sequences, text sequences and training labels of the training words in the training samples are determined.
  • The speech recognition sub-model and the context recognition sub-model may be trained separately.
  • The speech recognition sub-model can be trained by using the training speech data as input and the corresponding training labeled text as the target output.
  • The training may be performed with training methods commonly used in this field, which are not repeated here.
  • The context recognition sub-model can then be trained on the basis of the trained speech recognition sub-model.
  • The context recognition sub-model is used to determine the prior contextual knowledge corresponding to the hot words; therefore, in the embodiments of the present disclosure, the training words can be obtained directly from the training samples, so that training is performed based on the prior contextual knowledge of the training words. This ensures the diversity of the training words and improves the stability and generalization of the model.
  • The phonetic symbol sequence of a training word represents the pronunciation of the training word. If the training sample is Chinese, the phonetic symbol sequence contains toned pinyin; if the training sample is English, it contains British or American phonetic symbols, and so on.
  • The text sequence of a training word may be the recognition text corresponding to the training word.
  • The training label of a training word represents the target output corresponding to the training word.
  • During training, the context feature vector of each predicted character is likewise obtained according to the pronunciation feature vectors, the text feature vectors and the character acoustic vectors, and the output text is obtained on that basis.
  • The target loss of the context recognition sub-model can then be determined according to the output text and the training labels corresponding to the training words.
  • For example, the cross-entropy loss can be calculated from the output text and the training labels and determined as the target loss.
  • The update condition may be that the target loss is greater than a preset loss threshold, which means that the recognition accuracy of the context recognition sub-model is insufficient.
  • Alternatively, the update condition may be that the number of iterations is less than a preset iteration threshold; in this case the context recognition sub-model is considered to have undergone relatively few iterations, so its recognition accuracy is insufficient.
  • When the update condition is met, the model parameters of the context recognition sub-model can be updated according to the target loss.
  • The model parameters may be updated from the determined loss with updating methods commonly used in the field, such as gradient descent, which are not repeated here.
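  • For illustration, a minimal self-contained torch sketch of this update loop (the linear stand-in model, SGD optimizer, thresholds and random data are all assumptions):

```python
import torch
import torch.nn as nn

context_submodel = nn.Linear(256, 1000)    # toy stand-in for the context recognition sub-model
criterion = nn.CrossEntropyLoss()          # cross-entropy as the target loss
optimizer = torch.optim.SGD(context_submodel.parameters(), lr=1e-3)  # gradient descent
loss_threshold, max_iters = 0.1, 1000      # assumed update-condition settings

for it in range(max_iters):
    feats = torch.randn(32, 256)           # placeholder target feature vectors
    labels = torch.randint(0, 1000, (32,)) # placeholder training label ids
    loss = criterion(context_submodel(feats), labels)
    if loss.item() <= loss_threshold:      # update condition no longer met: stop training
        break
    optimizer.zero_grad()                  # only the context recognition sub-model is updated;
    loss.backward()                        # the trained speech recognition sub-model stays frozen
    optimizer.step()
```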
  • When the update condition is no longer met, the training process can be stopped to obtain the trained context recognition sub-model, and thereby the trained speech recognition model.
  • Throughout this process, the model parameters of the trained speech recognition sub-model are kept unchanged.
  • The context recognition sub-model can thus be added on the basis of an already trained speech recognition sub-model to realize the speech recognition model, which improves the scalability and application range of the training method and provides more accurate prior contextual knowledge while preserving the accuracy of the speech recognition sub-model, thereby improving the recognition accuracy of the speech recognition model.
  • The present disclosure also provides a speech recognition apparatus. As shown in FIG. 4, the apparatus 40 includes:
  • a receiving module 41 configured to receive speech data to be recognized;
  • and a processing module 42 configured to obtain the target text corresponding to the speech data to be recognized according to the speech data to be recognized, the hot word information and the speech recognition model, wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • In some embodiments, the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • and the processing module includes:
  • a first processing sub-module configured to obtain the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized according to the speech recognition sub-model and the speech data to be recognized;
  • and a first determining sub-module configured to determine the target text corresponding to the speech data to be recognized according to the text probability distribution and the context probability distribution.
  • In some embodiments, the second processing sub-module includes:
  • a fourth determining sub-module configured to determine the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words as the initial weight corresponding to the hot word;
  • a third processing sub-module configured to normalize the initial weight corresponding to each of the hot words to obtain the target weight corresponding to each of the hot words;
  • and a calculation sub-module configured to:
  • perform a weighted calculation on the text feature vector of each hot word according to the updated target weight corresponding to the hot word to obtain the context feature vector.
  • In some embodiments, the first decoding sub-module includes:
  • a fourth processing sub-module configured to obtain, for each predicted character, the target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
  • and a second decoding sub-module configured to decode the target feature vector according to the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
  • In some embodiments, the phonetic symbol sequence, text sequence and training label of each training word are determined in the following manner:
  • for each of the training words, the phonetic symbol sequence of the training word is determined according to the language of the training word, and the text sequence is determined from the training labeled text;
  • and text other than the training word in the training labeled text corresponding to the training word is replaced with preset labels, so as to generate the training label corresponding to the training word.
  • Referring now to FIG. 5, it shows a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present disclosure.
  • The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players) and vehicle-mounted terminals (such as car navigation terminals), and fixed terminals such as digital TVs and desktop computers.
  • The electronic device shown in FIG. 5 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
  • As shown in FIG. 5, the electronic device 600 may include a processing device (such as a central processing unit or a graphics processing unit) 601, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603.
  • The processing device 601, the ROM 602 and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to the bus 604.
  • Generally, the following devices can be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; output devices 607 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; storage devices 608 including, for example, a magnetic tape and a hard disk; and a communication device 609.
  • The communication device 609 may allow the electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 5 shows the electronic device 600 as having various devices, it should be understood that it is not required to implement or possess all of the devices shown; more or fewer devices may alternatively be implemented or provided.
  • Embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart.
  • In such embodiments, the computer program may be downloaded and installed from a network via the communication device 609, or installed from the storage device 608, or installed from the ROM 602.
  • When the computer program is executed by the processing device 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • It should be noted that the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two.
  • A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus or device.
  • A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • Program code embodied on a computer-readable medium may be transmitted by any appropriate medium, including but not limited to a wire, an optical cable, RF (radio frequency) and the like, or any suitable combination of the above.
  • In some embodiments, the client and the server can communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), an internetwork (e.g., the Internet) and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
  • The above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: receive the speech data to be recognized;
  • and obtain, according to the speech data to be recognized, the hot word information and the speech recognition model, the target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on the training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages, such as Java, Smalltalk and C++, as well as conventional procedural programming languages, such as “C” or similar programming languages.
  • The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • In cases involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowcharts or block diagrams may represent a module, program segment or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures; for example, two blocks shown in succession may in fact be executed substantially concurrently, or sometimes in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • The modules involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a module does not, in some cases, constitute a limitation on the module itself; for example, the receiving module can also be described as “a module that receives speech data to be recognized”.
  • For example, and without limitation, exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus or device.
  • The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • Example 1 provides a speech recognition method, the method comprising: receiving speech data to be recognized;
  • and obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 2 provides the method of Example 1, wherein the context recognition sub-model includes a pronunciation feature encoder, a text feature encoder, an attention module and a context feature decoder;
  • the character acoustic vector and text probability distribution of each predicted character corresponding to the speech data to be recognized are obtained according to the speech recognition sub-model and the speech data to be recognized;
  • and the target text corresponding to the speech data to be recognized is determined according to the text probability distribution and the context probability distribution.
  • Example 3 provides the method of Example 2, wherein obtaining the context feature vector of each predicted character includes: for each hot word, determining the fusion feature vector corresponding to the hot word according to the pronunciation feature vector and the text feature vector of the hot word; and, for each of the predicted characters, determining, in the attention module, the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words.
  • Example 4 provides the method of Example 3, wherein determining the context feature vector corresponding to the predicted character according to the character acoustic vector of the predicted character and the fusion feature vector and text feature vector corresponding to each of the hot words includes:
  • determining the dot product of the character acoustic vector and the fusion feature vector corresponding to each of the hot words as the initial weight corresponding to the hot word;
  • normalizing the initial weight corresponding to each of the hot words to obtain the target weight corresponding to each of the hot words;
  • and performing a weighted calculation on the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector.
  • Example 5 provides the method of Example 4, wherein determining the context feature vector corresponding to the predicted character further includes: updating to zero the target weights ranked after the M-th position when the target weights are sorted from large to small, M being a positive integer;
  • and performing the weighted calculation on the text feature vectors of the hot words according to the target weight corresponding to each of the hot words to obtain the context feature vector includes:
  • performing the weighted calculation on the text feature vectors of the hot words according to the updated target weights corresponding to each of the hot words to obtain the context feature vector.
  • Example 6 provides the method of Example 2, wherein obtaining the context probability distribution of each of the predicted characters according to the context feature decoder and the context feature vector includes: for each predicted character, obtaining a target feature vector of the predicted character according to the character acoustic vector of the predicted character and the context feature vector;
  • and decoding the target feature vector according to the context feature decoder to obtain the context probability distribution corresponding to each predicted character.
  • Example 7 provides the method of any one of Examples 1-6, wherein the phonetic symbol sequence, text sequence and training label of each training word are determined in the following manner: for each of the training words, determining the phonetic symbol sequence of the training word according to the language of the training word, and determining the text sequence from the training labeled text;
  • and replacing text other than the training words in the training labeled text corresponding to the training words with preset labels, so as to generate the training labels corresponding to the training words.
  • Example 8 provides a speech recognition apparatus, the apparatus comprising:
  • a receiving module configured to receive speech data to be recognized;
  • and a processing module configured to obtain target text corresponding to the speech data to be recognized according to the speech data to be recognized, hot word information and a speech recognition model; wherein the hot word information includes text sequences and phonetic symbol sequences corresponding to a plurality of hot words; the speech recognition model includes a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained based on training words and the phonetic symbol sequences, text sequences and training labels of the training words.
  • Example 9 provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of any one of the methods described in Examples 1-7 are implemented.
  • Example 10 provides an electronic device, comprising: a storage device on which a computer program is stored;
  • and a processing device configured to execute the computer program in the storage device, so as to implement the steps of the method in any one of Examples 1-7.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method and apparatus, a computer-readable medium and a device are disclosed. The method comprises: receiving speech data to be recognized (11); and obtaining, according to the speech data to be recognized, hot word information and a speech recognition model, target text corresponding to the speech data to be recognized (12), wherein the hot word information comprises text sequences and phonetic symbol sequences corresponding to a plurality of hot words, the speech recognition model comprises a speech recognition sub-model and a context recognition sub-model, and the context recognition sub-model is trained on the basis of training words and the phonetic symbol sequences, text sequences and training labels of the training words. Therefore, when the context recognition sub-model is trained, training is performed on both the pronunciation features and the text features of the training data, so that hot words with similar spellings or pronunciations can be accurately distinguished on the basis of the pronunciation features, which avoids confusion between hot words during recognition and further improves the accuracy of speech recognition.
PCT/CN2022/089595 2021-06-30 2022-04-27 Speech recognition method and apparatus, medium and device WO2023273578A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110735672.7 2021-06-30
CN202110735672.7A CN113470619B (zh) 2021-06-30 2021-06-30 Speech recognition method, apparatus, medium and device

Publications (1)

Publication Number Publication Date
WO2023273578A1 (fr)

Family

ID=77876448

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/089595 WO2023273578A1 (fr) 2022-04-27 2021-06-30 Speech recognition method and apparatus, medium and device

Country Status (2)

Country Link
CN (1) CN113470619B (fr)
WO (1) WO2023273578A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113470619B (zh) 2021-06-30 2023-08-18 北京有竹居网络技术有限公司 Speech recognition method, apparatus, medium and device
CN114036959A (zh) * 2021-11-25 2022-02-11 北京房江湖科技有限公司 Method and apparatus for determining conversation context, computer program product and storage medium
CN115713939B (zh) * 2023-01-06 2023-04-21 阿里巴巴达摩院(杭州)科技有限公司 Speech recognition method and apparatus, and electronic device
CN116110378B (zh) * 2023-04-12 2023-07-18 中国科学院自动化研究所 Model training method, speech recognition method, apparatus and electronic device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160023424A (ko) * 2014-08-22 2016-03-03 현대자동차주식회사 Speech recognition apparatus, vehicle including the same, and method for controlling the vehicle
CN110706690A (zh) * 2019-09-16 2020-01-17 平安科技(深圳)有限公司 Speech recognition method and apparatus
CN111583909A (zh) * 2020-05-18 2020-08-25 科大讯飞股份有限公司 Speech recognition method, apparatus, device and storage medium
CN111816165A (zh) * 2020-07-07 2020-10-23 北京声智科技有限公司 Speech recognition method and apparatus, and electronic device
CN111933129A (zh) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method, apparatus and computer device
CN112489646A (zh) * 2020-11-18 2021-03-12 北京华宇信息技术有限公司 Speech recognition method and apparatus
CN113470619A (zh) * 2021-06-30 2021-10-01 北京有竹居网络技术有限公司 Speech recognition method, apparatus, medium and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9202465B2 (en) * 2011-03-25 2015-12-01 General Motors Llc Speech recognition dependent on text message content
US9305554B2 (en) * 2013-07-17 2016-04-05 Samsung Electronics Co., Ltd. Multi-level speech recognition
CN105719649B (zh) * 2016-01-19 2019-07-05 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
IL252071A0 (en) * 2017-05-03 2017-07-31 Google Inc Contextual language translation
CN110689881B (zh) * 2018-06-20 2022-07-12 深圳市北科瑞声科技股份有限公司 Speech recognition method and apparatus, computer device and storage medium
CN110896664B (zh) * 2018-06-25 2023-12-26 谷歌有限责任公司 Hotword-aware speech synthesis
CN109815322B (zh) * 2018-12-27 2021-03-12 东软集团股份有限公司 Response method and apparatus, storage medium and electronic device
CN110517692A (zh) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word speech recognition method and apparatus
CN110544477A (zh) * 2019-09-29 2019-12-06 北京声智科技有限公司 Speech recognition method, apparatus, device and medium
CN110879839A (zh) * 2019-11-27 2020-03-13 北京声智科技有限公司 Hot word recognition method, apparatus and system

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116705058A (zh) * 2023-08-04 2023-09-05 贝壳找房(北京)科技有限公司 Processing method for multi-modal speech tasks, electronic device and readable storage medium
CN116705058B (zh) * 2023-08-04 2023-10-27 贝壳找房(北京)科技有限公司 Processing method for multi-modal speech tasks, electronic device and readable storage medium
CN117437909A (zh) * 2023-12-20 2024-01-23 慧言科技(天津)有限公司 Speech recognition model construction method based on a hot word feature vector self-attention mechanism
CN117437909B (zh) * 2023-12-20 2024-03-05 慧言科技(天津)有限公司 Speech recognition model construction method based on a hot word feature vector self-attention mechanism

Also Published As

Publication number Publication date
CN113470619A (zh) 2021-10-01
CN113470619B (zh) 2023-08-18

Similar Documents

Publication Publication Date Title
WO2023273578A1 (fr) Procédé et appareil de reconnaissance vocale, support et dispositif
JP7112536B2 (ja) テキストにおける実体注目点のマイニング方法および装置、電子機器、コンピュータ読取可能な記憶媒体並びにコンピュータプログラム
CN111583903B (zh) 语音合成方法、声码器训练方法、装置、介质及电子设备
WO2023273611A1 (fr) Procédé et appareil d'apprentissage de modèle de reconnaissance de la parole, procédé et appareil de reconnaissance de la parole, support et dispositif
CN111368559A (zh) 语音翻译方法、装置、电子设备及存储介质
WO2023273612A1 (fr) Procédé et appareil d'entraînement pour modèle de reconnaissance de la parole, procédé et appareil de reconnaissance de la parole, support et dispositif
CN112509562B (zh) 用于文本后处理的方法、装置、电子设备和介质
WO2022247562A1 (fr) Procédé et appareil de récupération de données multimodales, et support et dispositif électronique
WO2020182123A1 (fr) Procédé et dispositif d'envoi d'instructions
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
CN112883967B (zh) 图像字符识别方法、装置、介质及电子设备
CN112883968B (zh) 图像字符识别方法、装置、介质及电子设备
WO2023273610A1 (fr) Procédé et appareil de reconnaissance de la parole, support et dispositif électronique
WO2023143016A1 (fr) Procédé et appareil de génération de modèle d'extraction de caractéristiques, et procédé et appareil d'extraction de caractéristiques d'image
WO2023005763A1 (fr) Procédé et appareil de traitement d'informations, et dispositif électronique
CN111368560A (zh) 文本翻译方法、装置、电子设备及存储介质
CN115270717A (zh) 一种立场检测方法、装置、设备及介质
CN114765025A (zh) 语音识别模型的生成方法、识别方法、装置、介质及设备
CN111090993A (zh) 属性对齐模型训练方法及装置
CN113140012B (zh) 图像处理方法、装置、介质及电子设备
CN113033707A (zh) 视频分类方法、装置、可读介质及电子设备
CN112069786A (zh) 文本信息处理方法、装置、电子设备及介质
CN114625876B (zh) 作者特征模型的生成方法、作者信息处理方法和装置
CN116244431A (zh) 文本分类方法、装置、介质及电子设备
CN113947060A (zh) 文本转换方法、装置、介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831406

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22831406

Country of ref document: EP

Kind code of ref document: A1