WO2020156342A1 - Speech recognition method, apparatus, electronic device and storage medium - Google Patents

Speech recognition method, apparatus, electronic device and storage medium

Info

Publication number
WO2020156342A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
decoding network
path
corpus
node
Prior art date
Application number
PCT/CN2020/073328
Other languages
English (en)
French (fr)
Inventor
王杰
钟贵平
***
吴本谷
陈江
Original Assignee
北京猎户星空科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京猎户星空科技有限公司
Publication of WO2020156342A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/065 Adaptation
    • G10L15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • This application relates to the field of speech recognition technology, and in particular to a speech recognition method, device, electronic equipment, and storage medium.
  • the speech recognition system mainly includes a set of acoustic models, language models and decoders.
  • the accuracy of speech recognition mainly depends on the language model. As the user's personalization needs become higher and higher, different language models need to be trained for different users to provide proprietary speech recognition services.
  • the training method for a personalized language model is to train the general language model with the user's own corpus to generate a user-specific language model, deploy a set of dedicated speech recognition services for each user, and periodically update the language model to meet the user's individual needs.
  • the embodiments of the present application provide a voice recognition method, device, electronic equipment, and storage medium to solve the problem in the prior art that a set of dedicated voice recognition services must be deployed for each user to meet personalized customization needs, resulting in a serious waste of resources.
  • an embodiment of the present application provides a voice recognition method, including:
  • the searching for the optimal path corresponding to the input voice in the decoding network according to the user ID includes:
  • the searching for the optimal path corresponding to the input voice in the decoding network according to the user ID includes:
  • according to the language model corresponding to the user ID, search for the optimal path corresponding to the input voice in the decoding network.
  • the decoding network is constructed based on a full dictionary.
  • update the language model corresponding to the user ID in the following manner:
  • the probability score, corresponding to the user ID, marked on the path between the corresponding word nodes in the decoding network is updated.
  • the determining that the language model corresponding to the user ID needs to be updated includes:
  • the detecting whether the corpus corresponding to the user ID has been updated includes:
  • the method further includes:
  • for each phoneme node in the decoding network, select the maximum of the appearance frequency scores, corresponding to the user ID, of the target word nodes reachable from that phoneme node, and use it as the latest look-ahead probability, corresponding to the user ID, of the path from the phoneme node to the target word nodes;
  • the look-ahead probability corresponding to the user ID of the path from the phoneme node to the target word node in the decoding network is updated.
  • obtaining the appearance frequency score corresponding to each word node includes:
  • the frequency of the word node is normalized to obtain the appearance frequency score corresponding to the word node.
  • an embodiment of the present application provides a voice recognition device, including:
  • the obtaining module is used to obtain the input voice and the user ID corresponding to the input voice;
  • the decoding module is used to search for the optimal path corresponding to the input voice in the decoding network according to the user ID, and the path between the word nodes in the decoding network is marked with the user ID;
  • the determining module is used to determine the text information corresponding to the input voice according to the optimal path.
  • the decoding module is specifically configured to: determine the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.
  • the decoding module is specifically configured to:
  • according to the language model corresponding to the user ID, search for the optimal path corresponding to the input voice in the decoding network.
  • the decoding network is constructed based on a full dictionary.
  • model update module for:
  • the probability score, corresponding to the user ID, marked on the path between the corresponding word nodes in the decoding network is updated.
  • model update module is specifically configured to:
  • model update module is specifically configured to:
  • model update module is also used to:
  • for each phoneme node in the decoding network, select the maximum of the appearance frequency scores, corresponding to the user ID, of the target word nodes reachable from that phoneme node, and use it as the latest look-ahead probability, corresponding to the user ID, of the path from the phoneme node to the target word nodes;
  • the look-ahead probability corresponding to the user ID of the path from the phoneme node to the target word node in the decoding network is updated.
  • model update module is specifically configured to:
  • the frequency of the word node is normalized to obtain the appearance frequency score corresponding to the word node.
  • an embodiment of the present application provides an electronic device, including a transceiver, a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the transceiver is used to receive and send data under the control of the processor, and the processor implements the steps of any of the above methods when executing the program.
  • an embodiment of the present application provides a computer-readable storage medium having computer program instructions stored thereon, and when the program instructions are executed by a processor, the steps of any of the above methods are implemented.
  • the present application also provides a computer program product; the computer program product includes a computer program stored on a computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a processor, the steps of any of the above voice recognition methods are implemented.
  • the technical solution provided by the embodiments of the present application marks user IDs on the paths between word nodes in the constructed decoding network, so that, when the decoding network is used to recognize speech, only the paths marked with the user ID are searched according to that user ID;
  • the optimal path is selected from the multiple paths found, and the text information corresponding to the input voice is determined according to the optimal path, so that different users obtain different recognition results based on the same decoding network. Therefore, only one set of decoding network needs to be deployed on the server side.
  • the decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.
  • FIG. 1 is a schematic diagram of an application scenario of a speech recognition method provided by an embodiment of the application
  • FIG. 2 is a schematic flowchart of a voice recognition method provided by an embodiment of this application.
  • FIG. 3 is an example of a local network in a decoding network provided by an embodiment of this application.
  • Fig. 4 is an example of a path between word nodes in a decoding network provided by an embodiment of the application
  • FIG. 5 is another example of a local network in a decoding network provided by an embodiment of this application.
  • FIG. 6 is an example of a local network in a decoding network constructed based on language models of multiple users according to an embodiment of the application;
  • FIG. 7 is a schematic flowchart of a method for updating a language model corresponding to a user ID according to an embodiment of the application
  • FIG. 8 is a schematic structural diagram of a speech recognition device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • language model (Language Model, LM) is to establish a distribution that can describe the probability of occurrence of a given word sequence in a language.
  • the language model is a model that describes the probability distribution of words, a model that can reliably reflect the probability distribution of words used in language recognition.
  • Language models occupy an important position in natural language processing and have been widely used in speech recognition, machine translation and other fields.
  • a language model can be used to obtain the most likely word sequence among multiple word sequences in speech recognition, or given several words, predict the next most likely word, etc.
  • Commonly used language models include N-Gram LM (N-gram language model), such as Bi-Gram LM (bigram language model) and Tri-Gram LM (trigram language model).
  • Acoustic model
  • the current mainstream systems mostly use hidden Markov models for modeling.
  • a dictionary is a collection of phonemes corresponding to words and describes the mapping relationship between words and phonemes.
  • Phoneme is the smallest unit of speech. It is analyzed based on the pronunciation actions in syllables. One action constitutes a phoneme.
  • Phonemes in Chinese are divided into two categories: initials and finals. For example, initials include b, p, m, f, d, t, etc., and finals include a, o, e, i, u, ü, ai, ei, ao, an, ian, ong, iong, etc.
  • Phonemes in English are divided into vowels and consonants. For example, vowels include a, e, ai, etc., and consonants include p, t, h, etc.
  • Look-ahead probability: in order not to prune paths with lower acoustic scores in the middle of decoding, the occurrence probability scores based on the language model, which represent how frequently each word occurs, are generally decomposed along the paths.
  • Language model look-ahead introduces the occurrence probability score of each word node onto the paths from a phoneme node to the word nodes it can reach in the decoding network, and uses the maximum of these occurrence probability scores as the look-ahead probability on those paths.
  • The look-ahead probability is added to the score of a path, which significantly raises the scores of paths with lower acoustic scores but higher probability scores, so that such paths are not cut during pruning.
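The pruning role of look-ahead described above can be sketched roughly as follows. This is a minimal illustration with made-up word probabilities and a hand-built reachability table, not the patent's actual implementation:

```python
import math

# Hypothetical word-occurrence probabilities from a language model.
WORD_PROB = {"Beijing": 0.08, "beige": 0.001}

# Word nodes still reachable from each phoneme node in the trie-shaped
# decoding network (illustrative).
REACHABLE = {
    "b": ["Beijing", "beige"],
    "b-ei": ["Beijing", "beige"],
    "b-ei-j": ["Beijing"],
}

def look_ahead(phoneme_node):
    """Look-ahead probability of a phoneme node: the maximum occurrence
    probability over all word nodes reachable from it."""
    return max(WORD_PROB[w] for w in REACHABLE[phoneme_node])

def pruning_score(acoustic_log_score, phoneme_node):
    """Score used during beam pruning: the acoustic score plus the log
    look-ahead, so paths leading to probable words are not cut early."""
    return acoustic_log_score + math.log(look_ahead(phoneme_node))
```

A path currently at phoneme node "b" with a mediocre acoustic score is thus credited with the best word ("Beijing") it might still reach.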
  • the training method of personalized language model is to use the user's own corpus to train the general language model to generate a user-specific language model, and deploy a set of special speech recognition services for each user.
  • the language model is periodically updated to meet the individual needs of users.
  • deploying a set of specialized speech recognition services for each user will cause a serious waste of resources and generate huge expenses.
  • the inventor of the present application proposes marking user IDs on the paths between word nodes in the constructed decoding network, so that, when the decoding network is used to recognize speech, only the paths marked with the user ID are searched according to that user ID;
  • the optimal path is selected from the multiple paths found, and the text information corresponding to the input voice is determined according to the optimal path, so that different users obtain different recognition results based on the same decoding network. Therefore, only one set of decoding network needs to be deployed on the server side.
  • the decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.
  • the full vocabulary is used to construct the decoding network, so that the constructed decoding network can be applied to multiple users.
  • the decoding network constructed based on the full vocabulary can also realize online update of the language model corresponding to each user.
  • when a user's language model needs to be updated, it is only necessary to recalculate, according to that user's updated language model, the probability scores of the paths between word nodes in the decoding network, and to update that user's probability scores in the decoding network based on the user ID; the changes brought by the updated language model are thus introduced into the decoding network, and performing a path search over the decoding network with the updated probability scores yields recognition results that meet the user's personalized needs.
  • FIG. 1 is a schematic diagram of an application scenario of the voice recognition method provided by an embodiment of the application.
  • Multiple users 10 jointly use the voice recognition service provided by the decoder in the same server 12.
  • the smart device 11 sends the voice signal input by the user 10 to the server 12.
  • the server 12 decodes the voice signal through the decoding network in the decoder to obtain the text information corresponding to the voice signal, and
  • the decoded text information is fed back to the smart device 11 to complete the voice recognition service.
  • the smart device 11 and the server 12 communicate through a network, and the network may be a local area network, a wide area network, or the like.
  • the smart device 11 can be a smart speaker, a robot, etc., a portable device (for example, a mobile phone, a tablet, or a laptop), or a personal computer (PC), and the server 12 can be any server device capable of providing voice recognition services.
  • an embodiment of the present application provides a voice recognition method, including the following steps:
  • S201 Acquire an input voice and a user ID corresponding to the input voice.
  • the smart terminal can send the collected input voice and user ID to the server, and the server performs voice recognition on the input voice according to the user ID.
  • one user ID corresponds to one language model, and the corpus corresponding to each user ID is used to train each user-specific language model.
  • the user ID in this embodiment may be enterprise-level, that is, the user ID identifies an enterprise, each enterprise corresponds to one language model, and the smart devices under that enterprise all use the same language model.
  • the user ID can also be device-level, that is, the user ID is used to identify a type or device.
  • a type of device or a device corresponds to a language model.
  • for example, a smart speaker corresponds to a language model about music, and a chat robot corresponds to a language model about chatting, so that different devices can use the same decoding network.
  • the user ID can also be service-level, that is, each service corresponds to one language model, and the smart devices under that service use that language model; and so on.
  • the embodiments of the present application do not limit the specific implementation of the user ID, and can be configured according to actual application scenarios or requirements.
  • S202: According to the user ID, search for the optimal path corresponding to the input voice in the decoding network, where the paths between word nodes in the decoding network are marked with user IDs.
  • the decoding network is a network diagram representing the relationship between phonemes and words and between words and words.
  • the decoding network can be constructed based on the acoustic model and the corpus and language model corresponding to the multiple users.
  • the specific construction method is as follows:
  • the first step is to obtain a dictionary containing all the words in the corpus based on the corpus corresponding to each user ID, and convert the words in the dictionary into phoneme strings.
  • for example, the phoneme string of "open" (开) is "k-ai",
  • and the phoneme string of "Beijing" (北京) is "b-ei-j-ing";
  • the phoneme string of a word and the word itself form a path.
  • the path corresponding to "open" is "k-ai-open", and the path corresponding to "Beijing" is "b-ei-j-ing-Beijing".
  • the second step is to merge the nodes in the paths corresponding to all the words in the dictionary, that is, to merge identical phonemes in the paths into one node, forming a network of the phoneme strings corresponding to all words; each phoneme is a phoneme node in the network.
  • Figure 3 shows an example of a partial network in the decoding network.
  • the "k" in the phoneme strings of words such as "ka", "kai", and "ke" is merged into one node of the network.
  • the last node of each path in the network corresponds to the word formed by the phoneme string on that path; as shown in Figure 3, the path "k-a" ends at the word node "ka", and the path "k-a-ch-e" ends at the word node "truck".
  • the node corresponding to the phoneme in the decoding network is called the phoneme node, and the node corresponding to the vocabulary is called the word node.
  • the method for generating a decoding network based on a dictionary is an existing technology and will not be described in detail.
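The prefix merging of the second step can be pictured as building a trie: words with identical leading phonemes share nodes, and each complete phoneme string ends at a word node. The sketch below uses a toy, romanized lexicon mirroring the "ka"/"truck" example; the data-structure choice is an assumption for illustration only:

```python
def build_phoneme_network(lexicon):
    """Merge the phoneme strings of all dictionary words into a prefix
    network (trie): identical phoneme prefixes share phoneme nodes, and
    each complete path ends in a terminal word node."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})   # shared phoneme node
        node["<word>"] = word               # terminal word node
    return root

# Toy lexicon: "ka" and "truck" share the phonemes "k", "a".
lexicon = {
    "ka":    ["k", "a"],
    "kai":   ["k", "ai"],
    "truck": ["k", "a", "ch", "e"],
}
net = build_phoneme_network(lexicon)
```

After construction, all three words hang off a single "k" node, as in the Figure 3 example.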
  • the third step is to determine the acoustic score between the connected phoneme nodes in the decoding network constructed in the second step according to the acoustic model.
  • multiple users can share one acoustic model.
  • the fourth step is to determine, for each user ID according to its language model, the connection relationships and probability scores between the words in the dictionary, establish connections between word nodes in the decoding network constructed in the second step according to these connection relationships, and mark the user ID and the user's probability score on the paths between word nodes.
  • according to the language model, the conditional probability p(W2|W1) of a word W2 appearing after a word W1 can be determined, and this conditional probability p(W2|W1) is used as the probability score of the path from the word node W1 to the word node W2.
  • for example, the corpus for training the language model includes "My home is in Beijing" (我家在北京), and the words in the corpus include "I" (我), "home" (家), "in" (在), and "Beijing" (北京); then in the decoding network, connections are established between the word nodes "I" and "home", between "home" and "in", and between "in" and "Beijing", and the probability scores on the paths "I"-"home", "home"-"in", and "in"-"Beijing" are marked according to the language model.
  • Fig. 4 is an example of the paths between word nodes in the decoding network; Fig. 4 omits the network relationships between phoneme nodes and word nodes.
  • the word node "I" is connected to the first phoneme node of "home";
  • SA1, SA2, and SA3 represent acoustic scores;
  • SL1 represents the probability score, corresponding to user ID1, of the path from the word node "I" to "home";
  • SL2 represents the probability score, corresponding to user ID2, of the path from the word node "I" to "home".
  • the probability score of each user ID is marked on the corresponding path in the decoding network, so that during decoding the paths corresponding to the user can be selected according to the user ID, and the optimal path for the input voice can be determined based on the probability scores on those paths.
  • a decoding network that can be used by multiple users can be obtained. Pre-loading the constructed decoding network into the decoder of the server can provide voice recognition services for these multiple users.
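One way to picture the result of the fourth step is a table of word-to-word edges, each carrying one probability score per user ID, so many users share one network. All IDs and scores below are illustrative, not from the patent:

```python
import math

# Word-to-word paths in the shared decoding network.  Each path stores
# a dict of {user_id: probability_score}; a path exists for a user only
# if it is marked with that user's ID (scores are hypothetical bigram
# probabilities p(next | previous)).
edges = {
    ("I", "home"): {"ID1": 0.30, "ID2": 0.12},
    ("home", "in"): {"ID1": 0.25},
    ("in", "Beijing"): {"ID1": 0.40},
    ("in", "Suzhou"): {"ID2": 0.35},
}

def edge_score(prev_word, next_word, user_id):
    """Log probability score of a word-to-word path for one user;
    None if the path is not marked with this user's ID."""
    scores = edges.get((prev_word, next_word), {})
    if user_id not in scores:
        return None
    return math.log(scores[user_id])
```

During decoding for ID2, the path "in"-"Beijing" simply does not exist, while for ID1 it does; this is the mechanism that lets one network serve many users.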
  • S203 Determine text information corresponding to the input voice according to the optimal path.
  • the process of speech recognition includes: preprocessing the speech signal and extracting its acoustic feature vector; inputting the acoustic feature vector into the acoustic model to obtain a phoneme sequence; and, based on the phoneme sequence and the user ID corresponding to the speech signal, searching the decoding network for the highest-scoring path as the optimal path and determining the text sequence corresponding to the optimal path as the recognition result of the voice signal.
  • the optimal path is determined according to the total score of each path.
  • the total score of the path is determined according to the acoustic score on the path and the probability score corresponding to the user ID.
  • the decoding score of a path can be calculated by the following formula:
  • Score(L) = Σ_i log SA_i + Σ_j log SL_j,x
  • where L is a decoding path;
  • SA_i is the i-th acoustic score on the path L;
  • SL_j,x is the j-th probability score on the path L corresponding to the user whose user ID is x.
  • for example, the score of the decoding result "My home" corresponding to user ID1 is (log SA1 + log SA2 + log SA3 + log SL1).
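The scoring just described, a sum of log acoustic scores SA_i plus the log probability scores SL_j marked for the user's ID, can be sketched as below; the numeric scores are invented for illustration:

```python
import math

def path_score(acoustic_scores, lm_scores):
    """Total decoding score of a path L: the sum of the log acoustic
    scores SA_i on L plus the sum of the log probability scores SL_j
    marked on L for the current user ID."""
    return (sum(math.log(s) for s in acoustic_scores)
            + sum(math.log(s) for s in lm_scores))

# "My home" example: three acoustic scores and one LM probability
# score for user ID1 (values are hypothetical).
SA = [0.9, 0.8, 0.7]      # SA1, SA2, SA3
SL_id1 = [0.3]            # SL1 for user ID1
score = path_score(SA, SL_id1)
```

The decoder would compute this total for every candidate path available to the user and keep the highest-scoring one as the optimal path.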
  • the user ID is marked on the path between the word nodes in the decoding network.
  • during decoding, the paths available to the user are selected according to the user IDs on the paths, so that different users obtain different recognition results based on the same decoding network.
  • FIG. 6 is a partial example of a decoding network generated based on the language models of multiple users; due to space limitations, some phoneme nodes in FIG. 6 are not shown. Taking Figure 6 as an example, when the voice signal of user ID1 is recognized, the path between the word nodes "in" and "Beijing" is marked with "ID1",
  • so the selected path is "in-Beijing" rather than the other two paths in Figure 6; when recognizing the voice signal of user ID2, the selected paths are the two paths "in-Suzhou" and "in-Jiangsu" marked with ID2.
  • the speech recognition method of the embodiment of the present application only needs to deploy a set of decoding network on the server side.
  • the decoding network integrates multiple user-specific language models and can provide personalized speech recognition services for multiple users while saving hardware resources.
  • step S202 specifically includes: determining the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.
  • the user ID is used in the decoding network to distinguish the probability scores of different users, so that multiple users can share a decoding network.
  • the probability scores of the user ID marked on the decoding network paths are used to calculate the total score of each path, the path with the highest total score is selected as the optimal path, and the recognition result is determined based on the optimal path.
  • step S202 specifically includes: obtaining the language model corresponding to the user ID according to the user ID, and searching for the optimal path corresponding to the input voice in the decoding network according to that language model.
  • each user ID corresponds to a language model, which is trained based on the corpus in the corpus corresponding to the user ID.
  • specifically, the language model corresponding to the user ID is obtained based on the user ID of the input voice, and that language model is used to search for the optimal path corresponding to the input voice, providing personalized voice recognition services for different users.
  • for each user, its unique language model is loaded into the decoder in advance according to the user ID, while the language models of other user IDs are not loaded for it, so that multiple users can share one general decoding network while each maintains its own characteristic language model.
  • the embodiment of the present application uses a full dictionary to construct a decoding network shared by multiple users.
  • the full dictionary in the embodiment of this application is a dictionary containing a large number of commonly used words.
  • the full dictionary contains more than 100,000 entries, which can cover different topics in multiple fields.
  • the vocabulary in the full dictionary includes both single characters and words.
  • the full dictionary can cover all the words contained in the corpus corresponding to the user ID.
  • the method of constructing a decoding network shared by multiple users based on a full dictionary is similar to the method of constructing a decoding network based on a corpus corresponding to multiple users, and will not be repeated here.
  • a full dictionary is used to construct a decoding network, so that the constructed decoding network can be applied to more users.
  • when a new user is added, the nodes in the decoding network (including word nodes and phoneme nodes) do not need to be rebuilt. This means there is no need to reconstruct the decoding network or restart the decoder, so new users can be added online, ensuring that users obtain speech recognition services without interruption and improving user experience.
  • the embodiment of the present application can update the language model corresponding to each user ID through the following steps:
  • S701: Determine that the language model corresponding to the user ID needs to be updated, by detecting whether the corpus corresponding to the user ID has been updated; if the corpus corresponding to the user ID has been updated, it is determined that the language model corresponding to the user ID needs to be updated.
  • collect the corpus corresponding to each user ID and store the corpus in the corpus corresponding to the user ID.
  • for example, for a smart speaker, music-related corpus can be collected; for an individual user, the corpus input while the user uses a smart device can be collected and stored in the user's corpus, so as to continuously update the user's language model and improve the accuracy of speech recognition.
  • it is possible to check regularly or periodically whether the corpus corresponding to each user ID has been updated; if it is detected that the corpus corresponding to a certain user ID has been updated, the corpus corresponding to that user ID is used to train the corresponding language model, thereby updating it.
  • the detection time or detection period can be set according to actual conditions, which is not limited in this embodiment.
  • through regular or periodic detection tasks, it is possible to detect in time whether the corpus has been updated and to update the language model accordingly, making model updating more automated and saving manpower.
  • the following steps can be used to detect whether the corpus in the corpus store has been updated: calculate the first digest value of all corpora in the corpus corresponding to the user ID; compare the first digest value with the second digest value; if the first digest value differs from the second digest value, confirm that the corpus corresponding to the user ID has been updated; if they are the same, confirm that the corpus has not been updated and the language model corresponding to the user ID does not need to be updated.
  • the second digest value is the digest value of all the corpora in the corpus corresponding to the user ID as of the most recent update.
  • the MD5 Message-Digest Algorithm can be used to generate the digest values of all corpora in the corpus.
  • the first digest value of the corpus corresponding to the user ID can be stored as the second digest value used the next time the corpus is checked for updates.
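A minimal sketch of this digest comparison, assuming the corpus is a list of text lines and using Python's hashlib MD5; sorting the lines before hashing is an added assumption here, to make the digest independent of storage order:

```python
import hashlib

def corpus_digest(corpus_lines):
    """MD5 digest over all corpus entries for one user ID."""
    md5 = hashlib.md5()
    for line in sorted(corpus_lines):       # order-independent digest
        md5.update(line.encode("utf-8"))
    return md5.hexdigest()

def corpus_updated(corpus_lines, last_digest):
    """Compare the first (current) digest with the second (stored)
    digest; a mismatch means the corpus was updated and the user's
    language model needs retraining."""
    return corpus_digest(corpus_lines) != last_digest
```

After a retraining pass, the current digest would be stored as the new "second digest value" for the next check.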
  • S702 Update the language model according to the corpus in the corpus corresponding to the user ID, and determine the latest probability score corresponding to the path between each word node in the decoding network.
  • specifically, the language model is updated according to the corpus in the corpus store corresponding to the user ID; the conditional probabilities between the words appearing in that corpus are re-determined according to the updated language model as the latest probability scores of the corresponding word-node paths; and the probability scores, corresponding to the user ID, marked on the paths between the corresponding word nodes in the decoding network are updated accordingly.
  • when the language model corresponding to the user ID is updated, if a usable path is added, the user's ID and the probability score of that path can be added to the corresponding path in the decoding network.
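The in-place update of one user's scores, including adding a newly usable path, might look like the sketch below; the edge format and all score values are hypothetical:

```python
# The shared decoding network stores, for each word-to-word path, a
# dict of {user_id: probability_score}.  Updating one user's language
# model rewrites only that user's entries; the network topology and
# every other user's scores are untouched.
def apply_lm_update(edges, user_id, new_bigram_scores):
    """new_bigram_scores: {(prev_word, next_word): prob} produced by
    the retrained language model (hypothetical format)."""
    for pair, prob in new_bigram_scores.items():
        # Update an existing path, or mark a newly usable path with
        # this user's ID and score.
        edges.setdefault(pair, {})[user_id] = prob

edges = {("in", "Beijing"): {"ID1": 0.40, "ID2": 0.10}}
apply_lm_update(edges, "ID1", {("in", "Beijing"): 0.55,
                               ("in", "Suzhou"): 0.20})
```

No rebuild or decoder restart is involved: the next path search for ID1 simply sees the new scores.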
  • the process of performing voice recognition based on the updated language model corresponding to the user ID is roughly as follows: preprocess the voice signal corresponding to the user ID and extract its acoustic feature vector; input the acoustic feature vector into the acoustic model to obtain a phoneme sequence; based on the phoneme sequence and the user ID, search the decoding network for the highest-scoring path as the optimal path; and determine the text sequence corresponding to the optimal path as the recognition result of the speech signal.
  • the score of a path is determined according to the acoustic scores on the path and the probability scores corresponding to the user ID. Specifically, the decoding score of a path can be calculated by the following formula:
  • Score(L) = Σ_i log SA_i + Σ_j log SL_j,x
  • where L is a decoding path;
  • SA_i is the i-th acoustic score on the path L;
  • SL_j,x is the j-th probability score on the path L corresponding to the user whose user ID is x.
  • for example, the score of the decoding result "My home" corresponding to the user whose user ID is ID1 is (log SA1 + log SA2 + log SA3 + log SL1).
  • each user ID uses the same acoustic score.
  • since the decoding network has been pre-loaded into the decoder, once it is detected that the language model corresponding to a certain user ID needs to be updated, it is only necessary to recalculate the probability scores of the paths between word nodes in the decoding network according to that user's updated language model, and the changes brought by the updated language model are introduced into the decoding network.
  • the decoder then performs a path search using the decoding network with the updated probability scores, and the correct result can be obtained.
  • In the method of the embodiments of the present application, user IDs are marked on the paths of the constructed decoding network. When a certain user's language model needs to be updated, only the probability scores of the paths between word nodes need to be recalculated according to that user's updated language model, and the user's probability scores in the decoding network are updated based on the user IDs marked in the network; the changes brought by the updated language model are thus introduced into the decoding network. The decoder performs the path search over the network with the updated probability scores and produces a result that meets the user's personalized needs. Therefore, only one set of decoders needs to be deployed on the server side to train a dedicated language model for each user and provide personalized speech recognition services, while greatly saving hardware resources.
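The per-user path selection described above can be illustrated with a small sketch. The arc table, user IDs, and function names below are invented for illustration: each arc between word nodes carries a mapping from user ID to that user's probability score, and the search only follows arcs marked with the requesting user's ID.

```python
import math

# Hypothetical arc table: (from_word, to_word) -> {user_id: probability score}.
ARCS = {
    ("在", "北京"): {"ID1": 0.6},
    ("在", "苏州"): {"ID2": 0.5},
    ("在", "江苏"): {"ID2": 0.3, "ID3": 0.7},
}

def usable_arcs(word, user_id):
    """Arcs leaving `word` that are marked with this user's ID,
    together with that user's probability score on each arc."""
    return {dst: scores[user_id]
            for (src, dst), scores in ARCS.items()
            if src == word and user_id in scores}

def best_next_word(word, user_id):
    # Pick the successor word with the highest log probability score.
    candidates = usable_arcs(word, user_id)
    return max(candidates, key=lambda w: math.log(candidates[w]))
```

With this toy table, different user IDs obtain different successors of "在", which is exactly how one shared network yields different recognition results per user.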
  • The method of the embodiments of the present application uses a full vocabulary to construct the decoding network, so that the constructed network can serve multiple users. Moreover, when a language model is updated, the nodes in the decoding network (including word nodes and phoneme nodes) need no reconstruction; that is, there is no need to rebuild the decoding network or restart the decoder. This realizes online updating of the language model, ensures that users can obtain speech recognition services without interruption, and improves the user experience.
  • The paths from each phoneme node in the decoding network to all the word nodes reachable from that phoneme node additionally carry a look-ahead probability for each user ID.
  • For example, the path between phoneme node "b" and word node "北京" (Beijing) is marked with "ID_1" and "LA_1", indicating that on this path the look-ahead probability for user ID_1 is LA_1; "ID_2" and "LA_2" are marked between "s" and "苏州" (Suzhou), meaning that on this path the look-ahead probability for user ID_2 is LA_2; between "j" and "江苏" (Jiangsu) are marked "ID_2", "LA_2", "ID_3", and "LA_3", indicating that on this path the look-ahead probability for user ID_2 is LA_2 and that for user ID_3 is LA_3.
  • Based on the look-ahead probabilities, when searching the decoding network for the word sequence corresponding to a phoneme sequence, the look-ahead probabilities on a path are added to its score; that is, during path search the intermediate score of path L is:

$$\mathrm{Score}'(L)=\sum_{i}\log SA_i+\sum_{j}\log SL_{j,x}+\sum_{n}\log LA_{n,x}$$

  • where SA_i is the i-th acoustic score on path L, SL_{j,x} is the j-th probability score on path L for the user with user ID x, and LA_{n,x} is the n-th look-ahead probability on path L for that user. Introducing the look-ahead probabilities raises the scores of some paths during pruning and prevents them from being clipped. After all candidate paths have been searched, the look-ahead probabilities on each path are subtracted to obtain its final score:

$$\mathrm{Score}(L)=\mathrm{Score}'(L)-\sum_{n}\log LA_{n,x}$$
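The two scoring formulas above can be sketched as follows (the function names are assumptions of this sketch; probabilities are passed as plain lists):

```python
import math

def intermediate_score(acoustic, lm, lookahead):
    """Score used while pruning: acoustic and language-model terms
    plus the look-ahead terms that keep promising paths alive."""
    return (sum(math.log(s) for s in acoustic)
            + sum(math.log(s) for s in lm)
            + sum(math.log(s) for s in lookahead))

def final_score(acoustic, lm, lookahead):
    """Final path score: the look-ahead terms are subtracted again,
    leaving only the acoustic and language-model contributions."""
    return (intermediate_score(acoustic, lm, lookahead)
            - sum(math.log(s) for s in lookahead))
```

Because the look-ahead terms cancel in `final_score`, they influence only which paths survive pruning, not the final ranking of surviving paths.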
  • The look-ahead probability corresponding to each user ID can be calculated by the following formula:

$$LA(s)=\max_{w\in W(s)}p(w\mid h)$$

  • where W(s) is the set of words corresponding to the word nodes reachable from phoneme node s in the decoding network, h is the corpus used to train the language model corresponding to the user ID, and p(w|h) is the appearance-frequency score of word w in the set W(s), which characterizes how frequently w appears in the corpus corresponding to the user ID.
  • In this embodiment, the word nodes in the decoding network that correspond to the words in W(s) are called the target word nodes of phoneme node s.
  • As a possible implementation, the appearance-frequency score of each word node is determined as follows: determine the frequency with which the word nodes corresponding to the corpus of the user ID appear in that corpus; then, for each such word node, normalize its frequency to obtain its appearance-frequency score.
  • The appearance-frequency score of each word node takes a value in the range [0, 1].
  • For example, taking node "k" as the starting point of a path, the set of words corresponding to the reachable target word nodes is {卡 (card), 卡车 (truck), 开 (open), 开门 (open the door), 凯旋 (triumph), 科 (section), 课 (class)}. Based on the corpus corresponding to the user ID, count the frequency of each word in this set in the corpus, normalize the frequencies, and obtain the appearance-frequency scores p(卡|h), p(卡车|h), p(开|h), p(开门|h), p(凯旋|h), p(科|h), and p(课|h). The largest of these scores is taken as the look-ahead probability, for that user ID, of every path from node "k" to these word nodes, so that paths under node "k" with lower acoustic scores are not clipped during decoding.
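The frequency counting, normalization, and maximum-taking described above can be sketched as follows (the function name and the toy corpus are assumptions for illustration):

```python
from collections import Counter

def lookahead_for_node(corpus_tokens, reachable_words):
    """Normalized appearance-frequency score for each word reachable
    from a phoneme node, plus their maximum, which is used as the
    look-ahead probability on every path from that node."""
    counts = Counter(t for t in corpus_tokens if t in reachable_words)
    total = sum(counts.values())
    scores = {w: (counts[w] / total if total else 0.0)
              for w in reachable_words}
    return max(scores.values()), scores
```

Normalization keeps every score in [0, 1], matching the range stated above.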
  • Accordingly, after it is determined that the language model needs to be updated, the model update method of the embodiments of the present application further includes the following steps: obtain the appearance-frequency score of each word node for the user ID according to the frequency of each word node of the decoding network in the corpus corresponding to the user ID; for each phoneme node in the decoding network, select the maximum of the appearance-frequency scores, for that user ID, of the phoneme node's target word nodes, and determine it as the latest look-ahead probability for the paths from the phoneme node to its target word nodes; and, according to the latest look-ahead probability, update the look-ahead probabilities corresponding to the user ID on the paths from the phoneme node to the target word nodes in the decoding network.
  • Further, obtaining the appearance-frequency score of each word node includes: determining the frequency with which the word nodes corresponding to the corpus of the user ID appear in that corpus, and normalizing the frequency of each such word node to obtain its appearance-frequency score.
  • Likewise, when updating the look-ahead probabilities corresponding to each user ID in the decoding network, there is no need to modify the nodes (including word nodes and phoneme nodes) of the network. Once it is detected that the language model corresponding to a certain user ID needs to be updated, only the look-ahead probabilities of the paths from each phoneme node to its target word nodes need to be recalculated according to the updated language model; the changes brought by the updated model are thereby introduced into the decoding network, preventing paths with lower acoustic scores from being clipped during path pruning. The decoder then performs the path search with the updated look-ahead probabilities and can produce the correct result.
  • The speech recognition method of the embodiments of the present application can be used to recognize any language, such as Chinese, English, Japanese, or German. The description here mainly takes the recognition of Chinese speech as an example; the methods for other languages are similar, and the embodiments of the present application do not illustrate them one by one.
  • an embodiment of the present application further provides a speech recognition device 80, which includes an acquisition module 801, a decoding module 802, and a determination module 803.
  • the obtaining module 801 is used to obtain the input voice and the user ID corresponding to the input voice.
  • the decoding module 802 is configured to search for the optimal path corresponding to the input voice in the decoding network according to the user ID, and the path between the word nodes in the decoding network is marked with the user ID.
  • the determining module 803 is configured to determine text information corresponding to the input voice according to the optimal path.
  • the decoding module 802 is specifically configured to determine the optimal path corresponding to the input voice according to the probability score corresponding to the user ID marked by the path between each word node in the decoding network.
  • the decoding module 802 is specifically configured to: obtain the language model corresponding to the user ID according to the user ID; according to the language model corresponding to the user ID, search for the optimal path corresponding to the input voice in the decoding network.
  • the decoding network is constructed based on a full dictionary.
  • the speech recognition device 80 of the embodiment of the present application further includes a model update module, which is used to: determine that the language model corresponding to the user ID needs to be updated; update the language model according to the corpus in the corpus corresponding to the user ID, and determine the decoding network The latest probability score corresponding to the path between each word node; according to the latest probability score, the probability score corresponding to the user ID marked by the path between the corresponding word nodes in the decoding network is updated.
  • The model update module is specifically configured to detect whether the corpus corresponding to the user ID has been updated and, if it has, determine that the language model corresponding to the user ID needs to be updated.
  • The model update module is specifically configured to: calculate a first digest value of all corpora in the corpus corresponding to the user ID; and compare the first digest value with a second digest value, confirming that the corpus corresponding to the user ID has been updated if they differ, where the second digest value is the digest value of all corpora in that corpus after the most recent update.
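The digest comparison described above can be sketched with Python's standard `hashlib` (the helper names and the choice to sort entries before hashing are assumptions of this sketch):

```python
import hashlib

def corpus_digest(corpus_texts):
    """Digest over all corpus entries for one user ID; entries are
    sorted so the same set of texts always yields the same digest."""
    h = hashlib.md5()
    for text in sorted(corpus_texts):
        h.update(text.encode("utf-8"))
    return h.hexdigest()

def corpus_updated(corpus_texts, last_digest):
    """True when the freshly computed first digest differs from the
    second digest stored after the most recent model update."""
    return corpus_digest(corpus_texts) != last_digest
```

After each model update, the newly computed digest would be stored as the second digest for the next check.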
  • The model update module is further configured to: obtain the appearance-frequency score of each word node for the user ID according to the frequency of each word node of the decoding network in the corpus corresponding to the user ID; for each phoneme node in the decoding network, select the maximum of the appearance-frequency scores, for the user ID, of the phoneme node's target word nodes, and determine it as the latest look-ahead probability for the paths from the phoneme node to each target word node; and update, according to the latest look-ahead probability, the look-ahead probabilities corresponding to the user ID on the paths from the phoneme node to the target word nodes in the decoding network.
  • The model update module is specifically configured to determine the frequency with which the word nodes corresponding to the corpus of the user ID appear in that corpus and, for each such word node, normalize its frequency to obtain its appearance-frequency score.
  • the voice recognition device provided in the embodiment of the present application adopts the same inventive concept as the above-mentioned voice recognition method, and can achieve the same beneficial effects, which will not be repeated here.
  • The electronic device may be a controller of a smart device (such as a robot or a smart speaker), a desktop computer, a portable computer, a smartphone, a tablet computer, a personal digital assistant (PDA), a server, etc.
  • the electronic device 90 may include a processor 901, a memory 902, and a transceiver 903.
  • the transceiver 903 is used to receive and send data under the control of the processor 901.
  • the memory 902 may include a read only memory (ROM) and a random access memory (RAM), and provides the processor with program instructions and data stored in the memory.
  • the memory may be used to store the program of the voice recognition method.
  • The processor 901 may be a CPU (Central Processing Unit), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a CPLD (Complex Programmable Logic Device). By calling the program instructions stored in the memory, the processor implements the speech recognition method of any of the foregoing embodiments according to the obtained program instructions.
  • the embodiment of the present application provides a computer-readable storage medium for storing computer program instructions used for the above-mentioned electronic device, which includes a program for executing the above-mentioned voice recognition method.
  • The above computer storage medium may be any available medium or data storage device that the computer can access, including but not limited to magnetic storage (such as floppy disks, hard disks, magnetic tape, and magneto-optical disks (MO)), optical storage (such as CD, DVD, BD, and HVD), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD)).


Abstract

A speech recognition method and apparatus, an electronic device, and a storage medium are provided. The method includes: acquiring input speech and the user ID corresponding to the input speech (S201); searching, according to the user ID, a decoding network for the optimal path corresponding to the input speech, where the paths between word nodes in the decoding network are marked with user IDs (S202); and determining the text information corresponding to the input speech according to the optimal path (S203). Based on a single decoding network, the speech recognition method can provide users with personalized speech recognition services while greatly saving hardware resources.

Description

语音识别方法、装置、电子设备及存储介质
相关申请的交叉引用
本申请要求在2019年01月30日提交中国专利局、申请号为201910094102.7、申请名称为“语音识别方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音识别技术领域,尤其涉及一种语音识别方法、装置、电子设备及存储介质。
背景技术
语音识别***中主要包含一套声学模型、语言模型和解码器。语音识别的准确度主要依赖于语言模型,随着用户个性化需要越来越高,需要为不同的用户训练不同的语言模型,以提供专有的语音识别服务。目前,个性化语言模型的训练方法都是利用用户自身的语料对通用语言模型进行训练,以生成用户专有的语言模型,并针对每个用户部署一套专门的语音识别服务,通过周期性更新语言模型来满足用户个性化需求。
发明内容
本申请实施例提供一种语音识别方法、装置、电子设备及存储介质,以解决现有技术中为满足用户个性化定制的需求,需要为每个用户部署一套专门的语音识别服务,造成资源的严重浪费的问题。
第一方面,本申请一实施例提供了一种语音识别方法,包括:
获取输入语音以及输入语音对应的用户ID;
根据用户ID,在解码网络中,搜索输入语音对应的最优路径,解码网络 中各词节点之间的路径标记有用户ID;
根据最优路径确定输入语音对应的文本信息。
可选地,所述根据所述用户ID,在解码网络中,搜索所述输入语音对应的最优路径,包括:
根据所述解码网络中各词节点之间的路径标记的所述用户ID对应的概率分值,确定所述输入语音对应的最优路径。
可选地,所述根据所述用户ID,在解码网络中,搜索所述输入语音对应的最优路径,包括:
根据所述用户ID,获取所述用户ID对应的语言模型;
根据所述用户ID对应的语言模型,在所述解码网络中,搜索所述输入语音对应的最优路径。
可选地,所述解码网络是基于全量词典构建得到的。
可选地,通过如下方式更新所述用户ID对应的语言模型:
确定所述用户ID对应的语言模型需要更新;
根据所述用户ID对应的语料库中的语料,更新所述语言模型,并确定所述解码网络中各词节点之间的路径对应的最新概率得分;
根据所述最新概率得分,更新所述解码网络中对应的词节点之间的路径标记的所述用户ID对应的概率得分。
可选地,所述确定所述用户ID对应的语言模型需要更新,包括:
检测所述用户ID对应的语料库是否有更新;
若所述用户ID对应的语料库有更新,确定所述用户ID对应的语言模型需要更新。
可选地,所述检测所述用户ID对应的语料库是否有更新,包括:
计算所述用户ID对应的语料库中的所有语料的第一摘要值;
将所述第一摘要值与第二摘要值进行比较,若不相同,则确认所述用户ID对应的语料库有更新,所述第二摘要值为最近一次更新后所述用户ID对应的语料库中所有语料的摘要值。
可选地,在确定所述用户ID对应的语言模型需要更新之后,还包括:
根据所述解码网络中各词节点在所述用户ID对应的语料库中出现的频率,得到各个词节点对应所述用户ID的出现频率分值;
针对所述解码网络中的每个音素节点,选择所述音素节点对应的目标词节点对应所述用户ID的出现频率分值中的最大值,确定为所述音素节点到所述各目标词节点的路径对应所述用户ID的最新前瞻概率;
根据所述最新前瞻概率,更新所述解码网络中的音素节点到目标词节点的路径的与所述用户ID对应的前瞻概率。
可选地,根据所述解码网络中各词节点在所述用户ID对应的语料库中出现的频率,得到各个词节点对应的出现频率分值,包括:
确定所述解码网络中与所述用户ID对应的语料库中的语料对应的词节点在所述语料库中出现的频率;
针对所述语料库中的语料对应的词节点,对该词节点的频率进行归一化,得到该词节点对应的出现频率分值。
第二方面,本申请一实施例提供了一种语音识别装置,包括:
获取模块,用于获取输入语音以及输入语音对应的用户ID;
解码模块,用于根据用户ID,在解码网络中,搜索输入语音对应的最优路径,解码网络中各词节点之间的路径标记有用户ID;
确定模块,用于根据最优路径确定输入语音对应的文本信息。
可选地,所述解码模块具体用于:根据所述解码网络中各词节点之间的路径标记的所述用户ID对应的概率分值,确定所述输入语音对应的最优路径。
可选地,所述解码模块具体用于:
根据所述用户ID,获取所述用户ID对应的语言模型;
根据所述用户ID对应的语言模型,在所述解码网络中,搜索所述输入语音对应的最优路径。
可选地,所述解码网络是基于全量词典构建得到的。
可选地,还包括模型更新模块,用于:
确定所述用户ID对应的语言模型需要更新;
根据所述用户ID对应的语料库中的语料,更新所述语言模型,并确定所述解码网络中各词节点之间的路径对应的最新概率得分;
根据所述最新概率得分,更新所述解码网络中对应的词节点之间的路径标记的所述用户ID对应的概率得分。
可选地,所述模型更新模块具体用于:
检测所述用户ID对应的语料库是否有更新;
若所述用户ID对应的语料库有更新,确定所述用户ID对应的语言模型需要更新。
可选地,所述模型更新模块具体用于:
计算所述用户ID对应的语料库中的所有语料的第一摘要值;
将所述第一摘要值与第二摘要值进行比较,若不相同,则确认所述用户ID对应的语料库有更新,所述第二摘要值为最近一次更新后所述用户ID对应的语料库中所有语料的摘要值。
可选地,所述模型更新模块还用于:
根据所述解码网络中各词节点在所述用户ID对应的语料库中出现的频率,得到各个词节点对应所述用户ID的出现频率分值;
针对所述解码网络中的每个音素节点,选择所述音素节点对应的目标词节点对应所述用户ID的出现频率分值中的最大值,确定为所述音素节点到所述各目标词节点的路径对应所述用户ID的最新前瞻概率;
根据所述最新前瞻概率,更新所述解码网络中的音素节点到目标词节点的路径的与所述用户ID对应的前瞻概率。
可选地,所述模型更新模块具体用于:
确定所述解码网络中与所述用户ID对应的语料库中的语料对应的词节点在所述语料库中出现的频率;
针对所述语料库中的语料对应的词节点,对该词节点的频率进行归一化,得到该词节点对应的出现频率分值。
第三方面,本申请一实施例提供了一种电子设备,包括收发机、存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其中,收发机用于在处理器的控制下接收和发送数据,处理器执行程序时实现上述任一种方法的步骤。
第四方面,本申请一实施例提供了一种计算机可读存储介质,其上存储有计算机程序指令,该程序指令被处理器执行时实现上述任一种方法的步骤。
第五方面,本申请还提供了一种计算机程序产品,所述计算机程序产品包括存储在计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,所述程序指令被处理器执行时实现上述任一语音识别方法的步骤。
本申请实施例提供的技术方案,在构建的解码网络中各词节点之间的路径上标记用户ID,使得在利用解码网络识别语音的过程中,能够根据用户ID,仅搜索标记有该用户ID的路径,在从搜索到的多条路径中选出最优路径,根据最优路径确定输入语音对应的文本信息,使得不同用户能够基于同一解码网络得到不同的识别结果。因此,在服务器端仅需部署一套解码网络,该解码网络融合了多个用户专属的语言模型,能够为多个用户提供个性化的语音识别服务,同时节省了硬件资源。
附图说明
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例中所需要使用的附图作简单地介绍,显而易见地,下面所介绍的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的语音识别方法的应用场景示意图;
图2为本申请一实施例提供的语音识别方法的流程示意图;
图3为本申请实施例提供的解码网络中局部网络的一个示例;
图4为本申请实施例提供的解码网络中词节点间的路径的一个示例;
图5为本申请实施例提供的解码网络中局部网络的另一个示例;
图6为本申请实施例提供的基于多个用户的语言模型构建的解码网络中局部网络的一个示例;
图7为本申请实施例提供的更新一个用户ID对应的语言模型的方法的流程示意图;
图8为本申请一实施例提供的语音识别装置的结构示意图;
图9为本申请一实施例提供的电子设备的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
为了方便理解,下面对本申请实施例中涉及的名词进行解释:
语言模型(Language Model,LM)的目的是建立一个能够描述给定词序列在语言中的出现的概率的分布。也就是说,语言模型是描述词汇概率分布的模型,一个能可靠反应语言识别时用词的概率分布的模型。语言模型在自然语言处理中占有重要的地位,在语音识别、机器翻译等领域得到了广泛应用。例如,利用语言模型能够得到语音识别多种词序列中可能性最大的一个词序列,或者给定若干词,预测下一个最可能出现的词语等。常用的语言模型包括N-Gram LM(N元语言模型)、Big-Gram LM(二元语言模型)、Tri-Gram LM(三元语言模型)。
声学模型(AM,Acoustic model)是语音识别***中最为重要的部分之一,是把语音的声学特征分类对应到音素的模型。目前的主流***多采用隐马尔科夫模型进行建模。
词典是字词对应的音素集合,描述了词汇和音素之间的映射关系。
音素(phone),是语音中的最小的单位,依据音节里的发音动作来分析,一个动作构成一个音素。汉语中的音素分为声母、韵母两大类,例如,声母 包括:b、p、m、f、d、t、等,韵母包括:a、o、e、i、u、ü、ai、ei、ao、an、ian、ong、iong等。英语中的音素分为元音、辅音两大类,例如,元音有a、e、ai等,辅音有p、t、h等。
前瞻概率(look-ahead probability):为了在解码的中间过程中不会裁剪掉声学得分较低的路径,一般采取将基于语言模型得到的表征各个词出现的频率的出现概率分值分解至树杈的技术即语言模型look-ahead技术,即在解码网络中音素节点到词节点的路径上就引入词节点对应的出现概率分值,并且将出现概率分值中的最大值作为音素节点到所有能够到达的词节点的路径上的前瞻概率,在计算音素节点到词节点的路径的得分时,将前瞻概率增加到该路径的得分中,这样可显著提高一些声学得分较低但概率得分较高的路径的得分,以避免剪枝过程中剪去这类路径。
附图中的任何元素数量均用于示例而非限制,以及任何命名都仅用于区分,而不具有任何限制含义。
在具体实践过程中,个性化语言模型的训练方法都是利用用户自身的语料对通用语言模型进行训练,以生成用户专有的语言模型,并针对每个用户部署一套专门的语音识别服务,通过周期性更新语言模型来满足用户个性化需求。但是,为每个用户部署一套专门的语音识别服务的方式,会造成资源的严重浪费,产生巨大的开销。
为此,本申请的发明人考虑到,在构建的解码网络中各词节点之间的路径上标记用户ID,使得在利用解码网络识别语音的过程中,能够根据用户ID,仅搜索标记有该用户ID的路径,在从搜索到的多条路径中选出最优路径,根据最优路径确定输入语音对应的文本信息,使得不同用户能够基于同一解码网络得到不同的识别结果。因此,在服务器端仅需部署一套解码网络,该解码网络融合了多个用户专属的语言模型,能够为多个用户提供个性化的语音识别服务,同时节省了硬件资源。
此外,采用全量词表构建解码网络,使得构建的解码网络能够适用于多个用户,在添加新用户时,不需要重新构建解码网络,也就不需要重启解码 器,从而实现了在线新添加新用户,保证用户能不间断地获取到语音识别服务,提高用户体验。基于全量词表构建的解码网络,还能够实现在线更新各个用户对应的语言模型,当某一用户的语言模型需要更新时,只需要根据该用户更新后的语言模型重新计算解码网络中词节点间路径的概率得分,并基于解码网络中的用户ID更新该用户在解码网络中的概率得分,就可以将更新后的语言模型带来的变化引入解码网络,解码网络通过更新概率得分后的解码网络进行路径搜索,从而得到符合该用户个性化需求的识别结果。因此,在服务器端仅需部署一套解码器,即可为各个用户训练出其专属的语言模型,为用户提供个性化的语音识别服务,并且实现了语言模型的在线更新,及时更新用户的语言模型,并保证用户能不间断地获取到语音识别服务,提高用户体验。
在介绍了本申请的基本原理之后,下面具体介绍本申请的各种非限制性实施方式。
首先参考图1,其为本申请实施例提供的语音识别方法的应用场景示意图。多个用户10共同使用同一服务器12中的解码器提供的语音识别服务。用户10与智能设备11交互过程中,智能设备11将用户10输入的语音信号发送给服务器12,服务器12通过解码器中的解码网络对语音信号进行解码处理,得到语音信号对应的文本信息,并将解码得到的文本信息反馈给智能设备11,完成语音识别服务。
这种应用场景下,智能设备11和服务器12之间通过网络进行通信连接,该网络可以为局域网、广域网等。智能设备11可以为智能音箱、机器人等,也可以为便携设备(例如:手机、平板、笔记本电脑等),还可以为个人电脑(PC,Personal Computer),服务器12可以为任何能够提供语音识别服务的服务器设备。
下面结合图1所示的应用场景,对本申请实施例提供的技术方案进行说明。
参考图2,本申请实施例提供一种语音识别方法,包括以下步骤:
S201、获取输入语音以及输入语音对应的用户ID。
具体实施时,可由智能终端将采集到的输入语音以及用户ID发送给服务器,由服务器根据用户ID对输入语音进行语音识别。本实施例中,一个用户ID对应一个语言模型,并利用各个用户ID对应的语料库中的语料,训练各个用户专用的语言模型。
本实施例中的用户ID可以企业级的,即用户ID用于标识一个不同的企业,一个企业对应的一个语言模型,该企业下的智能设备使用一个语言模型。用户ID还可以是设备级的,即用户ID用于标识一类或一个设备,一类设备或一个设备对应一个语言模型,例如智能音箱对应一个关于音乐的语言模型,聊天机器人对一个关于聊天的语言模型,这样不同的设备可使用同一解码网络。用户ID还可以是业务级的,即不同业务对应一个语言模型,该业务下的智能设备使用一个语言模型。等等。本申请实施例中不对用户ID的具体实现进行限定,可根据实际应用场景或需求进行配置。
S202、根据用户ID,在解码网络中,搜索输入语音对应的最优路径,解码网络中各词节点之间的路径标记有用户ID。
本实施例中,多个用户ID共同使用一个解码网络。解码网络为表示音素与词以及词与词之间关系的网络图。
为实现多个用户共用一个解码网络,可基于声学模型以及这多个用户对应的语料库和语言模型来构建解码网络,具体构建方法如下:
第一步,基于各用户ID对应的语料库中的语料,得到包含语料库中所有词汇的词典,把词典中的词汇转换为音素串,例如,“开”的音素串为“k-ai”,“北京”的音素串为“b-ei-j-ing”,一个词汇的音素串以及该词汇组成一条路径,例如,“开”对应的路径为“k-ai-开”,“北京”对应的路径为“b-ei-j-ing-北京”。
第二步,对词典中所有词汇对应的路径中的节点进行合并,即将各路径中相同的音素合并为一个节点,以将所有词汇对应的音素串组成一个网络,一个音素作为该网络中的一个音素节点。
图3给出了解码网络中局部网络的一个示例。其中,“卡”、“开”、“科”等词的音素串中的“k”合并为一个网络中的一个节点。网络中每条路径的最后一个节点对应该条路径上的音素组成的音素串对应的词汇,如图3中,“k-a-卡”对应的词汇为“卡”,“k-a-ch-e-卡车”对应的词汇为“卡车”。
为描述方便,本实施例中,将解码网络中的音素对应的节点称为音素节点,将词汇对应的节点称为词节点。
由于大量相同的节点被合并在一起,因此可以显著降低搜索空间的规模,减少解码过程的运算量。基于词典生成解码网络的方法为现有技术,不再赘述。
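The node merging of the first two steps above (combining identical leading phonemes of all pronunciations into shared nodes) can be sketched as a prefix tree; the toy lexicon and the `#words` marker below are invented for illustration and are not the patent's data structures:

```python
def build_prefix_tree(lexicon):
    """Merge the phoneme strings of a lexicon into a shared prefix
    tree: identical leading phonemes become a single node, and each
    complete phoneme string ends in a word node ("#words")."""
    root = {}
    for word, phonemes in lexicon.items():
        node = root
        for p in phonemes:
            node = node.setdefault(p, {})
        node.setdefault("#words", []).append(word)
    return root

# Toy lexicon: "开" -> k-ai, "卡" -> k-a, "卡车" -> k-a-ch-e.
TREE = build_prefix_tree({
    "开": ["k", "ai"],
    "卡": ["k", "a"],
    "卡车": ["k", "a", "ch", "e"],
})
```

Sharing the common prefixes ("k", then "k-a") is what shrinks the search space, as described above.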
第三步,根据声学模型确定上述第二步中构建的解码网络中相连的音素节点间的声学得分。
本实施例中,多个用户可共用一个声学模型。
第四步,针对各用户ID,根据该用户ID的语言模型,确定词典中词和词之间的连接关系和概率得分,根据连接关系在上述第二步中构建的解码网络中建立词与词之间的连接路径,并在词节点之间的路径上标记用户ID以及该用户的概率得分。
具体实施时,根据语言模型能够确定在一个词W 1之后出现另一个词W 2的条件概率p(W 2|W 1),将条件概率p(W 2|W 1)作为从词W 1到W 2的概率得分。
例如,训练语言模型的语料中包括“我家在北京”,语料中的词汇包括“我”、“家”、“在”、“北京”,则在解码网络中,词节点“我”和“家”之间相连,“家”和“在”之相连,“在”和“北京”之间建立连接,再根据语言模型确定“我”和“家”、“家”和“在”、“在”和“北京”之间的概率得分。如图4为解码网络中词节点间的路径的一个示例,图4中隐去了音素节点和词节点间的网络关系。需要说明的是,解码网络中词节点和词节点之间实际的连接方式如图5所示,词节点“我”与“家”的第一个音素节点连接,SA 1、SA 2、SA 3表示声学得分,SL 1表示用户ID 1对应的词节点“我”到“家”的路径的 概率得分,SL 2表示用户ID 2对应的词节点“我”到“家”的路径的概率得分。
通过第四步,将各用户ID的概率得分标记到解码网络中对应的路径上,使得解码时,能够根据用户ID,选择该用户对应的路径,并基于对应路径上的概率得分,确定输入语音的最优路径。
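The fourth step above, which marks each user's conditional probability p(W_2|W_1) on the word-to-word paths, can be sketched as follows; the simple bigram count is a minimal stand-in for a real language model, and all names in the sketch are assumptions:

```python
from collections import Counter

def mark_user_scores(arcs, user_id, corpus_sentences):
    """Estimate bigram conditional probabilities p(w2|w1) from one
    user's corpus and record them on the shared arc table under that
    user's ID, leaving other users' scores untouched."""
    pairs = Counter()
    firsts = Counter()
    for sentence in corpus_sentences:
        for w1, w2 in zip(sentence, sentence[1:]):
            pairs[(w1, w2)] += 1
            firsts[w1] += 1
    for (w1, w2), n in pairs.items():
        arcs.setdefault((w1, w2), {})[user_id] = n / firsts[w1]
    return arcs
```

Because each score is stored under its own user ID, many users' language models can coexist on the same set of arcs.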
通过上述四个步骤就可以得到可供多个用户共同使用的一个解码网络。将构建好的解码网络预先加载到服务器的解码器中,即可为这多个用户提供语音识别服务。
S203、根据最优路径确定输入语音对应的文本信息。
基于上述任一实施例,语音识别的过程包括:对语音信号进行预处理,提取语音信号的声学特征向量,然后,将声学特征向量输入声学模型,得到音素序列;基于音素序列和语音信号对应的用户ID,在解码网络中搜索一条得分最高的路径作为最优路径,将最优路径对应的文字序列确定为该语音信号的识别结果。其中,根据各条路径的总得分确定最优路径,路径的总得分根据路径上的声学得分和用户ID对应的概率得分确定,具体可通过以下公式计算一条路径上的解码得分:
$$\mathrm{Score}(L)=\sum_{i}\log SA_i+\sum_{j}\log SL_{j,x}$$
其中,L为一条解码路径,SA i为路径L上的第i个声学得分,SL j,x为路径L上的用户ID为x的用户对应的第j个概率得分。以图5为例,用户ID 1对应的解码结果“我家”的得分为(logSA 1+logSA 2+logSA 3+log SL 1)。
本申请实施例的方法,在解码网络中各词节点之间的路径上标记了用户ID,在搜索路径时,根据路径上的用户ID选择该用户可使用的路径,使得不同用户能够基于同一解码网络得到不同的识别结果。参考图6,为基于多个用户的语言模型生成的解码网络的局部示例,由于篇幅限制,图6中部分音素节点未示出。以图6为例,在对用户ID 1的语音信号进行识别时,词节点“在”和“北京”之间的路径标记有“ID 1”,此时,选择的路径是“在-北京”,而不会选择图6中的其它两条路径;在对用户ID 2的语音信号进行识别时,选择的 路径是“在-苏州”和“在-江苏”这两条标记有ID 2的路径。
因此,本申请实施例的语音识别方法,在服务器端仅需部署一套解码网络,该解码网络融合了多个用户专属的语言模型,能够为多个用户提供个性化的语音识别服务,同时节省了硬件资源。
作为一种可能的实现方式,步骤S202具体包括:根据解码网络中各词节点之间的路径标记的用户ID对应的概率分值,确定输入语音对应的最优路径。
具体地,根据不同用户的语言模型会得到不同的概率得分,对同一路径来说,不同的概率得分会导致出现完全不同的识别结果。因此,本申请实施例在解码网络中利用用户ID对不同用户的概率得分进行区分,使得多个用户能共用一个解码网络。解码时,根据当前使用解码网络的用户的用户ID,取解码网络路径上标记有该用户ID的概率得分计算各条路径的总得分,选择总得分最高的路径作为最优路径,基于最优路径上的词节点对应的词汇,得到语音识别结果。参考图6,“在”和“北京”之间标注有“ID 1”和“SL 1”,表示解码时只有用户ID 1可以使用该路径,且对应的概率得分为SL 1;“在”和“苏州”之间标注有“ID 2”和“SL 2”,表示解码时只有用户ID 2可以使用该路径,且对应的概率得分为SL 2;“在”和“江苏”之间标注有“ID 2”、“SL 2”、“ID 3”、“SL 3”,表示解码时用户ID 2和ID 3都使用该路径,且用户ID 2通过该路径时的概率得分为SL 2,用户ID 3通过该路径时的概率得分为SL 3
作为一种可能的实现方式,步骤S202具体包括:根据用户ID,在解码网络中,搜索输入语音对应的最优路径,包括:根据用户ID,获取用户ID对应的语言模型;根据用户ID对应的语言模型,在解码网络中,搜索输入语音对应的最优路径。
具体实施时,每个用户ID对应一个语言模型,该语言模型是基于用户ID对应的语料库中的语料训练得到的,基于输入语音对应的用户ID获取到用户ID对应的语言模型,利用用户ID对应的语言模型,在解码网络中,搜索输入语音对应的最优路径,为不同用户提供个性化的语音识别服务。由于在进行语音识别服务的时候,会提前根据用户ID将其独有的语言模型加载到解码器 中,而其他用户ID的语言模型无法加载到解码器中,以此来达到多个用户共用一套通用解码网络,而又保持自己特色的语言模型的服务方式。
在上述任一实施例的基础上,为了使得构建的解码网络能够适用于更多的用户,本申请实施例采用全量词典构建多个用户共享的解码网络。
本申请实施例中的全量词典为包含大量常用词汇的词典。具体实施时,全量词典包含的词汇的数量在10万以上,能够涵盖多个领域不同的主题,全量词典中的词汇包括字和词语。全量词典能够覆盖所有用户ID对应的语料库中包含的词汇。
基于全量词典构建多个用户共享的解码网络的方法,与上述基于多个用户对应的语料库构建解码网络的方法类似,不再赘述。
当有新的用户需要使用解码网络时,只需要根据该用户对应的语料库中的语料训练通用语言模型,得到该用户专属的语言模型,然后,根据该用户的语言模型,确定解码网络中各词节点之间的路径对应的概率得分,在解码网络中各词节点之间的路径上,标记该用户的用户ID和对应的概率得分。
本申请实施例的方法,采用全量词典构建解码网络,使得构建的解码网络能够适用于更多用户,此外,在添加新用户时,解码网络中的节点(包括词节点和音素节点)不需要重构,即,不需要重新构建解码网络,也就不需要重启解码器,从而实现了在线新添加新用户,保证用户能不间断地获取到语音识别服务,提高用户体验。
基于上述任一实施例,如图7所示,基于全量词典构建的解码网络,本申请实施例可通过如下步骤更新每个用户ID对应的语言模型:
S701、确定用户ID对应的语言模型需要更新。
进一步地,可通过如下步骤确定用户ID对应的语言模型需要更新:检测用户ID对应的语料库是否有更新;若用户ID对应的语料库有更新,确定用户ID对应的语言模型需要更新。
具体实施时,收集各个用户ID对应的语料,并将语料存储到该用户ID对应的语料库中,例如,针对智能音箱,可收集音乐相关的语料;对于个人 用户,可收集该用户使用智能设备时输入的语料,存储到该用户的语料库中,以不断更新该用户的语言模型,提高语音识别的准确度。可定时或周期性检测各个用户ID对应的语料库中的语料是否有更新,若检测到某一用户ID对应的语料库中的语料有更新,则利用该用户ID对应的语料库中的语料对该用户ID对应的语言模型进行训练,以更新该用户ID对应的语言模型。其中,检测的时间或检测周期可根据实际情况进行设置,本实施例不作限定。通过设置定时或周期性检测的任务,能够定时检测语料库是否有更新,并及时更新语言模型,使得模型更新的过程更加自动化,节省了人力。
作为一种可能的实现方式,可通过如下步骤检测语料库中的语料是否有更新:计算用户ID对应的语料库中的所有语料的第一摘要值;将第一摘要值与第二摘要值进行比较,若第一摘要值与第二摘要值不相同,则确认用户ID对应的语料库有更新;若第一摘要值与第二摘要值相同,则确认用户ID对应的语料库未更新,不需要更新该用户ID对应的语言模型。其中,第二摘要值为最近一次更新后用户ID对应的语料库中所有语料的摘要值。
具体实施时,可采用MD5消息摘要算法(MD5 Message-Digest Algorithm)生成语料库中所有语料的摘要值。每次更新完一个用户ID对应的语言模型后,可存储该用户ID对应的语料库的第一摘要值,作为下一次检测该语料库是否有更新时使用的第二摘要值。
S702、根据用户ID对应的语料库中的语料,更新语言模型,并确定解码网络中各词节点之间的路径对应的最新概率得分。
S703、根据最新概率得分,更新解码网络中对应的词节点之间的路径标记的用户ID对应的概率得分。
具体实施时,根据用户ID对应的语料库中的语料更新语言模型,并根据更新后的语言模型重新确定用户ID对应的语料库中出现的各个词之间的条件概率,作为对应的各词节点之间的路径对应的最新概率得分,根据最新概率得分,更新解码网络中对应的词节点之间的路径标记的用户ID对应的概率得分。当用户ID对应的语言模型更新后,若新增了一条可使用的路径,则可在 解码网络对应的路径上增加该用户的用户ID和该路径对应的概率得分。以图6为例,若用户ID 1的语言模型更新后,新增了“在”到“苏州”的路径,则在“在”到“苏州”的路径标记上该用户的ID 1以及对应的概率得分。
基于上述任一实施例,基于用户ID对应的更新后的语言模型进行语音识别的过程大致为:对用户ID对应的语音信号进行预处理,提取该语音信号的声学特征向量,然后,将声学特征向量输入声学模型,得到音素序列;基于音素序列,根据用户ID,在解码网络中搜索一条得分最高的路径作为最优路径,最优路径对应的文字序列确定为该语音信号的识别结果。
其中,路径的得分根据路径上的声学得分和用户ID对应的概率得分确定,具体可通过以下公式计算一条路径上的解码得分:
$$\mathrm{Score}(L)=\sum_{i}\log SA_i+\sum_{j}\log SL_{j,x}$$
其中,L为一条解码路径,SA i为路径L上的第i个声学得分,SL j,x为路径L上用户ID为x的第j个概率得分。以图5为例,用户ID为ID 1的用户对应的解码结果“我家”的得分为(logSA 1+logSA 2+logSA 3+log SL 1)。本实施例中,由于各用户ID使用同一声学模型,因此,每个用户ID使用相同的声学得分。
由于已经预先将解码网络预先加载到解码器中,一旦检测到需要更新某一用户ID对应的语言模型,只需要根据用户ID对应的更新后的语言模型重新计算解码网络中各词节点间路径上的概率得分,就可以将更新后的语言模型带来的变化引入解码网络,解码器利用更新概率得分后的解码网络进行路径搜索,就可以解出正确结果。
本申请实施例的方法,在构建的解码网络的路径上标记有用户ID,当某一用户的语言模型需要更新时,只需要根据该用户ID对应的更新后的语言模型重新计算解码网络中词节点间路径的概率得分,并基于解码网络中的用户ID更新该用户在解码网络中的概率得分,就可以将更新后的语言模型带来的变化引入解码网络,解码器通过更新概率得分后的解码网络进行路径搜索,从而解出符合该用户个性化需求的结果,因此,在服务器端仅需部署一套解 码器,即可为各个用户训练出其独有的语言模型,为用户提供个性化的语音识别服务,同时大大节省了硬件资源。
本申请实施例的方法,采用全量词表构建解码网络,使得构建的解码网络能够适用于多个用户,此外,在语言模型更新时,解码网络中的节点(包括词节点和音素节点)不需要重构,也就是说,不需要重新构建解码网络,也就不需要重启解码器,从而实现了语言模型的在线更新,保证用户能不间断地获取到语音识别服务,提高用户体验。
基于上述任一实施例,解码网络中各个音素节点到该音素节点能够到达的所有词节点的路径上还包括各个用户ID对应的前瞻概率。参考图6,音素节点“b”和词节点“北京”之间的路径上标注有“ID 1”和“LA 1”,表示在这条路径上,用户ID 1对应的前瞻概率为SL 1;“s”和“苏州”之间标注有“ID 2”和“SL 2”,表示在这条路径上,用户ID 2对应的前瞻概率为LA 2;“j”和“江苏”之间标注有“ID 2”、“SL 2”、“ID 3”、“SL 3”,表示在这条路径上,用户ID 2对应的前瞻概率为LA 2,用户ID 3对应的前瞻概率为LA 3
基于用户ID对应的前瞻概率,在根据音素序列搜索对应的词序列的过程中,路径的得分需要加上该路径上的前瞻概率,即,在路径搜索时,路径L的中间得分为:
$$\mathrm{Score}'(L)=\sum_{i}\log SA_i+\sum_{j}\log SL_{j,x}+\sum_{n}\log LA_{n,x}$$
其中,SA i为路径L上的第i个声学得分,SL j,x为路径L上用户ID为x的用户对应的第j个概率得分,LA n,x为路径L上用户ID为x的用户对应的第n个前瞻概率。引入前瞻概率后,就可以在剪枝过程中提高一些路径的得分,防止其被裁剪掉,然后,在搜索到各条可能的路径后,再减去路径上的前瞻概率,得到各条路径对应的得分,即路径的最终得分为:
$$\mathrm{Score}(L)=\mathrm{Score}'(L)-\sum_{n}\log LA_{n,x}$$
最后,选取Score值最高的路径作为解码结果。
在构建解码网络时,根据用户ID对应的语言模型确定解码网络中,各用户ID对应的各个音素节点到该音素节点能够到达的所有词节点的路径的前瞻概率。具体地,针对各用户ID对应的前瞻概率,可通过以下公式计算得到:
$$LA(s)=\max_{w\in W(s)}p(w\mid h)$$
其中,W(s)是指从解码网络中的一个音素节点s开始可以到达的词节点对应的词的集合,h为训练该用户ID对应的语言模型使用的语料,p(w|h)为集合W(s)中的词w对应的出现频率分值,该出现频率分值用于表征词w在该用户ID对应的语料库中出现的频率。
本实施例中,将W(s)中的词在解码网络中对应的词节点称为音素节点s对应的目标词节点。作为一种可能的实现方式,通过如下方式确定各个词节点对应的出现频率分值:确定解码网络中与用户ID对应的语料库中的语料对应的词节点在语料库中出现的频率;针对语料库中的语料对应的词节点,对该词节点的频率进行归一化,得到该词节点对应的出现频率分值。
本实施例中,每个词节点对应的出现频率分值的取值在[0,1]范围内。
举例说明,以图3中的节点“k”为例,针对每个用户ID,以节点“k”为路径的起点可到达的目标词节点对应的词的集合为{卡,卡车,开,开门,凯旋,科,课},基于该用户ID对应的语料库,统计集合{卡,卡车,开,开门,凯旋,科,课}中的各个词在语料库中出现的频率,对集合{卡,卡车,开,开门,凯旋,科,课}中的各个词的频率进行归一化,得到各个词对应的出现频率分值p(卡|h)、p(卡车|h)、p(开|h)、p(开门|h)、p(凯旋|h)、p(科|h)、p(课|h),取这些出现频率分值中最大的出现频率分值,作为在解码网络中,节点“k”到集合{卡,卡车,开,开门,凯旋,科,课}中的各个词节点的路径上的该用户ID对应的前瞻概率,利用根据该用户ID对应的语言模型确定出的节点“k”对应的所有目标词节点的出现频率分值中的最大值,作为节点“k”到所有目标词节点的所有路径的前瞻概率,以避免在利用解码网络解码的过程中剪去节点“k”对应的路径中声学得分较低的路径。
相应地,在确定语言模型需要更新之后,本申请实施例的模型更新方法还包括以下步骤:根据解码网络中各词节点在用户ID对应的语料库中出现的频率,得到各个词节点对应用户ID的出现频率分值;针对解码网络中的每个音素节点,选择音素节点对应的目标词节点对应用户ID的出现频率分值中的最大值,确定为音素节点到各目标词节点的路径对应用户ID的最新前瞻概率;根据最新前瞻概率,更新解码网络中的音素节点到目标词节点的路径的与用户ID对应的前瞻概率。
进一步地,根据解码网络中各词节点在语料库中出现的频率,得到各个词节点对应的出现频率分值,包括:确定解码网络中与用户ID对应的语料库中的语料对应的词节点在语料库中出现的频率;针对语料库中的语料对应的词节点,对该词节点的频率进行归一化,得到该词节点对应的出现频率分值。
同样,在更新解码网络中的各用户ID对应的前瞻概率时,不需要修改解码网络中的节点(包括词节点和音素节点)。一旦检测到某一用户ID对应的语言模型需要更新时,只需要根据更新后的语言模型重新计算解码网络中各音素节点到目标词节点的路径的前瞻概率,然后,就可以将更新后的语言模型带来的变化引入解码网络,防止在路径修剪时裁剪掉声学得分较低的路径,解码器利用更新了前瞻概率后的解码网络进行路径搜索,就可以解出正确结果。
本申请实施例的语音识别方法,可用于识别任意一门语言,例如汉语、英语、日语、德语等。本申请实施例中主要是以对汉语的语音识别为例进行说明的,对其他语言的语音识别方法与此类似,本申请实施例中不再一一举例说明。
如图8所示,基于与上述语音识别方法相同的发明构思,本申请实施例还提供了一种语音识别装置80,包括获取模块801、解码模块802和确定模块803。
获取模块801,用于获取输入语音以及输入语音对应的用户ID。
解码模块802,用于根据用户ID,在解码网络中,搜索输入语音对应的 最优路径,解码网络中各词节点之间的路径标记有用户ID。
确定模块803,用于根据最优路径确定输入语音对应的文本信息。
进一步地,解码模块802具体用于:根据解码网络中各词节点之间的路径标记的用户ID对应的概率分值,确定输入语音对应的最优路径。
进一步地,解码模块802具体用于:根据用户ID,获取用户ID对应的语言模型;根据用户ID对应的语言模型,在解码网络中,搜索输入语音对应的最优路径。
基于上述任一实施例,解码网络是基于全量词典构建得到的。
进一步地,本申请实施例的语音识别装置80还包括模型更新模块,用于:确定用户ID对应的语言模型需要更新;根据用户ID对应的语料库中的语料,更新语言模型,并确定解码网络中各词节点之间的路径对应的最新概率得分;根据最新概率得分,更新解码网络中对应的词节点之间的路径标记的用户ID对应的概率得分。
进一步地,模型更新模块具体用于:检测用户ID对应的语料库是否有更新;若用户ID对应的语料库有更新,确定用户ID对应的语言模型需要更新。
进一步地,模型更新模块具体用于:计算用户ID对应的语料库中的所有语料的第一摘要值;将第一摘要值与第二摘要值进行比较,若不相同,则确认用户ID对应的语料库有更新,第二摘要值为最近一次更新后用户ID对应的语料库中所有语料的摘要值。
基于上述任一实施例,模型更新模块还用于:根据解码网络中各词节点在用户ID对应的语料库中出现的频率,得到各个词节点对应用户ID的出现频率分值;针对解码网络中的每个音素节点,选择音素节点对应的目标词节点对应用户ID的出现频率分值中的最大值,确定为音素节点到各目标词节点的路径对应用户ID的最新前瞻概率;根据最新前瞻概率,更新解码网络中的音素节点到目标词节点的路径的与用户ID对应的前瞻概率。
进一步地,模型更新模块具体用于:确定解码网络中与用户ID对应的语料库中的语料对应的词节点在语料库中出现的频率;针对语料库中的语料对 应的词节点,对该词节点的频率进行归一化,得到该词节点对应的出现频率分值。
本申请实施例提的语音识别装置与上述语音识别方法采用了相同的发明构思,能够取得相同的有益效果,在此不再赘述。
基于与上述语音识别方法相同的发明构思,本申请实施例还提供了一种电子设备,该电子设备具体可以为智能设备(如机器人,智能音箱等)的控制器,也可以为桌面计算机、便携式计算机、智能手机、平板电脑、个人数字助理(Personal Digital Assistant,PDA)、服务器等。如图9所示,该电子设备90可以包括处理器901、存储器902和收发机903。收发机903用于在处理器901的控制下接收和发送数据。
存储器902可以包括只读存储器(ROM)和随机存取存储器(RAM),并向处理器提供存储器中存储的程序指令和数据。在本申请实施例中,存储器可以用于存储语音识别方法的程序。
处理器901可以是CPU(中央处埋器)、ASIC(Application Specific Integrated Circuit,专用集成电路)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)或CPLD(Complex Programmable Logic Device,复杂可编程逻辑器件)处理器通过调用存储器存储的程序指令,按照获得的程序指令实现上述任一实施例中的语音识别方法。
本申请实施例提供了一种计算机可读存储介质,用于储存为上述电子设备所用的计算机程序指令,其包含用于执行上述语音识别方法的程序。
上述计算机存储介质可以是计算机能够存取的任何可用介质或数据存储设备,包括但不限于磁性存储器(例如软盘、硬盘、磁带、磁光盘(MO)等)、光学存储器(例如CD、DVD、BD、HVD等)、以及半导体存储器(例如ROM、EPROM、EEPROM、非易失性存储器(NAND FLASH)、固态硬盘(SSD))等。
以上所述,以上实施例仅用以对本申请的技术方案进行了详细介绍,但以上实施例的说明只是用于帮助理解本申请实施例的方法,不应理解为对本 申请实施例的限制。本技术领域的技术人员可轻易想到的变化或替换,都应涵盖在本申请实施例的保护范围之内。

Claims (12)

  1. A speech recognition method, characterized by comprising:
    acquiring input speech and a user ID corresponding to the input speech;
    searching, according to the user ID, a decoding network for an optimal path corresponding to the input speech, wherein paths between word nodes in the decoding network are marked with user IDs;
    determining text information corresponding to the input speech according to the optimal path.
  2. The method according to claim 1, characterized in that searching the decoding network for the optimal path corresponding to the input speech according to the user ID comprises:
    determining the optimal path corresponding to the input speech according to the probability scores, corresponding to the user ID, marked on the paths between the word nodes in the decoding network.
  3. The method according to claim 1, characterized in that searching the decoding network for the optimal path corresponding to the input speech according to the user ID comprises:
    acquiring, according to the user ID, a language model corresponding to the user ID;
    searching the decoding network for the optimal path corresponding to the input speech according to the language model corresponding to the user ID.
  4. The method according to any one of claims 1 to 3, characterized in that the decoding network is constructed on the basis of a full dictionary.
  5. The method according to claim 4, characterized in that the language model corresponding to the user ID is updated as follows:
    determining that the language model corresponding to the user ID needs to be updated;
    updating the language model according to the corpora in the corpus corresponding to the user ID, and determining the latest probability scores corresponding to the paths between the word nodes in the decoding network;
    updating, according to the latest probability scores, the probability scores, corresponding to the user ID, marked on the paths between the corresponding word nodes in the decoding network.
  6. The method according to claim 5, characterized in that determining that the language model corresponding to the user ID needs to be updated comprises:
    detecting whether the corpus corresponding to the user ID has been updated;
    if the corpus corresponding to the user ID has been updated, determining that the language model corresponding to the user ID needs to be updated.
  7. The method according to claim 6, characterized in that detecting whether the corpus corresponding to the user ID has been updated comprises:
    calculating a first digest value of all corpora in the corpus corresponding to the user ID;
    comparing the first digest value with a second digest value and, if they differ, confirming that the corpus corresponding to the user ID has been updated, the second digest value being the digest value of all corpora in the corpus corresponding to the user ID after the most recent update.
  8. The method according to any one of claims 5 to 7, characterized by further comprising, after determining that the language model corresponding to the user ID needs to be updated:
    obtaining an appearance-frequency score of each word node for the user ID according to the frequency of each word node of the decoding network in the corpus corresponding to the user ID;
    for each phoneme node in the decoding network, selecting the maximum of the appearance-frequency scores, for the user ID, of the target word nodes corresponding to the phoneme node, and determining it as the latest look-ahead probability, for the user ID, of the paths from the phoneme node to the target word nodes;
    updating, according to the latest look-ahead probability, the look-ahead probabilities, corresponding to the user ID, of the paths from the phoneme node to the target word nodes in the decoding network.
  9. The method according to claim 8, characterized in that obtaining the appearance-frequency score of each word node according to the frequency of each word node of the decoding network in the corpus corresponding to the user ID comprises:
    determining the frequency with which the word nodes corresponding to the corpora in the corpus of the user ID appear in the corpus;
    normalizing, for each word node corresponding to the corpora, the frequency of the word node to obtain the appearance-frequency score corresponding to the word node.
  10. A speech recognition apparatus, characterized by comprising:
    an acquisition module, configured to acquire input speech and a user ID corresponding to the input speech;
    a decoding module, configured to search, according to the user ID, a decoding network for an optimal path corresponding to the input speech, wherein paths between word nodes in the decoding network are marked with user IDs;
    a determination module, configured to determine text information corresponding to the input speech according to the optimal path.
  11. An electronic device, comprising a transceiver, a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the transceiver is configured to receive and send data under the control of the processor, and the processor, when executing the program, implements the steps of the method of any one of claims 1 to 9.
  12. A computer-readable storage medium having computer program instructions stored thereon, characterized in that the program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 9.
PCT/CN2020/073328 2019-01-30 2020-01-20 语音识别方法、装置、电子设备及存储介质 WO2020156342A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910094102.7 2019-01-30
CN201910094102.7A CN111508497B (zh) 2019-01-30 2019-01-30 语音识别方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2020156342A1 true WO2020156342A1 (zh) 2020-08-06

Family

ID=71840088

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073328 WO2020156342A1 (zh) 2019-01-30 2020-01-20 Speech recognition method and apparatus, electronic device, and storage medium

Country Status (3)

Country Link
CN (1) CN111508497B (zh)
TW (1) TWI752406B (zh)
WO (1) WO2020156342A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112102815B (zh) * 2020-11-13 2021-07-13 Shenzhen Zhuiyi Technology Co., Ltd. Speech recognition method and apparatus, computer device, and storage medium
CN113113024A (zh) * 2021-04-29 2021-07-13 iFLYTEK Co., Ltd. Speech recognition method and apparatus, electronic device, and storage medium
CN113327597B (zh) * 2021-06-23 2023-08-22 NetEase (Hangzhou) Network Co., Ltd. Speech recognition method, medium, apparatus, and computing device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541505A (zh) * 2011-01-04 2012-07-04 *** Communications Group Co. Voice input method and *** therefor
CN103092928A (zh) * 2012-12-31 2013-05-08 Anhui USTC iFLYTEK Information Technology Co., Ltd. Voice query method and ***
CN103903619A (zh) * 2012-12-28 2014-07-02 Anhui USTC iFLYTEK Information Technology Co., Ltd. Method and *** for improving speech recognition accuracy
CN105895104A (zh) * 2014-05-04 2016-08-24 iFLYTEK Zhiyuan Information Technology Co., Ltd. Speaker-adaptive recognition method and ***
CN106469554A (zh) * 2015-08-21 2017-03-01 iFLYTEK Co., Ltd. Adaptive recognition method and ***
CN106683677A (zh) * 2015-11-06 2017-05-17 Alibaba Group Holding Limited Speech recognition method and apparatus
US20180336887A1 (en) * 2017-05-22 2018-11-22 Samsung Electronics Co., Ltd. User adaptive speech recognition method and apparatus

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20010037287A1 (en) * 2000-03-14 2001-11-01 Broadbent David F. Method and apparatus for an advanced speech recognition portal for a mortgage loan management system
US10121471B2 (en) * 2015-06-29 2018-11-06 Amazon Technologies, Inc. Language model speech endpointing

Also Published As

Publication number Publication date
CN111508497B (zh) 2023-09-26
TWI752406B (zh) 2022-01-11
CN111508497A (zh) 2020-08-07
TW202032534A (zh) 2020-09-01

Similar Documents

Publication Publication Date Title
US11398236B2 (en) Intent-specific automatic speech recognition result generation
US20240161732A1 (en) Multi-dialect and multilingual speech recognition
CN108091328B (zh) Artificial-intelligence-based speech recognition error correction method, apparatus, and readable medium
US10176804B2 (en) Analyzing textual data
KR102390940B1 (ko) Context biasing for speech recognition
CN108899013B (zh) Voice search method, apparatus, and speech recognition ***
JP2022531479A (ja) Context bias for speech recognition
US11016968B1 (en) Mutation architecture for contextual data aggregator
WO2020156342A1 (zh) Speech recognition method and apparatus, electronic device, and storage medium
CN109754809A (zh) Speech recognition method and apparatus, electronic device, and storage medium
US10152298B1 (en) Confidence estimation based on frequency
US9922650B1 (en) Intent-specific automatic speech recognition result generation
US11562743B2 (en) Analysis of an automatically generated transcription
CN110070859B (zh) Speech recognition method and apparatus
CN111462748B (zh) Speech recognition processing method and apparatus, electronic device, and storage medium
KR20190000776A (ko) Information input method
CN111061840A (zh) Data recognition method and apparatus, and computer-readable storage medium
CN107112009B (zh) Method, ***, and computer-readable storage device for generating a confusion network
KR20180062003A (ko) Speech recognition error correction method
CN112489626A (zh) Information recognition method and apparatus, and storage medium
CN112632987B (zh) Word-slot recognition method and apparatus, and electronic device
CN114154487A (zh) Automatic text error correction method and apparatus, electronic device, and storage medium
WO2020233381A1 (zh) Speech-recognition-based service request method and apparatus, and computer device
US11756538B1 (en) Lower latency speech processing
JP2012018201A (ja) Text correction method and recognition method

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20748038

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20748038

Country of ref document: EP

Kind code of ref document: A1