CN117373458A - Method and equipment for voice recognition - Google Patents

Method and equipment for voice recognition

Info

Publication number
CN117373458A
Authority
CN
China
Prior art keywords
keyword
language model
mark
training corpus
finite state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210766769.9A
Other languages
Chinese (zh)
Inventor
蒋泳森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Douyin Vision Beijing Co Ltd
Original Assignee
Douyin Vision Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Douyin Vision Beijing Co Ltd filed Critical Douyin Vision Beijing Co Ltd
Priority to CN202210766769.9A priority Critical patent/CN117373458A/en
Publication of CN117373458A publication Critical patent/CN117373458A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 - Speech to text systems
    • G10L2015/025 - Phonemes, fenemes or fenones being the recognition units
    • G10L2015/0631 - Creating reference templates; Clustering
    • G10L2015/0633 - Creating reference templates; Clustering using lexical or orthographic knowledge sources
    • G10L2015/0638 - Interactive procedures
    • G10L2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for speech recognition, the method comprising: acquiring speech to be recognized; and inputting the speech to a decoder to output text conforming to natural language, wherein the decoder comprises an acoustic model, a pronunciation dictionary, and a language model. The language model comprises a basic language model and at least one keyword language model. The first training corpus used by the basic language model contains at least one keyword class mark; the at least one keyword language model is generated by training on at least one second training corpus, where the second training corpora correspond one-to-one with the keyword class marks. When the basic language model outputs a keyword class mark, the language model calls the keyword language model corresponding to that mark to perform a search and returns the corresponding keyword. The method can improve the efficiency of training and recognizing keywords and reduce resource occupation and time cost.

Description

Method and equipment for voice recognition
Technical Field
The embodiment of the disclosure relates to the field of information technology, in particular to a method and equipment for voice recognition.
Background
With the development of information technology, human-computer interaction scenes are more and more common in daily life, and voice recognition technology is widely applied to the fields of intelligent home, intelligent office, intelligent automobiles and the like as an important way of human-computer interaction.
Speech recognition technology takes speech as its object of study: through speech signal processing and pattern recognition, a machine automatically recognizes and understands speech dictated by a human. The current general speech recognition scheme is to learn from massive data through neural network techniques to obtain a speech recognition model, and then recognize speech with that model.
Keywords often occur during speech recognition. A keyword may refer to a class of words that are time-sensitive, specific, or proprietary, such as song names, person names, addresses, and technical terms. Because the number of keywords is large, executing recognition tasks for many keywords simultaneously requires building a huge language model for training and recognition, which incurs large resource occupation and time cost.
Therefore, there is a need in the industry for a speech recognition method that can improve the processing efficiency of keywords.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for voice recognition, which can improve the efficiency of training and recognizing keywords and reduce the resource occupation and time cost.
In a first aspect, embodiments of the present disclosure provide a method for speech recognition, comprising: acquiring voice to be recognized; inputting the voice to a decoder to output a text conforming to a natural language, wherein the decoder comprises an acoustic model for converting the voice into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the text conforming to the natural language; the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword class mark, the keyword class mark is used for replacing a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class; the at least one keyword language model is generated by training at least one second training corpus respectively, the at least one second training corpus corresponds to the at least one keyword class mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword class mark; and the language model is used for calling the keyword language model corresponding to the keyword category mark to search and returning the corresponding keywords under the condition that the basic language model outputs the keyword category mark.
In a second aspect, embodiments of the present disclosure provide a method for speech recognition, comprising: acquiring a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a keyword set of one category, and the at least one keyword category mark is used for replacing a corresponding keyword in the first training corpus; training according to the first training corpus to generate a basic language model; acquiring at least one second training corpus, wherein the at least one second training corpus corresponds to the at least one keyword category mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword category mark; generating at least one keyword-language model according to the at least one second training corpus respectively; and generating a decoder according to the basic language model and the at least one keyword language model, wherein the decoder is used for decoding input voice so as to output text conforming to natural language.
In a third aspect, embodiments of the present disclosure provide an apparatus for speech recognition, comprising: the acquisition module is used for acquiring the voice to be recognized; a processing module for inputting the speech to a decoder to output a natural language compliant text, wherein the decoder comprises an acoustic model for converting the speech into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the natural language compliant text; the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword class mark, the keyword class mark is used for replacing a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class; the at least one keyword language model is generated by training at least one second training corpus respectively, the at least one second training corpus corresponds to the at least one keyword class mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword class mark; and the language model is used for calling the keyword language model corresponding to the keyword category mark to search and returning the corresponding keywords under the condition that the basic language model outputs the keyword category mark.
In a fourth aspect, embodiments of the present disclosure provide an apparatus for speech recognition, comprising: the system comprises an acquisition module, a first training corpus and a second training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a keyword set of one category, and the at least one keyword category mark is used for replacing a corresponding keyword in the first training corpus; the processing module is used for training according to the first training corpus and generating a basic language model; the acquisition module is further used for acquiring at least one second training corpus, the at least one second training corpus corresponds to the at least one keyword category mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword category mark; the processing module is further used for generating at least one keyword language model according to the at least one second training corpus respectively; the processing module is further configured to generate a decoder according to the base language model and the at least one keyword-language model, the decoder being configured to decode input speech to output text conforming to natural language.
In a fifth aspect, embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
the memory stores computer-executable instructions; the processor executes the computer-executable instructions stored in the memory such that the at least one processor performs the method for speech recognition as described in the first aspect and the various possible designs of the first aspect, or performs the method for speech recognition as described in the second aspect and the various possible designs of the second aspect.
In a sixth aspect, embodiments of the present disclosure provide a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method for speech recognition as described above in the first aspect and the various possible designs of the first aspect, or perform the method for speech recognition as described above in the second aspect and the various possible designs of the second aspect.
In a seventh aspect, embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the method for speech recognition as described in the first aspect and the various possible designs of the first aspect, or performs the method for speech recognition as described in the second aspect and the various possible designs of the second aspect.
The embodiment provides a method and equipment for voice recognition, which are used for independently training a basic query part and a keyword part and combining the basic query part and the keyword part together for decoding in the decoding process, so that the complexity of training and decoding the keywords is reduced, the efficiency of training and identifying the keywords is improved, and the resource occupation and the time cost are reduced.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, a brief description will be given below of the drawings that are needed in the embodiments or the description of the prior art, it being obvious that the drawings in the following description are some embodiments of the present disclosure, and that other drawings may be obtained from these drawings without inventive effort to a person of ordinary skill in the art.
FIG. 1 is a schematic diagram of a possible application scenario of an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a possible application scenario of a further embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a frame of a speech recognition system of an embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method for speech recognition according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of a method for speech recognition according to yet another embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an apparatus 600 for speech recognition according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an apparatus 700 for speech recognition according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.
With the popularization of intelligent terminals, human-computer interaction scenarios are increasingly common. As an important interface for human-computer interaction, speech recognition software is built into terminal devices by many manufacturers so that users can control the device by voice. For example, a user may query city weather by voice, control the terminal to play music, or dial a contact in the address book. In some interaction scenarios, there are a large number of keywords. Keywords refer to a class of words with timeliness and specificity that appear in a particular business scenario, such as song names, singer names, address book names, place names, and terminal application names.
Compared with common vocabulary, keywords change frequently, are unpredictable, and are often user-defined. Moreover, keywords make up a small proportion of the massive speech recognition training corpora, so in a conventional speech recognition scheme, training and recognizing keywords require more resources and time, and the recognition error rate is high.
In order to solve the above-mentioned problems, the embodiments of the present disclosure provide a method and apparatus for speech recognition, which have the main ideas that in the speech recognition process, a basic query (query) portion and a keyword portion are independently trained, and are combined together to perform decoding in the decoding process, so that the efficiency of training and recognizing keywords can be improved, and the resource occupation and the time cost can be reduced.
Fig. 1 is a schematic diagram of a possible application scenario of an embodiment of the present disclosure. As shown in fig. 1, the application scenario may include a speech recognition device 110. The user may give an instruction to the speech recognition device 110 by voice. After acquiring the voice signal, the speech recognition device 110 may process the signal, extract acoustic features, recognize the semantics of the signal, and output text that can be recognized by a computer.
Alternatively, the voice recognition device 110 may include, but is not limited to, the following terminal devices: intelligent terminals, mobile phones, tablet computers, personal digital assistants, palm game consoles or wearable devices, etc.
Fig. 2 is a schematic diagram of a possible application scenario of a further embodiment of the present disclosure. As shown in fig. 2, the application scenario may include a speech recognition device 110 and a terminal device 120. Wherein the speech recognition device 110 may be connected to the terminal device 120 via a network. Terminal device 120 may include, but is not limited to: intelligent terminals, cell phones, computers, tablet computers, personal digital assistants, palm game consoles or wearable devices, etc. The speech recognition device 110 may include, but is not limited to, a speech recognizer, a server, a tablet, a computer, and the like. After the user sends a voice signal to the terminal device 120, the terminal device 120 may send the voice signal to the voice recognition device 110 through the network, the voice recognition device 110 processes the voice signal, extracts acoustic features, recognizes the semantics of the voice signal, and outputs text that can be recognized by a computer.
FIG. 3 is a schematic diagram of a speech recognition system according to one embodiment of the present disclosure. As shown in fig. 3, the speech recognition system includes modules such as an acoustic model, a pronunciation dictionary, a language model, and a decoder.
The acoustic model is obtained by training with acoustic features and corresponding labels, and establishes a probability distribution between the acoustic signal and the modeling unit. The language model is obtained by training on a massive corpus and maps word sequences to text; it models the plausibility of the text's logic. The pronunciation dictionary establishes the mapping between the acoustic model and the language model. It will be appreciated that the acoustic model converts speech into a phoneme sequence, the pronunciation dictionary converts the phoneme sequence into a word sequence, and the language model converts the word sequence into text conforming to natural language.
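The division of labor among the three models can be summarized by the standard noisy-channel formulation of speech recognition (a textbook identity, not quoted from the patent):

```latex
W^{*} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W)\,P(W)
```

where O is the acoustic observation sequence and W a candidate word sequence: P(O | W) is scored by the acoustic model (linked to words through the pronunciation dictionary), and P(W) is scored by the language model.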
The decoder constructs a decoding graph from knowledge sources such as the acoustic model, the pronunciation dictionary, and the language model; the decoding process is the process of searching for the optimal path in the decoding graph. When decoding, it is often necessary to represent the language model, pronunciation dictionary, and acoustic model in the form of weighted finite state transducers (weighted finite state transducer, WFST; the corresponding files are conventionally suffixed .fst), and then compile the overall decoding graph through operations such as composition, determinization, and minimization. During decoding, a pruning strategy is adopted to cut low-probability paths at the current moment so as to keep the optimal path as far as possible.
Fig. 4 is a flow chart of a method for speech recognition according to an embodiment of the present disclosure. The method may be applied to the speech recognition device 110 shown in fig. 1 or fig. 2, and may be a terminal device or a server, for example. As shown in fig. 4, the method includes the following.
S401, acquiring a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a keyword set of one category, and the at least one keyword category mark is used for replacing a corresponding keyword in the first training corpus.
The keywords may refer to words with timeliness and specificity, such as song names, singer names, address book names, place names, terminal application names, and the like, which appear in a specific business scenario.
It will be appreciated that the embodiments of the present disclosure may divide a plurality of keywords into different keyword sets according to categories, each keyword set corresponds to one keyword category label, and then replace keywords in the first training corpus with corresponding keyword category labels.
For example, all singer names may be added to a singer name set, and the keyword class flag corresponding to the singer name set may be named class_singer. Adding all song names into a song name set, and naming the keyword category mark corresponding to the song name set as class_music. Then the singer name in the first training corpus is replaced by class_singer, and the song name in the first training corpus is replaced by class_music.
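The replacement step above can be sketched as follows. This is an illustrative sketch only, not the patent's implementation; the keyword lists are invented examples, and only the class-mark names class_singer and class_music come from the text.

```python
# Hypothetical keyword sets per class; the class-mark names follow the example
# above, the member words are assumptions for illustration.
KEYWORD_CLASSES = {
    "class_singer": {"Jay Chou", "Faye Wong"},
    "class_music": {"Blue and White Porcelain"},
}

def tag_corpus(sentence, keyword_classes):
    """Replace each keyword occurrence with its class mark; also collect the
    replaced words per class (these collections become the second training
    corpora described below)."""
    seen = {mark: set() for mark in keyword_classes}
    for mark, keywords in keyword_classes.items():
        for kw in keywords:
            if kw in sentence:
                sentence = sentence.replace(kw, mark)
                seen[mark].add(kw)
    return sentence, seen

tagged, seen = tag_corpus("play Blue and White Porcelain by Jay Chou",
                          KEYWORD_CLASSES)
# tagged == "play class_music by class_singer"
```

The tagged sentences form the first training corpus, while the collected keyword sets per class mark form the second training corpora.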
S402, training is carried out according to the first training corpus, and a basic language model is generated.
Alternatively, after generating the basic language model, a corresponding first weighted finite state transducer may be generated from it, and this transducer may serve as the language model for the basic query. As an example, the first weighted finite state transducer may be denoted as g_base.
Alternatively, the embodiments of the present disclosure do not limit the kind of basic language model generated, for example an N-gram model or a recurrent neural network (RNN).
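As a minimal illustration of the N-gram option, the sketch below counts bigrams over the tagged corpus, treating class marks as ordinary tokens. This is a toy stand-in, not the patent's training procedure; real systems would add smoothing.

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count bigram occurrences, with sentence-boundary markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def bigram_prob(counts, a, b):
    """Unsmoothed maximum-likelihood estimate of P(b | a)."""
    total = sum(counts[a].values())
    return counts[a][b] / total if total else 0.0

# Tagged sentences as produced in the replacement step above (assumed data).
lm = train_bigram(["play class_music by class_singer",
                   "play class_music"])
```

Because keywords were replaced by class marks, the base model only needs to learn contexts like "play class_music", not every individual song name.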
S403, at least one second training corpus is obtained, the at least one second training corpus corresponds to the at least one keyword category mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword category mark.
It will be appreciated that the second training corpus is a collection of different types of keywords, and each second training corpus corresponds to one keyword category label. For example, the singer name replaced in the S401 part may form a keyword set, and be referred to as a second training corpus corresponding to class_singer. The keyword set formed by the song names replaced in the S401 part may be referred to as a second training corpus corresponding to class_music.
S404, generating at least one keyword language model according to at least one second training corpus respectively.
For example, a keyword language model corresponding to class_singer may be generated according to a second training corpus corresponding to class_singer. And generating a keyword language model corresponding to the class_music according to the second training corpus corresponding to the class_music.
Further, at least one second weighted finite state transducer may be generated based on the at least one keyword language model, respectively. For example, a corresponding second weighted finite state transducer may be generated from the keyword language model corresponding to class_singer and denoted as g_singer. A corresponding second weighted finite state transducer may be generated from the keyword language model corresponding to class_music and denoted g_music.
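For intuition, each keyword model can be thought of as a small acceptor over its keyword set. The character trie below is a simplified stand-in for the g_singer/g_music WFSTs named above (a real WFST would also carry weights and output labels); the member names are assumptions.

```python
def build_acceptor(keywords):
    """Compile a set of keywords into a character trie acting as a
    simple unweighted acceptor."""
    root = {}
    for kw in keywords:
        node = root
        for ch in kw:
            node = node.setdefault(ch, {})
        node["<final>"] = True  # mark a complete keyword
    return root

def accepts(acceptor, word):
    """Walk the trie; accept only if we end on a final state."""
    node = acceptor
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return node.get("<final>", False)

g_singer = build_acceptor({"Jay Chou", "Faye Wong"})
```

During decoding, the search detours into such a per-class network only when the base model emits the matching class mark.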
Alternatively, the embodiments of the present disclosure do not limit the kind of keyword language model generated, for example an N-gram model or a recurrent neural network (RNN).
S405, generating a decoder according to the basic language model and at least one keyword language model, wherein the decoder is used for decoding input voice so as to output text conforming to natural language.
Optionally, the generating a decoder according to the basic language model and the at least one keyword language model includes: generating a decoder according to an acoustic model, a pronunciation dictionary and a language model, wherein the acoustic model is used for converting the voice into a phoneme sequence, and the pronunciation dictionary is used for converting the phoneme sequence into a word sequence; the language model is used for converting the word sequence into a text conforming to natural language; the language model comprises a basic language model and at least one keyword language model, and the language model is used for calling the keyword language model corresponding to the keyword category mark to search and returning the corresponding keyword under the condition that the basic language model outputs the keyword category mark.
As an example, weighted finite state machines may be generated from the acoustic model, the pronunciation dictionary, and the language model, respectively, and combined into a decoding graph to decode the input speech.
For example, a first weighted finite state transducer may be generated from the base language model; at least one second weighted finite state transducer is generated based on the at least one keyword language model, respectively. The first weighted finite state transducer and the second weighted finite state transducer described above may be used to construct a decoding graph.
In the embodiment of the disclosure, a method and equipment for voice recognition are provided, wherein the method trains a basic query part and a keyword part independently and combines the basic query part and the keyword part together for decoding in the decoding process, so that the complexity of training and decoding keywords is reduced, the efficiency of training and recognizing the keywords is improved, and the resource occupation and the time cost are reduced.
The training process of the speech recognition method is described above, and the decoding process in the speech recognition method is described next. Optionally, the method of fig. 4 further comprises:
s406, decoding the input voice according to the decoder to output the text conforming to the natural language.
In some examples, decoding input speech with the decoder includes: performing a pruning operation over a first weighted finite state transducer generated from the basic language model; during this pruning operation, if the output of the currently processed edge is a keyword class mark, determining the second weighted finite state transducer corresponding to that mark from among the at least one second weighted finite state transducer, each of which is generated from a keyword language model; searching for the keyword in that second weighted finite state transducer and outputting it; and, after the keyword search ends, returning to the first weighted finite state transducer to continue the pruning operation.
Optionally, embodiments of the present disclosure do not limit the specific strategies adopted by pruning operations. For example, cumulative beam pruning strategies, histogram pruning strategies, acoustic model predictive pruning strategies, and the like may be employed.
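As one concrete instance of the cumulative beam pruning mentioned above, the sketch below keeps only hypotheses whose cost is within a fixed beam of the best hypothesis at the current step. Costs are negative log-probabilities (lower is better); the beam width is an assumed illustrative value.

```python
def beam_prune(hypotheses, beam=5.0):
    """Keep hypotheses whose cost is within `beam` of the best cost.
    `hypotheses` is a list of (path, cost) pairs."""
    best = min(cost for _, cost in hypotheses)
    return [(path, cost) for path, cost in hypotheses
            if cost <= best + beam]

pruned = beam_prune([("path_a", 1.0), ("path_b", 3.5), ("path_c", 9.0)])
# path_c lies outside the beam (9.0 > 1.0 + 5.0) and is cut
```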
In the embodiment of the disclosure, keywords in a first training corpus are replaced by keyword class marks in a training process, a first weighted finite state transducer is generated, the replaced keywords are generated into at least one second weighted finite state transducer according to different classes, the first weighted finite state transducer is used as a basic query part for decoding in a decoding process, and the second weighted finite state transducer is used for keyword searching when the keywords are required to be searched, so that the complexity of a decoder is reduced, the efficiency of training and recognizing the keywords can be improved, and the resource occupation and the time cost are reduced.
In some examples, the inputting the speech to a decoder to output text conforming to natural language includes: before the search keyword starts, outputting a keyword start flag for indicating that keywords corresponding to the keyword category flag are to be started to be output; after the search keyword is ended, a keyword ending mark is output, and the keyword ending mark is used for indicating that the keywords corresponding to the keyword class mark are ended.
In the embodiments of the present disclosure, during decoding of the speech, a keyword start mark and a keyword end mark are output, respectively, before entering the second weighted finite state transducer and after the keyword search is finished, so as to identify the keyword retrieval state during decoding, which helps improve decoding efficiency.
For example, taking the cumulative beam pruning strategy: during decoding, a beam search is performed along g_base.fst. If the output mark of the current edge is class_singer, indicating that a singer name needs to be retrieved at this point, the search proceeds into the network of g_singer.fst corresponding to singer names, and a keyword start mark b_singer is output at the same time to indicate that the result to be output next is a singer name. After the keyword search is finished, the search returns to the first weighted finite state transducer to continue the pruning operation, and a keyword end mark e_singer is output to indicate that the search for the singer name has ended.
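The detour into g_singer.fst and back can be shown with a toy decoder in which each graph is reduced to a plain dictionary mapping an input token to an output label; the graph contents and token sequences below are illustrative assumptions, not the actual WFST machinery:

```python
# Toy stand-ins for g_base.fst and g_singer.fst: input token -> output label.
G_BASE = {"play": "play", "a": "a", "song": "song"}
G_SINGER = {"tom": "Tom", "anna": "Anna"}  # keyword sub-network for singer names

def decode(tokens):
    """Walk the base graph; when a token is covered by the singer
    sub-network (i.e. the current edge would output class_singer),
    detour into it and bracket the result with b_singer / e_singer."""
    output = []
    for tok in tokens:
        if tok in G_SINGER:
            output.append("b_singer")     # keyword start mark
            output.append(G_SINGER[tok])  # search inside g_singer.fst
            output.append("e_singer")     # keyword end mark; back to g_base.fst
        else:
            output.append(G_BASE[tok])
    return output
```

Decoding `["play", "anna", "song"]` yields `["play", "b_singer", "Anna", "e_singer", "song"]`, so the singer name appears bracketed by the two marks exactly as in the description above.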
Fig. 5 is a flow chart of a method for speech recognition according to yet another embodiment of the present disclosure. The method may be performed by a terminal device. As shown in fig. 5, the method includes the following.
S501, acquiring voice to be recognized.
Optionally, the speech to be recognized may include a speech signal uttered by the user, or an acoustic feature sequence obtained by extracting acoustic features from the user's speech signal. For example, the acoustic feature sequence may be obtained by framing the audio and extracting features from each frame, exploiting the short-time stationarity of the sound signal.
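A minimal framing sketch, assuming 16 kHz audio with a common 25 ms window and 10 ms hop (400 and 160 samples); these parameter values are illustrative, and a real front end would additionally window each frame and compute features such as filter-bank energies or MFCCs:

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping fixed-length frames, relying on
    the short-time stationarity of speech: each ~25 ms frame is treated
    as quasi-stationary for subsequent feature extraction."""
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]
```

The 10 ms hop means consecutive frames overlap by 240 samples, so no speech content falls between frames.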
S502, inputting the speech into a decoder to output text conforming to natural language, wherein the decoder includes an acoustic model, a pronunciation dictionary, and a language model; the acoustic model is used to convert the speech into a phoneme sequence, the pronunciation dictionary is used to convert the phoneme sequence into a word sequence, and the language model is used to convert the word sequence into text conforming to natural language.
The language model includes a basic language model and at least one keyword language model. The first training corpus used by the basic language model includes at least one keyword class mark, each keyword class mark replaces a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class.
The at least one keyword language model is generated by training on at least one second training corpus respectively, the at least one second training corpus corresponds one-to-one to the at least one keyword class mark, and each second training corpus includes the set of keywords corresponding to its keyword class mark.
The language model is used to, when the basic language model outputs a keyword class mark, call the keyword language model corresponding to the keyword class mark to perform a search and return the corresponding keyword.
In some examples, inputting the speech into the decoder to output text conforming to natural language includes: performing a pruning operation according to a first weighted finite state transducer, the first weighted finite state transducer being generated according to the basic language model; during the pruning operation on the first weighted finite state transducer, if the output of the currently processed edge is a keyword class mark, determining a second weighted finite state transducer corresponding to the keyword class mark from at least one second weighted finite state transducer, the at least one second weighted finite state transducer being generated according to the at least one keyword language model respectively; searching for the keyword in the second weighted finite state transducer corresponding to the keyword class mark and outputting the keyword; and after the keyword search is finished, returning to the first weighted finite state transducer to continue the pruning operation.
In the embodiments of the present disclosure, during speech recognition, the decoder is built on a language model trained from the first training corpus and the at least one second training corpus: the keywords in the first training corpus are replaced with keyword class marks, a first weighted finite state transducer is generated, and at least one second weighted finite state transducer is generated from the replaced keywords according to their classes. During decoding, the first weighted finite state transducer serves as the basic part of the decoding search, and a second weighted finite state transducer is used for keyword search when a keyword needs to be retrieved. This reduces the complexity of the decoder, improves the efficiency of both training and keyword recognition, and reduces resource occupation and time cost.
In some examples, inputting the speech into the decoder to output text conforming to natural language includes: before the keyword search starts, outputting a keyword start mark, where the keyword start mark is used to indicate that output of the keyword corresponding to the keyword class mark is about to begin; and after the keyword search ends, outputting a keyword end mark, where the keyword end mark is used to indicate that output of the keyword corresponding to the keyword class mark has ended.
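The marks also make post-processing straightforward: the final natural-language text can be recovered (or the keyword spans located) by filtering the marks out of the decoder output. A sketch, assuming mark labels use the b_/e_ prefixes of the g_singer.fst example above:

```python
def strip_marks(labels):
    """Drop keyword start/end marks (b_*/e_*) from the decoder output,
    leaving only the natural-language text; the labels between a pair
    of marks are the retrieved keyword."""
    return [lab for lab in labels
            if not (lab.startswith("b_") or lab.startswith("e_"))]
```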
Corresponding to the method for speech recognition of the above embodiments, fig. 6 is a schematic structural diagram of an apparatus 600 for speech recognition according to an embodiment of the present disclosure. For ease of illustration, only portions relevant to embodiments of the present disclosure are shown. Referring to fig. 6, the apparatus includes: an acquisition module 610 and a processing module 620.
The obtaining module 610 is configured to obtain a first training corpus, where the first training corpus includes at least one keyword class mark, each keyword class mark corresponds to a keyword set of one class, and the at least one keyword class mark is used to replace the corresponding keywords in the first training corpus.
The processing module 620 is configured to perform training according to the first training corpus to generate a basic language model.
The obtaining module 610 is further configured to obtain at least one second training corpus, where the at least one second training corpus corresponds one-to-one to the at least one keyword class mark, and each second training corpus includes the set of keywords corresponding to its keyword class mark.
The processing module 620 is further configured to generate at least one keyword language model according to the at least one second training corpus respectively.
The processing module 620 is further configured to generate a decoder according to the basic language model and the at least one keyword language model, the decoder being configured to decode input speech to output text conforming to natural language.
In one embodiment of the present disclosure, the processing module 620 is specifically configured to: generate the decoder according to an acoustic model, a pronunciation dictionary, and a language model, where the acoustic model is used to convert the speech into a phoneme sequence, the pronunciation dictionary is used to convert the phoneme sequence into a word sequence, and the language model is used to convert the word sequence into text conforming to natural language; the language model includes the basic language model and the at least one keyword language model, and the language model is used to, when the basic language model outputs a keyword class mark, call the keyword language model corresponding to the keyword class mark to perform a search and return the corresponding keyword.
In one embodiment of the present disclosure, the processing module 620 is further configured to decode the input speech according to the decoder to output text conforming to natural language, where the processing module 620 is specifically configured to: perform a pruning operation according to a first weighted finite state transducer, the first weighted finite state transducer being generated according to the basic language model; during the pruning operation on the first weighted finite state transducer, if the output of the currently processed edge is a keyword class mark, determine a second weighted finite state transducer corresponding to the keyword class mark from at least one second weighted finite state transducer, the at least one second weighted finite state transducer being generated according to the at least one keyword language model respectively; search for the keyword in the second weighted finite state transducer corresponding to the keyword class mark and output the keyword; and after the keyword search is finished, return to the first weighted finite state transducer to continue the pruning operation.
In one embodiment of the present disclosure, the processing module 620 is specifically configured to: before the keyword search starts, output a keyword start mark, where the keyword start mark is used to indicate that output of the keyword corresponding to the keyword class mark is about to begin; and after the keyword search ends, output a keyword end mark, where the keyword end mark is used to indicate that output of the keyword corresponding to the keyword class mark has ended.
Fig. 7 is a schematic structural diagram of an apparatus 700 for voice recognition according to an embodiment of the present disclosure. For ease of illustration, only portions relevant to embodiments of the present disclosure are shown. Referring to fig. 7, the apparatus includes: an acquisition module 710 and a processing module 720.
The acquiring module 710 is configured to acquire a voice to be recognized.
The processing module 720 is configured to input the speech into a decoder to output text conforming to natural language, where the decoder includes an acoustic model used to convert the speech into a phoneme sequence, a pronunciation dictionary used to convert the phoneme sequence into a word sequence, and a language model used to convert the word sequence into text conforming to natural language.
The language model includes a basic language model and at least one keyword language model, where the first training corpus used by the basic language model includes at least one keyword class mark, each keyword class mark replaces a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class; the at least one keyword language model is generated by training on at least one second training corpus respectively, the at least one second training corpus corresponds one-to-one to the at least one keyword class mark, and each second training corpus includes the set of keywords corresponding to its keyword class mark; and the language model is used to, when the basic language model outputs a keyword class mark, call the keyword language model corresponding to the keyword class mark to perform a search and return the corresponding keyword.
In one embodiment of the present disclosure, the processing module 720 is specifically configured to: perform a pruning operation according to a first weighted finite state transducer, the first weighted finite state transducer being generated according to the basic language model; during the pruning operation on the first weighted finite state transducer, if the output of the currently processed edge is a keyword class mark, determine a second weighted finite state transducer corresponding to the keyword class mark from at least one second weighted finite state transducer, the at least one second weighted finite state transducer being generated according to the at least one keyword language model respectively; search for the keyword in the second weighted finite state transducer corresponding to the keyword class mark and output the keyword; and after the keyword search is finished, return to the first weighted finite state transducer to continue the pruning operation.
In one embodiment of the present disclosure, the processing module 720 is further specifically configured to: before the keyword search starts, output a keyword start mark, where the keyword start mark is used to indicate that output of the keyword corresponding to the keyword class mark is about to begin; and after the keyword search ends, output a keyword end mark, where the keyword end mark is used to indicate that output of the keyword corresponding to the keyword class mark has ended.
The apparatus provided in this embodiment may be used to execute the technical solutions of the foregoing method embodiments; its implementation principle and technical effects are similar and are not repeated here.
In order to achieve the above embodiments, the embodiments of the present disclosure further provide an electronic device.
Fig. 8 is a schematic structural diagram of an electronic device 800 according to an embodiment of the disclosure, where the electronic device 800 may be a terminal device or a server. The electronic device may be used to perform the steps of the method shown in fig. 4 or fig. 5.
The terminal device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (Personal Digital Assistant, PDA for short), a tablet (Portable Android Device, PAD for short), a portable multimedia player (Portable Media Player, PMP for short), an in-vehicle terminal (e.g., an in-vehicle navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 8, the electronic device 800 may include a processing device (e.g., a central processor, a graphics processor, etc.) 801 that may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 802 or a program loaded from a storage device 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 are also stored. The processing device 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
In general, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 807 including, for example, a liquid crystal display (Liquid Crystal Display, LCD for short), a speaker, a vibrator, and the like; storage 808 including, for example, magnetic tape, hard disk, etc.; communication means 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 shows an electronic device 800 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via communication device 809, or installed from storage device 808, or installed from ROM 802. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 801.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above-described embodiments.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (Local Area Network, LAN for short) or a wide area network (Wide Area Network, WAN for short), or it may be connected to an external computer (e.g., connected via the internet using an internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. The name of the unit does not in any way constitute a limitation of the unit itself, for example the first acquisition unit may also be described as "unit acquiring at least two internet protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a method for speech recognition, comprising:
acquiring voice to be recognized;
inputting the voice to a decoder to output a text conforming to a natural language, wherein the decoder comprises an acoustic model for converting the voice into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the text conforming to the natural language;
the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword class mark, the keyword class mark is used for replacing a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class;
the at least one keyword language model is generated by training at least one second training corpus respectively, the at least one second training corpus corresponds to the at least one keyword class mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword class mark;
and the language model is used for, in a case where the basic language model outputs the keyword class mark, calling the keyword language model corresponding to the keyword class mark to perform a search and return the corresponding keyword.
According to one or more embodiments of the present disclosure, the inputting the speech to a decoder to output text conforming to natural language includes: performing a pruning operation according to a first weighted finite state transducer, the first weighted finite state transducer being generated according to the basic language model; during the pruning operation on the first weighted finite state transducer, if the output of the currently processed edge is a keyword class mark, determining a second weighted finite state transducer corresponding to the keyword class mark from at least one second weighted finite state transducer, the at least one second weighted finite state transducer being generated according to the at least one keyword language model respectively; searching for the keyword in the second weighted finite state transducer corresponding to the keyword class mark and outputting the keyword; and after the keyword search is finished, returning to the first weighted finite state transducer to continue the pruning operation.
According to one or more embodiments of the present disclosure, the decoding of the input speech according to the decoder includes: before the keyword search starts, outputting a keyword start mark, the keyword start mark being used to indicate that output of the keyword corresponding to the keyword class mark is about to begin; and after the keyword search ends, outputting a keyword end mark, the keyword end mark being used to indicate that output of the keyword corresponding to the keyword class mark has ended.
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a method for speech recognition, comprising:
acquiring a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a keyword set of one category, and the at least one keyword category mark is used for replacing a corresponding keyword in the first training corpus;
training according to the first training corpus to generate a basic language model;
acquiring at least one second training corpus, wherein the at least one second training corpus corresponds to the at least one keyword category mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword category mark;
generating at least one keyword language model according to the at least one second training corpus respectively;
and generating a decoder according to the basic language model and the at least one keyword language model, wherein the decoder is used for decoding input voice so as to output text conforming to natural language.
According to one or more embodiments of the present disclosure, the generating a decoder from the base language model and the at least one keyword language model includes: generating the decoder according to an acoustic model, a pronunciation dictionary and a language model, wherein the acoustic model is used for converting the voice into a phoneme sequence, the pronunciation dictionary is used for converting the phoneme sequence into a word sequence, and the language model is used for converting the word sequence into the text conforming to the natural language; the language model comprises the basic language model and the at least one keyword language model, and is used for calling the keyword language model corresponding to the keyword category mark to search and returning the corresponding keyword under the condition that the basic language model outputs the keyword category mark.
According to one or more embodiments of the present disclosure, the method further comprises: decoding the input speech according to the decoder to output text conforming to natural language, wherein the decoding the input speech according to the decoder comprises: performing a pruning operation according to a first weighted finite state transducer, the first weighted finite state transducer being generated according to the basic language model; during the pruning operation on the first weighted finite state transducer, if the output of the currently processed edge is a keyword class mark, determining a second weighted finite state transducer corresponding to the keyword class mark from at least one second weighted finite state transducer, the at least one second weighted finite state transducer being generated according to the at least one keyword language model respectively; searching for the keyword in the second weighted finite state transducer corresponding to the keyword class mark and outputting the keyword; and after the keyword search is finished, returning to the first weighted finite state transducer to continue the pruning operation.
According to one or more embodiments of the present disclosure, the decoding of the input speech according to the decoder includes: before the keyword search starts, outputting a keyword start mark, the keyword start mark being used to indicate that output of the keyword corresponding to the keyword class mark is about to begin; and after the keyword search ends, outputting a keyword end mark, the keyword end mark being used to indicate that output of the keyword corresponding to the keyword class mark has ended.
In a third aspect, according to one or more embodiments of the present disclosure, there is provided an apparatus for speech recognition, comprising:
the acquisition module is used for acquiring the voice to be recognized;
a processing module for inputting the speech to a decoder to output a natural language compliant text, wherein the decoder comprises an acoustic model for converting the speech into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the natural language compliant text;
the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword class mark, the keyword class mark is used for replacing a corresponding keyword in the first training corpus, and each keyword class mark corresponds to a keyword set of one class;
the at least one keyword language model is generated by training at least one second training corpus respectively, the at least one second training corpus corresponds to the at least one keyword class mark one by one, and the second training corpus comprises a set of keywords corresponding to the keyword class mark;
and the language model is used for, in a case where the basic language model outputs the keyword class mark, calling the keyword language model corresponding to the keyword class mark to perform a search and return the corresponding keyword.
According to one or more embodiments of the present disclosure, the processing module is specifically configured to: pruning operation is carried out according to a first weighted finite state conversion machine, and the first weighted finite state conversion machine is generated according to the basic language model; in the pruning operation process of the first weighted finite state conversion machine, if the output of the currently processed edge is a keyword category mark, determining a second weighted finite state conversion machine corresponding to the keyword category mark from at least one second weighted finite state conversion machine, wherein the at least one second weighted finite state conversion machine is generated according to the at least one keyword language model respectively; searching keywords in a second weighted finite state transducer corresponding to the keyword class mark and outputting the keywords; and after the search keyword is finished, returning to the first weighted finite state transducer to continue pruning operation.
According to one or more embodiments of the present disclosure, the processing module is specifically configured to: output a keyword start mark before the keyword search starts, wherein the keyword start mark indicates that output of the keywords corresponding to the keyword category mark is about to begin; and output a keyword end mark after the keyword search is finished, wherein the keyword end mark indicates that output of the keywords corresponding to the keyword category mark has ended.
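The start/end marks described above can be illustrated as below. The mark format (`<kw:…>` / `</kw>`) is an invented placeholder; the patent does not specify the mark strings.

```python
# Wrap the keywords emitted from a keyword search between a start mark and
# an end mark, so downstream consumers can locate the keyword span in the
# decoded text. The mark strings here are illustrative only.

def emit_with_marks(category_mark, keywords):
    start = f"<kw:{category_mark}>"  # keyword start mark (invented format)
    end = "</kw>"                    # keyword end mark (invented format)
    return [start, *keywords, end]


tokens = emit_with_marks("contact", ["zhang", "san"])
print(" ".join(tokens))  # <kw:contact> zhang san </kw>
```

Marking the span this way lets a post-processor, for example, highlight recognized contact names or route them to an application-specific handler.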
In a fourth aspect, according to one or more embodiments of the present disclosure, there is provided an apparatus for speech recognition, comprising:
an acquisition module, configured to acquire a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a set of keywords of one category, and the at least one keyword category mark replaces the corresponding keywords in the first training corpus;
a processing module, configured to perform training according to the first training corpus to generate a basic language model;
the acquisition module is further configured to acquire at least one second training corpus, wherein the at least one second training corpus corresponds one-to-one to the at least one keyword category mark, and each second training corpus comprises the set of keywords corresponding to its keyword category mark;
the processing module is further configured to generate at least one keyword language model according to the at least one second training corpus respectively;
the processing module is further configured to generate a decoder according to the basic language model and the at least one keyword language model, the decoder being configured to decode input speech to output text conforming to natural language.
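The training flow above can be sketched end to end. "Training" here is reduced to a toy unigram count so the example stays self-contained; the category name, mark string, and corpora are invented for illustration.

```python
# Sketch of the training pipeline: (1) replace keywords in the first
# training corpus with their category marks, (2) train the basic language
# model on the marked corpus, (3) train one keyword language model per
# category from its own keyword set (the second training corpus).

from collections import Counter

def replace_keywords(corpus, categories):
    """categories: {category mark -> set of keywords}. Returns the corpus
    with every keyword occurrence replaced by its category mark."""
    marked = []
    for sentence in corpus:
        tokens = []
        for tok in sentence.split():
            mark = next((m for m, kws in categories.items() if tok in kws), None)
            tokens.append(mark or tok)
        marked.append(" ".join(tokens))
    return marked


categories = {"<contact>": {"zhangsan", "lisi"}}          # invented category
corpus = ["call zhangsan now", "call lisi now"]

first_corpus = replace_keywords(corpus, categories)        # basic-LM training data
base_lm = Counter(tok for s in first_corpus for tok in s.split())
keyword_lms = {mark: Counter(kws) for mark, kws in categories.items()}

print(first_corpus)          # ['call <contact> now', 'call <contact> now']
print(base_lm["<contact>"])  # 2
```

Note how both sentences collapse to the same marked form, so the basic model learns the context around the category mark once, regardless of how many keywords the category contains.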
According to one or more embodiments of the present disclosure, the processing module is specifically configured to generate the decoder according to an acoustic model, a pronunciation dictionary and a language model, wherein the acoustic model is used for converting the speech into a phoneme sequence, the pronunciation dictionary is used for converting the phoneme sequence into a word sequence, and the language model is used for converting the word sequence into the text conforming to the natural language; the language model comprises the basic language model and the at least one keyword language model, and is configured to, when the basic language model outputs a keyword category mark, invoke the keyword language model corresponding to that keyword category mark to perform a search and return the corresponding keywords.
According to one or more embodiments of the present disclosure, the processing module is further configured to decode the input speech with the decoder to output text conforming to natural language, wherein the processing module is specifically configured to:
perform a pruning operation on a first weighted finite-state transducer, the first weighted finite-state transducer being generated from the basic language model; during the pruning operation on the first weighted finite-state transducer, if the output of the currently processed edge is a keyword category mark, determine, from at least one second weighted finite-state transducer, the second weighted finite-state transducer corresponding to that keyword category mark, the at least one second weighted finite-state transducer being generated from the at least one keyword language model respectively; search for keywords in the second weighted finite-state transducer corresponding to the keyword category mark and output the keywords; and after the keyword search is finished, return to the first weighted finite-state transducer to continue the pruning operation.
According to one or more embodiments of the present disclosure, the processing module is specifically configured to: output a keyword start mark before the keyword search starts, wherein the keyword start mark indicates that output of the keywords corresponding to the keyword category mark is about to begin; and output a keyword end mark after the keyword search is finished, wherein the keyword end mark indicates that output of the keywords corresponding to the keyword category mark has ended.
In a fifth aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory such that the at least one processor performs the method for speech recognition as described in the first aspect and the various possible designs of the first aspect, or performs the method for speech recognition as described in the second aspect and the various possible designs of the second aspect.
In a sixth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method for speech recognition as described in the first aspect and the various possible designs of the first aspect, or perform the method for speech recognition as described in the second aspect and the various possible designs of the second aspect.
In a seventh aspect, according to one or more embodiments of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method for speech recognition as described in the first aspect and the various possible designs of the first aspect, or performs the method for speech recognition as described in the second aspect and the various possible designs of the second aspect.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure referred to herein is not limited to the specific combinations of features described above, but also covers other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the features described above with (but not limited to) technical features having similar functions disclosed in the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (12)

1. A method for speech recognition, comprising:
acquiring voice to be recognized;
inputting the voice into a decoder to output text conforming to natural language, wherein the decoder comprises an acoustic model for converting the voice into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the text conforming to the natural language;
the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword category mark, the keyword category mark replaces the corresponding keywords in the first training corpus, and each keyword category mark corresponds to a set of keywords of one category;
the at least one keyword language model is generated by training on at least one second training corpus respectively, the at least one second training corpus corresponds one-to-one to the at least one keyword category mark, and each second training corpus comprises the set of keywords corresponding to its keyword category mark;
and the language model is configured to, when the basic language model outputs a keyword category mark, invoke the keyword language model corresponding to that keyword category mark to perform a search and return the corresponding keywords.
2. The method of claim 1, wherein the inputting the speech to a decoder to output text conforming to natural language comprises:
performing a pruning operation on a first weighted finite-state transducer, wherein the first weighted finite-state transducer is generated from the basic language model;
during the pruning operation on the first weighted finite-state transducer, if the output of the currently processed edge is a keyword category mark, determining, from at least one second weighted finite-state transducer, the second weighted finite-state transducer corresponding to that keyword category mark, wherein the at least one second weighted finite-state transducer is generated from the at least one keyword language model respectively;
searching for keywords in the second weighted finite-state transducer corresponding to the keyword category mark and outputting the keywords;
and after the keyword search is finished, returning to the first weighted finite-state transducer to continue the pruning operation.
3. The method of claim 2, wherein the inputting the speech to a decoder to output text conforming to natural language comprises:
outputting a keyword start mark before the keyword search starts, wherein the keyword start mark indicates that output of the keywords corresponding to the keyword category mark is about to begin;
and outputting a keyword end mark after the keyword search is finished, wherein the keyword end mark indicates that output of the keywords corresponding to the keyword category mark has ended.
4. A method for speech recognition, comprising:
acquiring a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a set of keywords of one category, and the at least one keyword category mark replaces the corresponding keywords in the first training corpus;
training according to the first training corpus to generate a basic language model;
acquiring at least one second training corpus, wherein the at least one second training corpus corresponds one-to-one to the at least one keyword category mark, and each second training corpus comprises the set of keywords corresponding to its keyword category mark;
generating at least one keyword language model according to the at least one second training corpus respectively;
and generating a decoder according to the basic language model and the at least one keyword language model, wherein the decoder is used for decoding input voice so as to output text conforming to natural language.
5. The method of claim 4, wherein the generating a decoder according to the basic language model and the at least one keyword language model comprises:
generating the decoder according to an acoustic model, a pronunciation dictionary and a language model, wherein the acoustic model is used for converting the voice into a phoneme sequence, the pronunciation dictionary is used for converting the phoneme sequence into a word sequence, and the language model is used for converting the word sequence into the text conforming to the natural language;
the language model comprises the basic language model and the at least one keyword language model, and the language model is configured to, when the basic language model outputs a keyword category mark, invoke the keyword language model corresponding to that keyword category mark to perform a search and return the corresponding keywords.
6. The method of claim 4 or 5, wherein the method further comprises: decoding the input speech with the decoder to output text conforming to natural language, wherein the decoding the input speech with the decoder comprises:
performing a pruning operation on a first weighted finite-state transducer, wherein the first weighted finite-state transducer is generated from the basic language model;
during the pruning operation on the first weighted finite-state transducer, if the output of the currently processed edge is a keyword category mark, determining, from at least one second weighted finite-state transducer, the second weighted finite-state transducer corresponding to that keyword category mark, wherein the at least one second weighted finite-state transducer is generated from the at least one keyword language model respectively;
searching for keywords in the second weighted finite-state transducer corresponding to the keyword category mark and outputting the keywords;
and after the keyword search is finished, returning to the first weighted finite-state transducer to continue the pruning operation.
7. The method of claim 6, wherein the decoding the input speech with the decoder comprises:
outputting a keyword start mark before the keyword search starts, wherein the keyword start mark indicates that output of the keywords corresponding to the keyword category mark is about to begin;
and outputting a keyword end mark after the keyword search is finished, wherein the keyword end mark indicates that output of the keywords corresponding to the keyword category mark has ended.
8. An apparatus for speech recognition, comprising:
the acquisition module is used for acquiring the voice to be recognized;
a processing module, configured to input the voice into a decoder to output text conforming to natural language, wherein the decoder comprises an acoustic model for converting the voice into a phoneme sequence, a pronunciation dictionary for converting the phoneme sequence into a word sequence, and a language model for converting the word sequence into the text conforming to the natural language;
the language model comprises a basic language model and at least one keyword language model, wherein a first training corpus used by the basic language model comprises at least one keyword category mark, the keyword category mark replaces the corresponding keywords in the first training corpus, and each keyword category mark corresponds to a set of keywords of one category;
the at least one keyword language model is generated by training on at least one second training corpus respectively, the at least one second training corpus corresponds one-to-one to the at least one keyword category mark, and each second training corpus comprises the set of keywords corresponding to its keyword category mark;
and the language model is configured to, when the basic language model outputs a keyword category mark, invoke the keyword language model corresponding to that keyword category mark to perform a search and return the corresponding keywords.
9. An apparatus for speech recognition, comprising:
an acquisition module, configured to acquire a first training corpus, wherein the first training corpus comprises at least one keyword category mark, each keyword category mark corresponds to a set of keywords of one category, and the at least one keyword category mark replaces the corresponding keywords in the first training corpus;
a processing module, configured to perform training according to the first training corpus to generate a basic language model;
the acquisition module is further configured to acquire at least one second training corpus, wherein the at least one second training corpus corresponds one-to-one to the at least one keyword category mark, and each second training corpus comprises the set of keywords corresponding to its keyword category mark;
the processing module is further configured to generate at least one keyword language model according to the at least one second training corpus respectively;
the processing module is further configured to generate a decoder according to the basic language model and the at least one keyword language model, the decoder being configured to decode input speech to output text conforming to natural language.
10. An electronic device, comprising: a processor and a memory;
the memory stores computer-executable instructions;
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the method of any one of claims 1 to 3 or the method of any one of claims 4 to 7.
11. A computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 3 or the method of any one of claims 4 to 7.
12. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 3 or the method of any one of claims 4 to 7.
CN202210766769.9A 2022-06-30 2022-06-30 Method and equipment for voice recognition Pending CN117373458A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210766769.9A CN117373458A (en) 2022-06-30 2022-06-30 Method and equipment for voice recognition

Publications (1)

Publication Number Publication Date
CN117373458A true CN117373458A (en) 2024-01-09

Family

ID=89402760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210766769.9A Pending CN117373458A (en) 2022-06-30 2022-06-30 Method and equipment for voice recognition

Country Status (1)

Country Link
CN (1) CN117373458A (en)

Similar Documents

Publication Publication Date Title
US20230298562A1 (en) Speech synthesis method, apparatus, readable medium, and electronic device
CN111369971B (en) Speech synthesis method, device, storage medium and electronic equipment
US7818170B2 (en) Method and apparatus for distributed voice searching
CN111489735B (en) Voice recognition model training method and device
CN111951779B (en) Front-end processing method for speech synthesis and related equipment
CN110070859B (en) Voice recognition method and device
CN111144128A (en) Semantic parsing method and device
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
CN111883117B (en) Voice wake-up method and device
CN111341308A (en) Method and apparatus for outputting information
US20230326446A1 (en) Method, apparatus, storage medium, and electronic device for speech synthesis
US20230259712A1 (en) Sound effect adding method and apparatus, storage medium, and electronic device
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
WO2023272616A1 (en) Text understanding method and system, terminal device, and storage medium
JP2020187340A (en) Voice recognition method and apparatus
CN111667810A (en) Method and device for acquiring polyphone corpus, readable medium and electronic equipment
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN112818657B (en) Method and device for determining pronunciation of polyphone, electronic equipment and storage medium
CN113672699A (en) Knowledge graph-based NL2SQL generation method
CN112837672B (en) Method and device for determining conversation attribution, electronic equipment and storage medium
Bangalore et al. Balancing data-driven and rule-based approaches in the context of a multimodal conversational system
WO2023179506A1 (en) Prosody prediction method and apparatus, and readable medium and electronic device
US20240096347A1 (en) Method and apparatus for determining speech similarity, and program product
CN117373458A (en) Method and equipment for voice recognition
CN113223496A (en) Voice skill testing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination