CN115101054A - Voice recognition method, device and equipment based on hot word graph and storage medium - Google Patents

Voice recognition method, device and equipment based on hot word graph and storage medium

Info

Publication number
CN115101054A
Authority
CN
China
Prior art keywords
acoustic
score
hotword
arc
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210836962.5A
Other languages
Chinese (zh)
Inventor
庄子扬
魏韬
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210836962.5A
Publication of CN115101054A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L2015/088 Word spotting
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to artificial intelligence technology, and discloses a speech recognition method, apparatus, device and storage medium based on a hot word graph. The method includes: acquiring voice data; performing feature extraction on the voice data to obtain acoustic features; inputting the acoustic features frame by frame into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; performing a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from a hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and determining a recognition result based on the acoustic scores and the hotword scores. The method and the device improve the accuracy of speech recognition.

Description

Voice recognition method, device and equipment based on hot word graph and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium based on a hot word graph.
Background
With the continuous development of society and technology, speech recognition technology is applied more and more widely. In existing speech recognition applications, common words are recognized well, but recognition accuracy is low for certain person names, song titles, place names or specialized terms in a particular field, such as the person name "Song XX", the song title "The Internationale", the place name "beauty business district" and the speech recognition term "decoder". For this situation, the prior art often recognizes proper nouns by adding hot words, but matching proper nouns through hot words alone still does not yield high recognition accuracy. Therefore, how to solve the problem of low speech recognition accuracy has become an urgent problem to be solved.
Disclosure of Invention
The present application provides a speech recognition method, apparatus, device and storage medium based on a hot word graph, aiming to solve the problem of the low accuracy of existing speech recognition.
In order to solve the above problem, the present application provides a speech recognition method based on a hot word graph, including:
acquiring voice data;
performing feature extraction on the voice data to obtain acoustic features;
inputting the acoustic features frame by frame into an ASR model for recognition processing to obtain a plurality of candidate words and their corresponding acoustic probabilities;
performing a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from a hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set;
determining a recognition result based on the acoustic score and the hotword score.
Further, the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set, and the construction includes:
splitting the hot words in the preset hot word list to obtain words to be processed;
constructing arcs connecting the nodes in the hot word graph by using the words to be processed, and setting corresponding arc weights, wherein the words to be processed correspond to the arcs one by one, and the plurality of words to be processed corresponding to a hot word form a closed loop in the hot word graph;
determining corresponding near-homophone characters from the near-homophone confusion set based on the words to be processed;
and constructing arcs from the near-homophone characters, placing them at the first positions of the corresponding words to be processed in the hot word graph, and setting arc weights for the arcs corresponding to the near-homophone characters.
Further, after the constructing of the arcs connecting the nodes in the hot word graph by using the words to be processed, the method further includes:
setting a back-off arc on each node, wherein the back-off arc is an arc connecting the node to the initial node, and the weight corresponding to the back-off arc is the negative of the weight accumulated at the node;
the acquiring of the corresponding hotword score from the hot word graph based on the sentence combination includes:
traversing the hot word graph in sequence based on the arrangement order of the words in the sentence combination;
when one or more words in the sentence combination have been determined, judging, according to the second position of the determined word or words in the hot word graph, whether the next word in the sentence combination is associated with an arc connected to that second position:
if associated, acquiring the arc weight corresponding to the associated arc;
if not associated, returning to the initial node through the back-off arc corresponding to the second position, and acquiring the arc weight corresponding to the back-off arc;
and obtaining the hotword score based on the arc weights corresponding to the words in the sentence combination.
Further, the extracting the features of the voice data to obtain the acoustic features includes:
pre-emphasizing the voice data through a filter;
framing the pre-emphasized data to obtain multiple frames of data to be processed;
windowing the data to be processed;
performing fast Fourier transform on the windowed data to be processed to obtain an energy spectrum;
and performing feature extraction on the energy spectrum through a Mel filter bank to obtain a first acoustic feature.
Further, after the performing feature extraction on the energy spectrum through the mel filter bank to obtain a first acoustic feature, the method further includes:
and performing discrete cosine transform on the first acoustic feature to obtain a second acoustic feature.
Further, the determining a recognition result based on the acoustic score and the hotword score includes:
determining a total score of each sentence combination through the acoustic score and the hotword score;
and taking the sentence combination with the highest total score as the recognition result.
Further, the determining the total score of each sentence combination through the acoustic score and the hotword score comprises:
determining the total score of each of the sentence combinations by the following formula:
y* = argmax[ log P(y|x) + λ log P_C(y) ], where y* represents the sentence combination with the highest total score, P(y|x) represents the acoustic score, P_C(y) represents the hotword score, and λ represents the hotword enhancement coefficient;
in order to solve the above problem, the present application further provides a speech recognition apparatus based on a hot word graph, the apparatus including:
the acquisition module is used for acquiring voice data;
the feature extraction module is used for extracting features of the voice data to obtain acoustic features;
the recognition module is used for inputting the acoustic features frame by frame into an ASR model for recognition processing to obtain a plurality of candidate words and their corresponding acoustic probabilities;
the calculation module is used for performing a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from a hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set;
a determination module to determine a recognition result based on the acoustic score and the hotword score.
In order to solve the above problem, the present application also provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a hot word graph-based speech recognition method as described above.
To solve the above problem, the present application also provides a non-transitory computer-readable storage medium having stored thereon computer-readable instructions that, when executed by a processor, implement the speech recognition method based on a hot word graph as described above.
Compared with the prior art, the speech recognition method, apparatus, device and storage medium based on the hot word graph provided by the embodiments of the present application have the following beneficial effects:
acquiring voice data and performing feature extraction on the voice data to obtain acoustic features, so that the voice data is preprocessed and the processing efficiency and quality of the subsequent steps are improved; inputting the acoustic features into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; performing a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from a hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and, by deriving the acoustic score and the hotword score respectively, determining the recognition result based on the acoustic score and the hotword score. The accuracy of speech recognition is thereby improved.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the description below are some embodiments of the present application, and other drawings can be obtained by those skilled in the art without inventive efforts.
Fig. 1 is a schematic flowchart of a speech recognition method based on a hot word graph according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating one embodiment of step S4 of FIG. 1;
FIG. 3 is a basic graph according to an embodiment of the present application;
FIG. 4 is a hot word graph according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating one embodiment of step S5 of FIG. 1;
FIG. 6 is a block diagram of a speech recognition apparatus based on a hot word graph according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. One skilled in the art will explicitly or implicitly appreciate that the embodiments described herein can be combined with other embodiments.
The application provides a speech recognition method based on a hot word graph. Referring to fig. 1, fig. 1 is a schematic flowchart of a speech recognition method based on a hot word graph according to an embodiment of the present application.
In this embodiment, the speech recognition method based on a hot word graph includes:
s1, acquiring voice data;
specifically, the speech recognition method based on a hot word graph can be used in practical application scenarios such as real-time or non-real-time speech-to-text processing and intelligent voice assistants. When the method is used for real-time speech-to-text processing or an intelligent voice assistant, the voice data is acquired by receiving the voice data input by the user in real time; in a non-real-time scenario, such as non-real-time speech-to-text processing, the voice data is acquired from a database or a memory.
S2, extracting the characteristics of the voice data to obtain acoustic characteristics;
specifically, the voice data is segmented into frames and the acoustic features corresponding to each frame are extracted, which improves the accuracy of the subsequent candidate word recognition by the ASR model. In the present application, MFCC or Fbank acoustic features may be used.
Further, the extracting the features of the voice data to obtain the acoustic features includes:
pre-emphasizing the voice data through a filter;
framing the pre-emphasized data to obtain multiple frames of data to be processed;
windowing the data to be processed;
performing fast Fourier transform on the windowed data to be processed to obtain an energy spectrum;
and performing feature extraction on the energy spectrum through a Mel filter bank to obtain a first acoustic feature.
Specifically, in this embodiment, in order to obtain the Fbank (filter banks) features, a filter is first used to balance the spectrum of the voice data and boost the amplitude of the high-frequency part, because speech signals tend to exhibit spectral tilt, i.e. the amplitude of the high-frequency part is smaller than that of the low-frequency part; a first-order filter is used for this pre-emphasis.
After pre-emphasis, the pre-emphasized data is divided into short-time frames to obtain multiple frames of data to be processed. The frequencies in a signal vary over time (the signal is non-stationary), and some signal processing algorithms (such as the Fourier transform) generally assume a stationary signal, so processing the entire signal at once is meaningless because the frequency contour of the signal would be lost over time. To avoid this, the signal is framed, and the signal within each frame is regarded as stationary. In this application, the frame length is 25 ms and the frame shift is 10 ms.
After framing, each frame of data to be processed is windowed so that the two ends of the frame decay smoothly; this reduces the intensity of the side lobes after the subsequent Fourier transform and yields a higher-quality spectrum. Commonly used windows include the rectangular window, the Hamming window and the Hanning window.
For the windowed data to be processed of each frame, an N-point Fast Fourier Transform (FFT) is performed, where N is usually 256 or 512, and the energy spectrum is calculated by using the following formula:
P_i = |FFT(x_i)|^2 / N, where x_i is the i-th frame of windowed data to be processed and N is the number of FFT points.
finally, a mel filter bank is applied to the energy spectrum to extract the Fbank features, i.e. the first acoustic features. The mel filter bank is a series of triangular filters, usually 40 or 80, each with a response of 1 at its center frequency that decays to 0 at the center frequencies of the filters on both sides.
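As a minimal sketch of the Fbank pipeline just described (pre-emphasis, framing, windowing, FFT, mel filter bank), the following Python code uses the 25 ms frame length, 10 ms frame shift, 512-point FFT and 40 mel filters mentioned above; the 0.97 pre-emphasis coefficient, the Hamming window choice and the use of librosa.filters.mel to build the triangular filters are illustrative assumptions, not details from the patent.

```python
import numpy as np
import librosa

def fbank_features(signal, sample_rate=16000, n_fft=512, n_mels=40):
    # Pre-emphasis with a first-order filter: y[n] = x[n] - 0.97 * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Framing: 25 ms frame length, 10 ms frame shift
    frame_len = int(0.025 * sample_rate)
    frame_shift = int(0.010 * sample_rate)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # Windowing (Hamming here) so the frame edges decay smoothly
    frames = frames * np.hamming(frame_len)

    # N-point FFT and energy (power) spectrum |FFT(x_i)|^2 / N
    power_spec = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular mel filter bank applied to the energy spectrum
    mel_fb = librosa.filters.mel(sr=sample_rate, n_fft=n_fft, n_mels=n_mels)
    fbank = np.log(np.dot(power_spec, mel_fb.T) + 1e-10)   # log mel energies = Fbank
    return fbank                                            # shape: (n_frames, n_mels)
```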
The voice data are subjected to feature extraction to obtain a first acoustic feature, and the processing quality and efficiency of a subsequent ASR model are improved.
Still further, after the performing feature extraction on the energy spectrum through the mel filter bank to obtain a first acoustic feature, the method further includes:
and performing discrete cosine transform on the first acoustic feature to obtain a second acoustic feature.
Specifically, after the first acoustic feature is obtained, the Fbank feature can be further processed to obtain the MFCC feature, i.e. the second acoustic feature: the Fbank feature is subjected to a discrete cosine transform to compress the correlated filter bank coefficients. For ASR, dimensions 2-13 are usually retained, and the discarded information corresponds to the rapidly changing components of the filter bank coefficients.
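A minimal sketch of this discrete cosine transform step, assuming the Fbank matrix from the previous sketch; keeping coefficients 2-13 follows the dimensions mentioned above, while the use of scipy's dct with orthonormal scaling is an illustrative choice rather than the patent's specification.

```python
from scipy.fftpack import dct

def mfcc_from_fbank(fbank):
    # Decorrelate the log mel (Fbank) features along the filter-bank axis
    # and keep only coefficients 2-13 as the MFCC (second acoustic) feature.
    return dct(fbank, type=2, axis=-1, norm='ortho')[:, 1:13]
```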
Compared with the MFCC feature, the Fbank feature is closer to the nature of the sound signal and better fits the response characteristics of the human ear, and the MFCC feature requires more computation; on the other hand, the Fbank feature is highly correlated (adjacent filter banks overlap), while the MFCC feature has better discriminability. The Fbank feature also retains more information than the MFCC feature.
And the voice data is further subjected to feature extraction to obtain a second acoustic feature, so that the processing quality and efficiency of a subsequent ASR model are improved.
S3, inputting the acoustic features into an ASR model in frames for recognition processing to obtain a plurality of candidate words and corresponding acoustic probabilities thereof;
specifically, the ASR model is an end-to-end model in which the conventional acoustic model, pronunciation dictionary and language model are fused together, and a single objective function consistent with the ASR target is used to optimize the entire network, which ensures global optimality and allows characters or even words to be output directly. The acoustic features are input into the ASR model frame by frame for recognition, so as to obtain the candidate words corresponding to each frame and the acoustic probabilities corresponding to those candidate words.
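The patent does not specify the model interface, but a schematic sketch of this frame-by-frame recognition step might look as follows; asr_model is a hypothetical callable that maps one feature frame to a probability distribution over candidate words, and top_k mirrors the beam width used in the next step.

```python
def frame_candidates(asr_model, features, top_k=3):
    """For each feature frame, keep the top_k candidate words and their probabilities."""
    candidates = []
    for frame in features:                       # features are fed frame by frame
        probs = asr_model(frame)                 # hypothetical: dict of word -> probability
        best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        candidates.append(best)                  # [(word, prob), ...] for this frame
    return candidates
```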
S4, performing a beam search on the candidate words corresponding to each frame and their acoustic probabilities to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from a hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set;
specifically, a beam search is performed based on the candidate words corresponding to each frame and their acoustic probabilities. In an embodiment of the present application, the beam width is 3: first, three words are determined based on the candidate words of the first frame and their acoustic probabilities, taking the words with the three highest acoustic probabilities; the word of the next frame is then determined on the basis of the three words retained from the first frame, so the candidate words and acoustic probabilities of the current frame are evaluated as conditional probabilities given the retained words, and the words with the three highest conditional probabilities are kept; and so on until the words of every frame have been determined, yielding a preset number of sentence combinations and their corresponding acoustic scores, where the acoustic score is the product of the acoustic probabilities of the words in the sentence combination. The hot word graph is then traversed in sequence based on the order of the words in the sentence combination; for example, after the first word in the sentence combination has been matched, the second query is performed starting from the position of that first word in the hot word graph, i.e. the second word must be associated with the first word, and so on until all the words of the sentence combination have been traversed or the words in the sentence combination can no longer be matched, at which point the traversal exits. The hot word graph is constructed based on a preset hot word list and a near-homophone confusion set, and is a tree-like graph; the hot words are specific person names, song titles, place names or specialized terms in a certain field.
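A minimal beam-search sketch over the per-frame candidates, using beam width 3 as in the example above. The acoustic score is accumulated here as a sum of log probabilities, which is equivalent to the product of acoustic probabilities described in the text; the data layout is an assumption for illustration.

```python
import math

def beam_search(frame_candidates, beam_width=3):
    beams = [([], 0.0)]                                  # (word sequence, log acoustic score)
    for candidates in frame_candidates:                  # candidates: [(word, prob), ...]
        expanded = [(seq + [word], score + math.log(prob))
                    for seq, score in beams
                    for word, prob in candidates]
        # keep only the beam_width best hypotheses for the next frame
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams                                         # sentence combinations and acoustic scores
```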
Further, as shown in fig. 2, the construction of the hot word graph based on the preset hot word list and the near-homophone confusion set includes:
splitting the hot words in the preset hot word list to obtain words to be processed;
constructing arcs connecting the nodes in the hot word graph by using the words to be processed, and setting corresponding arc weights, wherein the words to be processed correspond to the arcs one by one, and the plurality of words to be processed corresponding to a hot word form a closed loop in the hot word graph;
determining corresponding near-homophone characters from the near-homophone confusion set based on the words to be processed;
and constructing arcs from the near-homophone characters, placing them at the first positions of the corresponding words to be processed in the hot word graph, and setting arc weights for the arcs corresponding to the near-homophone characters.
Specifically, the tree construction process is particularly important, and the construction quality of the tree directly affects the recognition accuracy of the hot words.
First, a preset hot word list is acquired from a database, and the hot words in the preset hot word list are split to obtain the characters corresponding to each hot word; for example, if the hot words are 平安大厦 ("Ping An Building"), 欧阳修 ("Ouyang Xiu") and 维达 ("Weida"), they are split into the characters to be processed 平/安/大/厦, 欧/阳/修 and 维/达. After each hot word has been split, the hot word graph is constructed, with the characters corresponding to all the hot words built starting from the initial node. As shown in fig. 3, the basic graph is constructed according to the preset hot word list; the initial node is represented by node 0 in this application, the characters to be processed are used to construct the arcs connecting the nodes in the hot word graph, and the arcs are constructed in sequence according to the order of the characters to be processed in each hot word. The labels on each arc represent input, output and weight, and a complete hot word starts from node 0 and ends at node 0, forming a closed loop.
After the basic graph is obtained based on the hot words in the preset hot word list, the near-homophone characters corresponding to the words to be processed are obtained from the near-homophone confusion set. For the hot word 维达 ("Weida"), for example, the character 维 has the near-homophone 唯 (this is only an example; in practice there are a large number of near-homophone characters). In the basic graph, the position of the character 维 is first determined, between node 0 and node 3; as shown in fig. 4, an arc corresponding to 唯 is therefore constructed and placed between node 0 and node 3, and for this arc the input, output and weight are 唯, 维 and 1 respectively. Since the near-homophone characters are only used for matching to improve the recognition rate of the hot words, they are corrected on output, i.e. 维 is output, and the weight remains consistent. The complete hot word graph is thereby constructed. The weight represents the reward score for a successful match.
In practice, the content on each arc need not be displayed; it is used in subsequent traversals as an implicit attribute of the arc.
The hot word graph is constructed by using the preset hot word list and the near-homophone confusion set, which improves the accuracy of subsequent hot word recognition.
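A minimal sketch of this construction, assuming the hot word graph is stored as a list of arcs (from node, to node, input character, output character, weight). For simplicity the sketch builds a separate chain per hot word rather than sharing common prefixes as a tree, and it uses a uniform arc weight of 1, matching the example weights above; the exact data structure is an assumption, not the patent's.

```python
def build_hotword_graph(hotwords, confusion_set):
    arcs = []          # (from_node, to_node, input_char, output_char, weight)
    next_node = 1
    for word in hotwords:
        chars = list(word)                                   # split the hot word into characters
        prev = 0
        for i, ch in enumerate(chars):
            # the last character closes the loop back to the initial node 0
            dest = 0 if i == len(chars) - 1 else next_node
            if dest != 0:
                next_node += 1
            arcs.append((prev, dest, ch, ch, 1))             # reward weight 1 per matched character
            # parallel arcs for near-homophone characters: the input is the confusable
            # character, the output is corrected to the hot word character, same weight
            for near in confusion_set.get(ch, []):
                arcs.append((prev, dest, near, ch, 1))
            prev = dest
    return arcs
```

For example, build_hotword_graph(['维达'], {'维': ['唯']}) produces a 维 arc and a parallel 唯 arc (output corrected to 维) leaving node 0, followed by a 达 arc closing the loop, analogous to the example of fig. 4.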
Still further, after the constructing of the arcs connecting the nodes in the hot word graph by using the words to be processed, the method further includes:
setting a back-off arc on each node, wherein the back-off arc is an arc connecting the node to the initial node, and the weight corresponding to the back-off arc is the negative of the weight accumulated at the node;
the acquiring of the corresponding hotword score from the hot word graph based on the sentence combination includes:
traversing the hot word graph in sequence based on the arrangement order of the words in the sentence combination;
when one or more words in the sentence combination have been determined, judging, according to the second position of the determined word or words in the hot word graph, whether the next word in the sentence combination is associated with an arc connected to that second position:
if associated, acquiring the arc weight corresponding to the associated arc;
if not associated, returning to the initial node through the back-off arc corresponding to the second position, and acquiring the arc weight corresponding to the back-off arc;
and obtaining the hotword score based on the arc weights corresponding to the words in the sentence combination.
Specifically, as shown in fig. 3 or fig. 4, a back-off arc is further provided on each node as an exit mechanism for the case where part of a hot word prefix has been matched but the following characters do not match. For example, when the hot word is 平安大厦 ("Ping An Building") and the existing sentence combination is 平安人, traversing the hot word graph with 平安人 in sequence matches the prefix 平安 of 平安大厦 but fails on the subsequent content, so the traversal step must be exited. Since the two characters 平安 were matched with a score of 2, i.e. the traversal has already reached node 4, and the match as a whole did not succeed, no hotword score should be awarded: the back-off arc of node 4 therefore returns to node 0, and because that back-off arc has a weight of -2, the corresponding hotword score after returning to node 0 is 0. The specific calculation of the hotword score is the sum of the weights along the path traversed during matching, i.e. the sum of the weights of the arcs passed. The weight corresponding to a back-off arc is the negative of the weight accumulated at the node: for example, node 1 and node 2 are both reached from node 0 through one arc, so their corresponding weight is 1; similarly, node 4 is reached from node 0 through two arcs, so its corresponding weight is 2, and the weight of the back-off arc corresponding to node 4 is -2.
After the hot word graph has been constructed, when a sentence combination is subsequently used to traverse it, the hot word graph is traversed in sequence based on the arrangement order of the characters in the sentence combination. That is, the first character in the sentence combination is determined and used to traverse the hot word graph, judging whether there is a matching character in the graph; if there is, the second character is judged next; if not, the hotword contribution is directly taken as 0. The second character is judged based on the position of the first character in the hot word graph, and the position of a matching second character must be associated with the position of the first character; for example, the 安 in 平安大厦 must follow 平, i.e. both characters are connected through node 1. This continues in turn until all the characters in the sentence combination have been traversed. Whether the next character in the sentence combination is associated with an arc connected to the determined second position in the hot word graph is judged according to that position: the determined character or characters are equivalent to a root node, and it is judged whether the next character is a child node below that root node.
If it is such a child node, the arc weight corresponding to the associated arc is acquired;
if it is not a child node below that node, there is no association, and the traversal returns to the initial node through the back-off arc corresponding to the second position, acquiring the arc weight corresponding to the back-off arc;
and the hotword score is obtained based on the arc weights corresponding to the characters in the sentence combination, i.e. the sum of the arc weights of the arcs passed along the matched path.
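A minimal sketch of this traversal, assuming the graph arcs have been indexed into a transition table {(node, input character): (next node, arc weight)}; the back-off behaviour is modelled by subtracting the score accumulated on the partially matched prefix, which corresponds to taking a back-off arc whose weight is the negative of the node's accumulated weight.

```python
def hotword_score(sentence, transitions):
    # transitions: {(node, input_char): (next_node, arc_weight)} built from the graph arcs
    node, prefix_score, total = 0, 0, 0
    for ch in sentence:
        if (node, ch) in transitions:
            node, w = transitions[(node, ch)]
            prefix_score += w
            total += w
            if node == 0:                      # a complete hot word was matched (closed loop)
                prefix_score = 0
        else:
            total -= prefix_score              # back-off arc: cancel the partially matched prefix
            node, prefix_score = 0, 0
            # after falling back, the current character may itself start a new hot word
            if (0, ch) in transitions:
                node, w = transitions[(0, ch)]
                prefix_score, total = w, total + w
    return total - prefix_score                # an unfinished prefix at the end earns no reward
```

With the 平安大厦 example above, matching 平安 accumulates a score of 2, the unmatched next character triggers the back-off with weight -2, and the hotword score returns to 0, consistent with the behaviour described in the text.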
The recognition accuracy of the specific nouns in the speech recognition is improved by using the hot word graph to recognize the hot words, and the accuracy of the whole speech recognition is further improved.
And S5, determining a recognition result based on the acoustic score and the hotword score.
Specifically, after the acoustic score and the hotword score of each sentence combination are obtained, the total score of each sentence combination is obtained according to a series of transformations, and the sentence combination with the highest total score is selected as the recognition result.
Further, as shown in fig. 5, the determining the recognition result based on the acoustic score and the hotword score includes:
determining a total score of each sentence combination through the acoustic score and the hotword score;
and taking the sentence combination with the highest total score as the recognition result.
Specifically, the acoustic score and the hotword score corresponding to each sentence combination are each subjected to a series of linear transformations and then added to obtain the total score of the sentence combination, and the sentence combination with the highest total score is taken as the recognition result.
By determining the final recognition result based on the acoustic score and the hotword score, the accuracy of speech recognition is improved.
Still further, the determining the total score of each sentence combination through the acoustic score and the hotword score includes:
determining the total score of each of the sentence combinations by the following formula:
y* = argmax[ log P(y|x) + λ log P_C(y) ], where y* represents the sentence combination with the highest total score, P(y|x) represents the acoustic score, P_C(y) represents the hotword score, and λ represents the hotword enhancement coefficient.
Calculating with this formula improves the accuracy of the calculation and thus the accuracy of speech recognition.
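A minimal sketch of applying this formula to the sentence combinations produced by the beam search; the value of the hotword enhancement coefficient lam is an assumed placeholder, since the patent does not fix it.

```python
def pick_result(combinations, lam=0.5):
    # combinations: [(sentence, log_acoustic_score, log_hotword_score), ...]
    def total(item):
        _, log_p_acoustic, log_p_hotword = item
        return log_p_acoustic + lam * log_p_hotword   # log P(y|x) + lambda * log P_C(y)
    best = max(combinations, key=total)
    return best[0], total(best)                        # recognition result and its total score
```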
It is emphasized that, in order to further ensure the privacy and security of the data, the voice data and its corresponding recognition result may also be stored in nodes of a blockchain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The speech recognition method based on a hot word graph of this embodiment acquires voice data and performs feature extraction on the voice data to obtain acoustic features, thereby preprocessing the voice data and improving the processing efficiency and quality of the subsequent steps; inputs the acoustic features into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; performs a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquires corresponding hotword scores from the hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and, by deriving the acoustic score and the hotword score respectively, determines the recognition result based on the acoustic score and the hotword score. The accuracy of speech recognition is improved.
The present embodiment further provides a speech recognition apparatus based on a hot word graph; fig. 6 is a functional block diagram of the speech recognition apparatus based on a hot word graph according to the present application.
The speech recognition apparatus 100 based on a hot word graph according to the present application may be installed in an electronic device. According to the implemented functions, the speech recognition apparatus 100 based on a hot word graph may comprise an acquisition module 101, a feature extraction module 102, a recognition module 103, a calculation module 104 and a determination module 105. A module, which may also be referred to as a unit in this application, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions of the respective modules/units are as follows:
an obtaining module 101, configured to obtain voice data;
A feature extraction module 102, configured to perform feature extraction on the voice data to obtain an acoustic feature;
further, the feature extraction module 102 includes a pre-emphasis sub-module, a framing sub-module, a windowing sub-module, a first transformation sub-module, and a mel-filtering sub-module;
the pre-emphasis sub-module is used for pre-emphasizing the voice data through a filter;
the framing submodule is used for framing the pre-emphasized data to obtain multiple frames of data to be processed;
the windowing submodule is used for windowing the data to be processed;
the first transformation submodule is used for carrying out fast Fourier transformation on the windowed data to be processed to obtain an energy spectrum;
the Mel filtering submodule is used for extracting the characteristics of the energy spectrum through a Mel filter bank to obtain first acoustic characteristics.
Through the cooperation of the pre-emphasis sub-module, the framing sub-module, the windowing sub-module, the first transformation sub-module and the Mel filtering sub-module, feature extraction is carried out on voice data to obtain first acoustic features, and the processing quality and efficiency of a subsequent ASR model are improved.
Still further, the feature extraction module 102 further includes a second transformation submodule;
and the second transformation submodule is used for carrying out discrete cosine transformation on the first acoustic feature to obtain a second acoustic feature.
And performing further feature extraction on the voice data through a second transformation submodule to obtain a second acoustic feature, and also improving the processing quality and efficiency of a subsequent ASR model.
The recognition module 103 is configured to input the acoustic features frame by frame into an ASR model for recognition processing, so as to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words;
the calculation module 104 is configured to perform a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their acoustic scores, and acquire corresponding hotword scores from a hot word graph based on the sentence combinations, where the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set;
further, the calculation module 104 includes a splitting sub-module, a basic diagram construction sub-module, a phonetic near character acquisition sub-module and a hot word diagram construction sub-module;
the splitting submodule is used for splitting the hot words in the preset hot word list to obtain the words to be processed;
the basic graph building submodule is used for building an arc line connecting each node in the hot word graph by using the words to be processed, and setting corresponding arc weights, wherein the words to be processed correspond to the arc lines in a one-to-one mode, and a plurality of words to be processed corresponding to hot words form a closed loop in the hot word graph;
the phonetic-near character acquisition submodule is used for determining corresponding phonetic-near characters from the phonetic-near confusion set based on the characters to be processed;
the hot word graph constructing submodule is used for utilizing the phonetic near characters to construct arc lines to be arranged at the first positions of the corresponding to-be-processed characters in the hot word graph, and setting the arc weights for the arc lines corresponding to the phonetic near characters.
The hot word graph is constructed by utilizing a preset hot word list and a sound near confusion set through the matching of the splitting sub-module, the basic graph constructing sub-module, the sound near character obtaining sub-module and the hot word graph constructing sub-module, so that the accuracy of subsequent hot word identification is improved.
Still further, the calculation module 104 further includes a back-off arc setting sub-module, a traversal sub-module, an association determination sub-module, a weight acquisition sub-module and a score calculation sub-module;
the back-off arc setting sub-module is used for setting a back-off arc on each node, wherein the back-off arc is an arc connecting the node to the initial node, and the weight corresponding to the back-off arc is the negative of the weight accumulated at the node;
the traversal sub-module is used for traversing the hot word graph in sequence based on the arrangement order of the words in the sentence combination;
the association determination sub-module is used for, when one or more words in the sentence combination have been determined, judging, according to the second position of the determined word or words in the hot word graph, whether the next word in the sentence combination is associated with an arc connected to that second position:
the weight acquisition sub-module is used for, if associated, acquiring the arc weight corresponding to the associated arc;
and, if not associated, returning to the initial node through the back-off arc corresponding to the second position and acquiring the arc weight corresponding to the back-off arc;
the score calculation sub-module is used for obtaining the hotword score based on the arc weights corresponding to the words in the sentence combination.
Through the cooperation of the back-off arc setting sub-module, the traversal sub-module, the association determination sub-module, the weight acquisition sub-module and the score calculation sub-module, the hot word graph is used to recognize the hot words, which improves the recognition accuracy of specific proper nouns in speech recognition and thus the accuracy of the overall speech recognition.
A determining module 105, configured to determine a recognition result based on the acoustic score and the hotword score.
Further, the determination module 105 includes a total score determination sub-module and a sentence determination sub-module;
the total score determining submodule is used for determining the total score of each sentence combination through the acoustic score and the hotword score;
and the sentence determination sub-module is used for taking the sentence combination with the highest total score as the recognition result.
And a final recognition result is obtained based on the acoustic score and the hotword score through the matching of the total score determining submodule and the sentence determining submodule, so that the accuracy of the voice recognition is improved.
Still further, the total score determining submodule includes a score calculating unit;
the score calculating unit is used for determining the total score of each sentence combination through the following formula;
y* = argmax[ log P(y|x) + λ log P_C(y) ], where y* represents the sentence combination with the highest total score, P(y|x) represents the acoustic score, P_C(y) represents the hotword score, and λ represents the hotword enhancement coefficient.
The score calculating unit calculates by using the formula, so that the calculation accuracy is improved, and the accuracy of voice recognition is improved.
With the above apparatus, through the cooperative use of the acquisition module 101, the feature extraction module 102, the recognition module 103, the calculation module 104 and the determination module 105, the speech recognition apparatus 100 based on a hot word graph acquires voice data and performs feature extraction on the voice data to obtain acoustic features, thereby preprocessing the voice data and improving the processing efficiency and quality of the subsequent steps; inputs the acoustic features into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; performs a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquires corresponding hotword scores from the hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and, by deriving the acoustic score and the hotword score respectively, determines the recognition result based on the acoustic score and the hotword score. The accuracy of speech recognition is improved.
The embodiment of the application also provides computer equipment. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 which are communicatively connected to each other via a system bus. It is noted that only the computer device 4 with components 41-43 is shown, but it should be understood that not all of the shown components are required to be implemented, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including flash memory, hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Programmable Read Only Memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both an internal storage unit of the computer device 4 and an external storage device thereof. In this embodiment, the memory 41 is generally used for storing an operating system and various types of application software installed in the computer device 4, such as computer readable instructions of a thermal word graph-based speech recognition method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, such as executing computer readable instructions of the thermal word graph-based speech recognition method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing a communication connection between the computer device 4 and other electronic devices.
In this embodiment, when the processor executes the computer-readable instructions stored in the memory, the steps of the speech recognition method based on a hot word graph according to the above embodiments are implemented: voice data is acquired and feature extraction is performed on the voice data to obtain acoustic features, thereby preprocessing the voice data and improving the processing efficiency and quality of the subsequent steps; the acoustic features are input into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; a beam search is performed on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and corresponding hotword scores are acquired from the hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and the recognition result is determined based on the acoustic score and the hotword score. The accuracy of speech recognition is improved.
The embodiments of the present application further provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the above speech recognition method based on a hot word graph: acquiring voice data and performing feature extraction on the voice data to obtain acoustic features, thereby preprocessing the voice data and improving the processing efficiency and quality of the subsequent steps; inputting the acoustic features into an ASR model for recognition processing to obtain a plurality of candidate words and the acoustic probabilities corresponding to the candidate words; performing a beam search on the candidate words and their acoustic probabilities corresponding to each frame to obtain a preset number of sentence combinations and their corresponding acoustic scores, and acquiring corresponding hotword scores from the hot word graph based on the sentence combinations, wherein the hot word graph is constructed based on a preset hot word list and a near-homophone confusion set; and determining the recognition result based on the acoustic score and the hotword score. The accuracy of speech recognition is improved.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
The speech recognition apparatus, computer device and computer-readable storage medium based on the hot word graph according to the above embodiments of the present application have the same technical effects as the speech recognition method based on the hot word graph according to the above embodiments, and are not expanded upon here.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the present application may be practiced without modification or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields, and all the equivalent structures are within the protection scope of the present application.

Claims (10)

1. A speech recognition method based on a hot word graph, the method comprising:
acquiring voice data;
performing feature extraction on the voice data to obtain acoustic features;
inputting the acoustic features frame by frame into an ASR model for recognition processing to obtain a plurality of candidate words and corresponding acoustic probabilities thereof;
performing beam search on the candidate words and the acoustic probabilities thereof corresponding to each frame to obtain a preset number of sentence combinations and corresponding acoustic scores thereof, and acquiring corresponding hotword scores from a hotword graph based on the sentence combinations, wherein the hotword graph is constructed based on a preset hotword list and a near-phonetic confusion set;
determining a recognition result based on the acoustic score and the hotword score.
2. The method of claim 1, wherein the constructing the hot word graph based on the preset hotword list and the near-phonetic confusion set comprises:
splitting the hot words in the preset hotword list to obtain words to be processed;
constructing arcs connecting the nodes in the hot word graph by using the words to be processed, and setting corresponding arc weights, wherein the words to be processed correspond to the arcs one to one, and the plurality of words to be processed corresponding to one hot word form a closed loop in the hot word graph;
determining corresponding near-phonetic characters from the near-phonetic confusion set based on the words to be processed;
and constructing arcs by using the near-phonetic characters, placing these arcs at a first position of the corresponding words to be processed in the hot word graph, and setting arc weights for the arcs corresponding to the near-phonetic characters.
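As a rough illustration of the construction in claim 2, the sketch below splits each hot word into single characters, adds one weighted arc per character, closes the loop of each hot word back to the initial node, and attaches near-phonetic arcs at the first position. The dictionary-based graph layout, the weights and the near_phonetic lookup are assumptions made for illustration; shared prefixes between hot words are not merged in this sketch.

from collections import defaultdict

def build_hotword_graph(hotwords, near_phonetic, unit_weight=1.0, near_weight=0.5):
    # graph[node][label] -> (next_node, arc_weight); node 0 is the initial node
    graph = defaultdict(dict)
    next_node = 1
    for word in hotwords:
        units = list(word)                        # split the hot word into single characters
        node = 0
        for i, unit in enumerate(units):
            is_last = i == len(units) - 1
            dst = 0 if is_last else next_node     # the last arc closes the loop to the initial node
            graph[node][unit] = (dst, unit_weight)
            if i == 0:                            # near-phonetic arcs are only added at the first position
                for variant in near_phonetic.get(unit, []):
                    graph[node][variant] = (dst, near_weight)
            if not is_last:
                node, next_node = next_node, next_node + 1
    return graph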
3. The method according to claim 2, further comprising, after the constructing arcs connecting the nodes in the hot word graph by using the words to be processed:
setting a back-off arc on each node, wherein the back-off arc is an arc connecting the node to an initial node, and the weight corresponding to the back-off arc is the negative of the weight accumulated at the node;
wherein the acquiring the corresponding hotword scores from the hotword graph based on the sentence combinations comprises:
traversing the hot word graph based on the order of the words in the sentence combination;
when one or more words in the sentence combination have been matched, determining, according to a second position of the matched words in the hot word graph, whether there is an arc connecting the next word in the sentence combination to the second position:
if such an arc exists, acquiring the arc weight corresponding to the arc;
if not, returning to the initial node through the back-off arc corresponding to the second position, and acquiring the arc weight corresponding to the back-off arc;
and obtaining the hotword score based on the arc weights corresponding to the characters in the sentence combination.
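The following sketch walks a sentence combination through the hypothetical graph built above: a matching arc contributes its weight, and a mismatch takes the back-off arc, whose weight is the negative of the weight accumulated since the initial node. A production decoder might additionally retry the mismatched word from the initial node; that refinement is omitted here.

def score_sentence(graph, sentence):
    node, acc, score = 0, 0.0, 0.0
    for word in sentence:
        arc = graph.get(node, {}).get(word)
        if arc is not None:                           # a matching arc exists
            nxt, weight = arc
            score += weight
            acc = 0.0 if nxt == 0 else acc + weight   # accumulation resets when a hot word completes
            node = nxt
        else:                                         # back off: the arc weight cancels the partial credit
            score -= acc
            node, acc = 0, 0.0
    return score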
4. The method according to claim 1, wherein the performing feature extraction on the speech data to obtain acoustic features comprises:
pre-emphasizing the voice data through a filter;
framing the pre-emphasized data to obtain multiple frames of data to be processed;
windowing the data to be processed;
performing fast Fourier transform on the windowed data to be processed to obtain an energy spectrum;
and performing feature extraction on the energy spectrum through a Mel filter bank to obtain a first acoustic feature.
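A rough sketch of the pipeline in claim 4 is given below: pre-emphasis, framing, Hamming windowing, a fast Fourier transform to an energy spectrum, and a Mel filter bank producing the first acoustic feature. The frame length, frame shift, pre-emphasis factor and the use of librosa for the Mel filter bank are illustrative choices rather than parameters prescribed by the present application.

import numpy as np
import librosa

def fbank_features(signal, sr=16000, frame_len=400, frame_shift=160, n_fft=512, n_mels=40):
    # 1. pre-emphasis through a first-order high-pass filter
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2. framing into overlapping frames (assumes the signal is at least one frame long)
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift: i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # 3. windowing with a Hamming window
    frames = frames * np.hamming(frame_len)
    # 4. fast Fourier transform to an energy (power) spectrum
    energy_spec = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / frame_len
    # 5. Mel filter bank -> first acoustic feature (log Mel energies)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return np.log(energy_spec @ mel_fb.T + 1e-10)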
5. The method of claim 4, wherein after the performing feature extraction on the energy spectrum through the Mel filter bank to obtain the first acoustic feature, the method further comprises:
and performing discrete cosine transform on the first acoustic feature to obtain a second acoustic feature.
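Continuing the same sketch for claim 5, a discrete cosine transform over the log Mel energies yields the second acoustic feature; the number of coefficients kept is an illustrative choice.

from scipy.fftpack import dct

def mfcc_from_fbank(fbank, n_ceps=13):
    # type-II orthonormal DCT along the filter-bank axis, keeping the leading coefficients
    return dct(fbank, type=2, axis=-1, norm="ortho")[:, :n_ceps]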
6. The method of claim 1, wherein determining the recognition result based on the acoustic score and the hotword score comprises:
determining a total score of each sentence combination through the acoustic score and the hotword score;
and taking the sentence combination with the highest total score as the recognition result.
7. The method of claim 6, wherein determining the overall score for each sentence combination from the acoustic score and the hotword score comprises:
determining a total score for each of the sentence combinations by the following formula;
y* = argmax( log P(y|x) + λ·log P_C(y) ), wherein y* represents the sentence combination with the highest total score, log P(y|x) represents the acoustic score of the sentence combination y, log P_C(y) represents the hotword score, and λ represents the hotword enhancement coefficient.
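As a worked reading of this formula, the snippet below computes the total score of two hypothetical sentence combinations (the words and scores are made up for illustration) and keeps the one with the highest total score.

candidates = {
    ("平", "安", "科", "技"): {"acoustic": -4.1, "hotword": 3.0},   # hotword path matched
    ("平", "安", "客", "气"): {"acoustic": -3.9, "hotword": 0.0},   # no hotword credit
}
lam = 0.3  # hotword enhancement coefficient (lambda)
totals = {sent: s["acoustic"] + lam * s["hotword"] for sent, s in candidates.items()}
best = max(totals, key=totals.get)   # the recognition result per claim 6
# -4.1 + 0.3 * 3.0 = -3.2 beats -3.9, so the hot-word-consistent combination is selected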
8. A device for speech recognition based on a hot word graph, the device comprising:
the acquisition module is used for acquiring voice data;
the feature extraction module is used for extracting features of the voice data to obtain acoustic features;
the recognition module is used for inputting the acoustic features frame by frame into an ASR model for recognition processing to obtain a plurality of candidate words and corresponding acoustic probabilities thereof;
the calculation module is used for performing beam search on the candidate words and the acoustic probabilities thereof corresponding to each frame to obtain a preset number of sentence combinations and corresponding acoustic scores thereof, and acquiring corresponding hotword scores from a hotword graph based on the sentence combinations, wherein the hotword graph is constructed based on a preset hotword list and a near-phonetic confusion set;
a determination module to determine a recognition result based on the acoustic score and the hotword score.
9. A computer device, characterized in that the computer device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores computer-readable instructions that, when executed by the processor, implement the speech recognition method based on the hot word graph according to any one of claims 1 to 7.
10. A computer-readable storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the speech recognition method based on the hot word graph according to any one of claims 1 to 7.
CN202210836962.5A 2022-07-15 2022-07-15 Voice recognition method, device and equipment based on hot word graph and storage medium Pending CN115101054A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210836962.5A CN115101054A (en) 2022-07-15 2022-07-15 Voice recognition method, device and equipment based on hot word graph and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210836962.5A CN115101054A (en) 2022-07-15 2022-07-15 Voice recognition method, device and equipment based on hot word graph and storage medium

Publications (1)

Publication Number Publication Date
CN115101054A true CN115101054A (en) 2022-09-23

Family

ID=83299670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210836962.5A Pending CN115101054A (en) 2022-07-15 2022-07-15 Voice recognition method, device and equipment based on hot word graph and storage medium

Country Status (1)

Country Link
CN (1) CN115101054A (en)

Similar Documents

Publication Publication Date Title
CN108564940B (en) Speech recognition method, server and computer-readable storage medium
CN112562691B (en) Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium
US11068571B2 (en) Electronic device, method and system of identity verification and computer readable storage medium
WO2020077896A1 (en) Method and apparatus for generating question data, computer device, and storage medium
CN108563636A (en) Extract method, apparatus, equipment and the storage medium of text key word
CN109584865B (en) Application program control method and device, readable storage medium and terminal equipment
CN111833845A (en) Multi-language speech recognition model training method, device, equipment and storage medium
CN109462482B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and computer readable storage medium
WO2021051514A1 (en) Speech identification method and apparatus, computer device and non-volatile storage medium
CN113345431B (en) Cross-language voice conversion method, device, equipment and medium
CN111694940A (en) User report generation method and terminal equipment
CN113314150A (en) Emotion recognition method and device based on voice data and storage medium
CN112037800A (en) Voiceprint nuclear model training method and device, medium and electronic equipment
CN108694952B (en) Electronic device, identity authentication method and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
CN112395391A (en) Concept graph construction method and device, computer equipment and storage medium
CN110390104B (en) Irregular text transcription method and system for voice dialogue platform
CN114999463A (en) Voice recognition method, device, equipment and medium
Dang et al. A method to reveal speaker identity in distributed asr training, and how to counter it
CN111933154B (en) Method, equipment and computer readable storage medium for recognizing fake voice
WO2021128847A1 (en) Terminal interaction method and apparatus, computer device, and storage medium
CN111694936A (en) Method and device for identifying AI intelligent interview, computer equipment and storage medium
CN115101054A (en) Voice recognition method, device and equipment based on hot word graph and storage medium
CN114913859B (en) Voiceprint recognition method, voiceprint recognition device, electronic equipment and storage medium
CN115242927A (en) Customer service object distribution method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination