CN112447169A - Word boundary estimation method and device and electronic equipment - Google Patents

Word boundary estimation method and device and electronic equipment

Info

Publication number
CN112447169A
CN112447169A (application CN201910832104.1A)
Authority
CN
China
Prior art keywords
word
time
wfst
recognition result
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910832104.1A
Other languages
Chinese (zh)
Other versions
CN112447169B (en)
Inventor
陈孝良
王江
冯大航
常乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority claimed from CN201910832104.1A
Publication of CN112447169A
Application granted
Publication of CN112447169B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a word boundary estimation method and apparatus and an electronic device. The method acquires voice data to be subjected to voice recognition; frames the voice data and extracts acoustic features from each frame; calculates, for each frame, the posterior probability of its acoustic features on each acoustic modeling unit; and searches a WFST model based on the posterior probabilities to obtain the recognition result of the voice data and the end-of-word time boundary of each word in the recognition result. The invention thus adds time boundary information to each word during the speech recognition process.

Description

Word boundary estimation method and device and electronic equipment
Technical Field
The invention relates to the field of voice recognition, in particular to a word boundary estimation method and device and electronic equipment.
Background
For a given speech signal, speech recognition technology can produce the corresponding text.
In some scenarios, however, it is desirable to attach an accurate time boundary to each word during recognition. For example, in a customer-service setting, inappropriate remarks by an agent can be found in the text transcribed by speech recognition; the position of the corresponding word in the recording can then be located quickly from the time boundary information attached to that word.
Disclosure of Invention
In view of the above, the present invention provides a word boundary estimation method and apparatus and an electronic device, to address the need to attach time boundary information to each word during speech recognition.
In order to solve the technical problems, the invention adopts the following technical scheme:
a word boundary estimation method, comprising:
acquiring voice data to be subjected to voice recognition;
framing the voice data, and extracting acoustic features of each frame of voice;
for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit;
searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the end-of-word time boundary of each word in the recognition result; the end-of-word time boundary is determined based on an empty edge, and the output of the empty edge is empty.
Optionally, searching in the WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model; in the WFST optimization model, the recognized end time of a word in the voice data may be inconsistent with the actual end time;
in the WFST optimization model searching process, the WFST output of the current word is stored in a token; the token comprises: outputting words and time information of the words;
judging whether the time boundary of the word end of the current word is determined or not;
and if the time boundary of the word end of the current word is determined, updating the content stored in the token.
Optionally, the determining the time boundary of the end of word of the current word includes:
acquiring a group of empty edges adjacent to the current word output;
and taking, as the end-of-word time boundary of the current word, the end time of the time information in the token corresponding to the last empty edge in the group whose output is empty.
Optionally, searching in a WFST optimization model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, further comprising:
judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
if not, returning to the step of saving the WFST output of the current word in a token during the WFST optimization model search.
Optionally, after determining that the time boundary of the end of word of each word in the recognition result of the speech data is determined, the method further includes:
selecting an output result in the token with the minimum cost in all tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
Optionally, the input of the margin is null.
Optionally, searching in the WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of words in the voice data is inconsistent with the actual ending time;
in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, the time of obtaining the speech recognition result is used as the time boundary of the end of word.
A word boundary estimating apparatus comprising:
the data acquisition module is used for acquiring voice data to be subjected to voice recognition;
the feature extraction module is used for framing the voice data and extracting the acoustic features of each frame of voice;
the probability calculation module is used for calculating the posterior probability of the acoustic features on each acoustic modeling unit for each frame of voice;
the time determining module is used for searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail part of each word in the recognition result; the end-of-word time boundary is determined based on a null edge; the empty edge output is empty.
Optionally, the time determination module includes:
the model obtaining submodule is used for obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of words in the voice data is inconsistent with the actual ending time;
the information storage submodule is used for storing the WFST output of the current word in a token during the WFST optimization model search; the token comprises: the output word and the time information of the word;
the first judgment submodule is used for judging whether the time boundary of the end of word of the current word is determined or not;
and the updating submodule is used for updating the content stored in the token if the time boundary of the end of word of the current word is determined.
Optionally, the first determining sub-module includes:
the empty edge acquisition unit is used for acquiring a group of empty edges which are adjacent to the current word output;
and the time determining unit is used for taking the tail time of the time information in the token corresponding to the last output empty edge in the group of empty edges as the word end time boundary of the current word.
Optionally, the time determination module further comprises:
the second judgment submodule is used for judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
and the information storage sub-module is further configured to store the WFST output of the current word in the token in the WFST optimization model search process if the second judgment sub-module judges that the time boundary of the end of word of each word in the recognition result is not determined.
Optionally, the time determination module further comprises:
the result determining submodule is used for selecting an output result in the token with the minimum cost in all the tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
Optionally, the time determining module is configured to, when searching in the WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, specifically:
acquiring a WFST optimization model in the WFST model, and in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, taking the time of the obtained speech recognition result as a word tail time boundary of the word; the WFST optimization model identifies that the end times of words in the speech data are not consistent with the actual end times.
An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring voice data to be subjected to voice recognition;
framing the voice data, and extracting acoustic features of each frame of voice;
for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit;
searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result; the end-of-word time boundary is determined based on a null edge; the empty edge output is empty.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a word boundary estimation method, a word boundary estimation device and electronic equipment, which are used for acquiring voice data to be subjected to voice recognition; framing the voice data, and extracting acoustic features of each frame of voice; for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit; and searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result. Namely, the invention can realize that time boundary information is added to each word in the speech recognition process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic diagram of the internal structure of a WFST model according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for estimating word boundaries according to an embodiment of the present invention;
FIG. 3 is a flowchart of another word boundary estimation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a word boundary estimation device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a word boundary estimation method that relies mainly on a Weighted Finite State Transducer (WFST) model. The WFST model is explained first.
The WFST model is generally composed of several basic modules:
1. An acoustic model; most speech recognition systems are modeled based on a first-order Hidden Markov Model (HMM). The acoustic model defines the acoustic modeling units. Generally, an HMM is composed of several states, which are the smallest modeling units of the acoustic model.
2. A pronunciation dictionary; the pronunciation dictionary contains a vocabulary set and pronunciations thereof that can be processed by the speech recognition system. The pronunciation dictionary actually provides a mapping of the acoustic model to the language model.
3. A language model; the language model models the language targeted by the speech recognition system and establishes the correlation between language vocabularies. In general, a regular language model or a statistical language model may be used as the speech recognition language model. In practical applications, the offline command word recognition system with limited resources is based on a regular language model, and the large vocabulary continuous speech recognition system is based on a statistical language model, including but not limited to an N-gram model, a recurrent neural network model, and the like.
4. A decoder; the decoder is one of the cores of a speech recognition system. Its task is to find, given the acoustic model, language model, and pronunciation dictionary, the word string that produces the input signal with maximum probability. The relationship between these modules can be understood more clearly from a mathematical point of view.
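As an illustration of module 3 above, a statistical language model can be sketched in a few lines. This is a toy bigram model, not the patent's model; the class name and the add-one smoothing are assumptions chosen for brevity.

```python
from collections import defaultdict

class BigramLM:
    """Toy statistical language model: P(w_i | w_{i-1}) with add-one smoothing."""
    def __init__(self, sentences):
        self.bigram = defaultdict(int)
        self.unigram = defaultdict(int)
        self.vocab = set()
        for s in sentences:
            words = ["<s>"] + s.split() + ["</s>"]
            self.vocab.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigram[(prev, cur)] += 1
                self.unigram[prev] += 1

    def prob(self, prev, cur):
        # Add-one (Laplace) smoothing: unseen bigrams keep nonzero probability.
        return (self.bigram[(prev, cur)] + 1) / (self.unigram[prev] + len(self.vocab))

lm = BigramLM(["we eat rice", "we eat noodles"])
```

A large-vocabulary system would instead use N-grams of higher order or a recurrent network, as the text notes, but the idea of scoring word-to-word transitions is the same.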
In the embodiment of the present invention, a classification model of a modeling unit in an acoustic model modeled by GMM (gaussian mixture model) and DNN (deep neural networks) may be used.
HMM (hidden markov model) models are widely used for acoustic modeling of large vocabulary continuous speech recognition systems because they can describe the time-varying and short-time stationarity of speech well.
The present invention further improves the existing WFST so that it can recognize the end-of-word time boundary of each word in the speech data.
Referring to fig. 1, a WFST is a weighted finite state transducer used for large-scale speech recognition; each transition is labeled with an input symbol and an output symbol. The constructed network (WFST) thus maps a sequence of input symbols, or an input string, to an output string. In addition to input and output symbols, a WFST assigns a weight to each state transition. The weight may encode a probability, a duration, or any other number accumulated along a path, such as the 0.5 shown in the figure, to compute the overall weight of mapping an input string to an output string. In speech processing, a WFST represents the possible path choices, and their corresponding probabilities, for producing recognition results from an input speech signal.
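The arc structure just described can be sketched as follows; the field names and the tiny example path are illustrative assumptions, not the patent's data structures. An empty string stands in for an epsilon (empty) label.

```python
from dataclasses import dataclass

@dataclass
class Arc:
    src: int        # source state
    dst: int        # destination state
    ilabel: str     # input symbol; "" denotes an empty (epsilon) input
    olabel: str     # output symbol; "" denotes an empty output
    weight: float   # cost accumulated along a path (e.g., a negative log-probability)

def path_output_and_cost(arcs):
    """Map a path through the transducer to its output string and total weight."""
    out = [a.olabel for a in arcs if a.olabel]
    cost = sum(a.weight for a in arcs)
    return " ".join(out), cost

# A three-arc path: the word "we" is output, followed by edges with empty output.
path = [Arc(0, 1, "w", "we", 0.5), Arc(1, 2, "iy", "", 0.2), Arc(2, 3, "", "", 0.1)]
```

The last arc has both input and output empty, the "empty edge" the patent uses to mark pauses and word ends.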
Referring to fig. 2, the word boundary estimating method may include:
and S11, acquiring acoustic characteristics of the voice data to be subjected to voice recognition.
In particular implementations, a user may input voice data through an electronic device configured with a sound card device such as a microphone.
The electronic device may be a mobile device, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device (such as glasses, a watch, and the like), or a fixed device, such as a personal computer, a smart television, a smart home/household appliance (such as an air conditioner, an electric cooker), and the like, which is not limited in this embodiment of the present invention.
S12, framing the voice data, and extracting the acoustic features of each frame of voice.
After the voice data is acquired, it is framed, and the acoustic features of each frame of speech are extracted. The acoustic features may include MFCC, Fbank, and the like.
And S13, calculating the posterior probability of the acoustic features on each acoustic modeling unit for each frame of voice.
In this embodiment, the posterior probability of each frame of speech on each acoustic modeling unit is estimated by a deep neural network (DNN). The DNN is trained on a large amount of data; its input is the acoustic features, and its output is the posterior probabilities. The posterior probability serves as the weight of a WFST edge and is used to find the optimal path.
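The posterior computation can be illustrated with a single affine layer followed by a softmax; a real acoustic DNN is much deeper, and the layer sizes here (40-dimensional features, 1000 acoustic units) are arbitrary assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stand-in for a trained acoustic network: one affine layer over a 40-dim feature.
rng = np.random.default_rng(0)
W = rng.standard_normal((40, 1000)) * 0.01
b = np.zeros(1000)

def posteriors(feature):
    """Posterior probability of each acoustic modeling unit for one frame."""
    return softmax(feature @ W + b)

p = posteriors(rng.standard_normal(40))
```

The output is a proper distribution over the modeling units: non-negative and summing to one, which is what the WFST search consumes as edge weights.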
And S14, searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
The WFST model in this embodiment is the WFST model described above. The end-of-word time boundary is determined based on an empty edge, whose output is empty; the end time of the last such empty edge is the end-of-word time boundary of the current word. If there is no pause before the end of a word, i.e., the words are spoken continuously (as in "we eat" with no pause between the words), the input of the empty edge is not empty. If there is a pause between the words, for example a one-second pause, the input of the empty edge is also empty.
In the embodiment of the present disclosure, when the output is empty during the search in the WFST model:
when the input is empty, it is likely to be a pause in speech; the suffix can be determined by the input not being empty, possibly by nonsense speech words (like pauses), or by redundant end-tones after valid information is recognized. In general, if a pause is identified and a complete word is identified before the pause by the WFST model, the end of the word can be identified; if the redundant tail tone after the valid information is identified, it is likely that the voice information is identified in advance in the WFST optimization algorithm, that is, the input is not null, which is common in the WFST optimization algorithm, and the specific method for confirming the time of the tail word is described in the following scheme.
In this embodiment, voice data to be subjected to voice recognition is acquired; framing the voice data, and extracting acoustic features of each frame of voice; for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit; and searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result. Namely, the invention can realize that time boundary information is added to each word in the speech recognition process.
In addition, one approach to word time boundary estimation during recognition is to build an HCLG decoding graph, store the word-lattice information at each moment during decoding, and, after decoding completes, trace back through the lattice to obtain the recognition result and its time boundary information. Specifically, searching in the WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result comprises:
and acquiring a WFST optimization model in the WFST model, and in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, taking the time of the obtained speech recognition result as the time boundary of the word at the end of the word. The WFST model comprises the WFST optimization model, the WFST optimization model does not need to completely carry out Viterbi search on each frame of a voice, namely, a final result can be obtained without searching an actual voice suffix, and voice recognition is realized. That is, the WFST optimization model recognizes that the ending time of the words in the speech data is inconsistent with the actual ending time, and the ending time of the recognized words is generally earlier than the actual ending time. The optimization algorithm used by the WFST optimization model may be an output push algorithm, or a weight push algorithm. Optimization operations in WFST, including null shift elimination (relocation), determinization (subtraction), weight shifting (weight shifting), and minimization (subtraction). In this embodiment, the weight shifting (weight shifting) is generated for the time shift.
For example, suppose the speech is "we" (Chinese "wǒmen"); when the final sound of "men" is recognized during decoding, the speech "we" is recognized as the text "we", and that moment is taken as the end-of-word time boundary of "we". In practice, however, the speech "we" has not yet ended; the output on the WFST edge has been shifted forward, so the marked end time is not the true end time.
In addition, with this lattice-based decoding method, for long recognition tasks the generated lattice grows over time, the memory consumed by the system increases, and the time spent collecting dead paths in the lattice also gradually increases. The invention therefore removes lattice generation from the decoding process: during token passing, the decoding result is stored directly in the token. When a word is output, the current time is marked as that word's end time; but rather than keeping that first time as the final end time, the time boundary of the current word is updated frame by frame in real time. As long as the current output is empty, the current time replaces the word's time boundary, until the next word is encountered.
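The token update rule just described can be sketched as follows; the `Token` fields and the `advance` function are illustrative names, not the patent's implementation. A non-empty output starts a new word; while the output stays empty, the previous word's end time keeps advancing until the next word appears.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    cost: float = 0.0
    words: list = field(default_factory=list)      # recognized words so far
    end_times: list = field(default_factory=list)  # end-of-word boundary per word

def advance(token, frame_time, olabel, weight):
    """Apply one WFST arc to a token (sketch of the rule in the text)."""
    token.cost += weight
    if olabel:                    # arc outputs a word: record a provisional boundary
        token.words.append(olabel)
        token.end_times.append(frame_time)
    elif token.end_times:         # empty output: push the last word's boundary forward
        token.end_times[-1] = frame_time
    return token

t = Token()
for time, out, w in [(0.3, "we", 0.5), (0.4, "", 0.1), (0.5, "", 0.1), (0.9, "eat", 0.4)]:
    advance(t, time, out, w)
```

After the loop, the boundary of "we" has settled on the last empty edge before "eat" appeared, which is exactly the correction the patent makes for the early end times produced by weight pushing.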
In another embodiment of the present invention, the detailed description of step S14 "search in WFST model based on the posterior probability to obtain the recognition result of the speech data and the time boundary of the end of word in the recognition result" with reference to fig. 3 specifically includes:
s24, obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the end times of words in the speech data are not consistent with the actual end times.
S25, in the process of WFST optimization model search, the WFST output of the current word is stored in the token.
The WFST model comprises the WFST optimization model, and the WFST optimization model identifies that an end time of a word in the speech data is inconsistent with an actual end time. The WFST optimization model is the WFST optimization model described above.
The token comprises the output word and the time information of the word; that is, in this embodiment words and their time information are propagated by token passing, and a token stores the recognition result and recognition time of the currently recognized word.
There may be multiple tokens: when searching the WFST, one token is configured for each search path to record time information, so the number of tokens equals the number of search paths. When one WFST step completes, only the one or more most probable tokens among all possibilities are retained, so the number of tokens changes dynamically. Finally, at output time, the token corresponding to the highest-probability (i.e., lowest-cost) edge is selected, and the information in it is taken out as the final recognition result and the corresponding end-of-word information.
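Selecting the final result from the surviving tokens, per the paragraph above, reduces to a minimum-cost choice; the dict layout here is an assumption for illustration.

```python
def best_result(tokens):
    """Pick the lowest-cost token; its stored words and end times are the result."""
    best = min(tokens, key=lambda t: t["cost"])
    return list(zip(best["words"], best["end_times"]))

# Two surviving search paths for the same utterance; the cheaper one wins.
tokens = [
    {"cost": 3.2, "words": ["we", "eat"], "end_times": [0.5, 0.9]},
    {"cost": 2.7, "words": ["we", "eat"], "end_times": [0.6, 0.9]},
]
```

Note that competing paths can disagree on boundaries even when they agree on the words, which is why the boundary is read from the winning token rather than computed separately.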
Token passing is also called 'label transmission', and a control method for local network data transmission is mostly used for ring networks.
A token consists of a dedicated block of information; a typical token is a sequence of eight "1" bits. When all nodes of the network are idle, the token is passed from one node to the next. When a node wants to send information, it must obtain the token and take it off the network before sending. Once the data has been transferred, the token is forwarded to the next node; each node has means to send and receive the token. Collisions never occur with this method, because only one node can transmit data at any given moment. The biggest problem is that the token may be lost or corrupted in transit, so that no node can find it and information cannot be transmitted.
During the search, the WFST optimization model may be searched using the Viterbi algorithm.
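A compact Viterbi sketch over a frame-by-state matrix of log-probabilities; this illustrates only the search idea and omits WFST composition, pruning, and token passing.

```python
import numpy as np

def viterbi(log_probs, transitions):
    """Best state path given (T, S) per-frame log-probs and (S, S) log transitions."""
    T, S = log_probs.shape
    score = log_probs[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions      # cand[prev, cur]
        back[t] = cand.argmax(axis=0)            # best predecessor of each state
        score = cand.max(axis=0) + log_probs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                # trace back through predecessors
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 states over 3 frames; the best path switches state at the end.
obs = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]))
trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
best_path = viterbi(obs, trans)
```

Working in log space turns the products of probabilities along a path into sums, matching the accumulated edge weights described for the WFST.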
S26, judging whether the time boundary of the end of word of the current word is determined; if yes, go to step S27.
In practical application, the process of determining the time boundary of the end of word of the current word may be:
and acquiring a group of hollow edges output next to the current word, and taking the tail time of the time information in the token corresponding to the edge with the last output as the hollow edge in the group of hollow edges as the word end time boundary of the current word.
When a user habitually pauses after a word while speaking, invalid speech is produced at that moment; in the WFST optimization model this invalid speech is represented by an empty edge. For example, in "we eat" there may be a pause between "we" and "eat", and that pause time can serve as the end time. In the WFST optimization model, the group of empty edges with empty output adjacent to each word is searched for; after finding the group of empty edges adjacent to the word (there may be only one), the last edge in the group whose input is not empty and whose output is empty is obtained, and its end time is taken as the end-of-word time boundary. For example, suppose "we" is followed by an empty edge that continues for a period of time before the next word appears; the last moment before the next word appears, i.e., the end time of the empty edge, is taken as the end-of-word time boundary of "we".
That is, to determine that an empty edge in the WFST model marks the end of a word, the position of the empty edge whose input is not empty and whose output is empty must be found: among the empty edges immediately following the current word, the last one whose input is not empty and whose output is empty marks the end of the word.
Finally, that last empty edge, with non-empty input and empty output, is selected as the word end, and its time information is stored in the token as the end-of-word time.
And S27, updating the content stored in the token.
If the end-of-word time boundary of the current word is determined, the token is released, and the content in the token is updated to the word to be recognized next and the time information of that word.
S28, judging whether the time boundary of the end of word of each word in the voice data is determined or not; if so, go to step S29, otherwise, go back to step S25.
S29, selecting the WFST output in the minimum-cost token among all tokens at the current moment as the speech recognition result of the voice data.
The speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
When the end-of-word time boundary of every word in the voice data has been determined, the WFST output in the minimum-cost token among all tokens at the current moment is selected as the speech recognition result of the voice data; when it has not been determined, recognition continues with the next speech.
In the embodiment, more accurate word boundary information can be obtained, and the problem of inaccurate word boundary record is solved.
It should be noted that, for the specific implementation process of steps S21-23 in this embodiment, please refer to the corresponding description in the foregoing embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the word boundary estimation method, another embodiment of the present invention provides a word boundary estimation apparatus, referring to fig. 4, which may include:
a data acquisition module 101, configured to acquire voice data to be subjected to voice recognition;
a feature extraction module 102, configured to frame the voice data and extract an acoustic feature of each frame of voice;
a probability calculation module 103, configured to calculate, for each frame of speech, posterior probabilities of the acoustic features on the acoustic modeling units;
a time determining module 104, configured to search in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a time boundary of an end of word of each word in the recognition result; the end-of-word time boundary is determined based on an empty edge; the output of the empty edge is empty. In addition, the input of the empty edge may also be empty.
The time determination module is configured to, when searching in the WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, specifically:
acquiring a WFST optimization model in the WFST model, and in the WFST optimization model search process, if the speech recognition result of a word in the speech data is recognized, taking the time at which the speech recognition result is obtained as the word-end time boundary of that word; in the WFST optimization model, the end time at which a word in the speech data is recognized is inconsistent with the word's actual end time.
In this embodiment, voice data to be subjected to voice recognition is acquired; framing the voice data, and extracting acoustic features of each frame of voice; for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit; and searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result. Namely, the invention can realize that time boundary information is added to each word in the speech recognition process.
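Purely as an illustration of the front end described above (framing, per-frame features, posteriors over acoustic modeling units), the following sketch uses toy spectral features and a random stand-in for the acoustic model — it is not the patent's implementation:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)            # 1 s of dummy 16 kHz audio
frames = frame_signal(wave)                  # step 1: framing
feats = np.abs(np.fft.rfft(frames))[:, :40]  # step 2: toy spectral features
W = rng.standard_normal((40, 100))           # stand-in acoustic model
post = softmax(feats @ W)                    # step 3: posteriors over 100 units
print(frames.shape, post.shape)              # (98, 400) (98, 100)
```

The posteriors produced per frame would then drive the WFST search that assigns each output word its end-of-word time boundary.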
It should be noted that, for the working process of each module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
In another embodiment of the present invention, the time determination module includes:
the model obtaining submodule is used for obtaining a WFST optimization model in the WFST model; in the WFST optimization model, the end time at which a word in the voice data is recognized is inconsistent with the word's actual end time;
the information storage submodule is used for storing WFST output of the current word in the token in the WFST optimization model searching process; the token comprises: outputting words and time information of the words;
the first judgment submodule is used for judging whether the time boundary of the end of word of the current word is determined or not;
and the updating submodule is used for updating the content stored in the token if the time boundary of the end of word of the current word is determined.
Further, the first judgment sub-module includes:
the empty edge acquisition unit is used for acquiring a group of empty edges which are adjacent to the current word output;
and the time determining unit is used for taking the tail time of the time information in the token corresponding to the last output empty edge in the group of empty edges as the word end time boundary of the current word.
The time determination module further comprises:
the second judgment submodule is used for judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
and the information storage sub-module is further configured to store the WFST output of the current word in the token in the WFST optimization model search process if the second judgment sub-module judges that the time boundary of the end of word of each word in the recognition result is not determined.
The time determination module further comprises:
the result determining submodule is used for selecting an output result in the token with the minimum cost in all the tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
In this embodiment, more accurate word boundary information can be obtained, and the problem of inaccurate word boundary records is solved.
It should be noted that, for the working processes of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiments, which is not described herein again.
Optionally, on the basis of the above embodiment of the word boundary estimation method, another embodiment of the present invention provides an electronic device, including: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring voice data to be subjected to voice recognition;
framing the voice data, and extracting acoustic features of each frame of voice;
for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit;
searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result; the end-of-word time boundary is determined based on a null edge; the empty edge output is empty.
Further, based on the posterior probability, searching in a WFST model to obtain the recognition result of the voice data and the end-of-word time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model; in the WFST optimization model, the end time at which a word in the voice data is recognized is inconsistent with the word's actual end time;
in the WFST optimization model searching process, the WFST output of the current word is stored in a token; the token comprises: outputting words and time information of the words;
judging whether the time boundary of the word end of the current word is determined or not;
and if the time boundary of the word end of the current word is determined, updating the content stored in the token.
Further, the determining the time boundary of the end of word of the current word includes:
acquiring a group of empty edges adjacent to the current word output;
and taking the tail time of the time information in the token corresponding to the last empty-output edge in the group of empty edges as the word-end time boundary of the current word.
Further, based on the posterior probability, searching in a WFST optimization model to obtain the recognition result of the voice data and the end-of-word time boundary of each word in the recognition result, further comprising:
judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
if not, returning to the step of saving the WFST output of the current word in the token during the WFST optimization model search process.
Further, after determining that the time boundary of the end of word of each word in the recognition result of the voice data is determined, the method further includes:
selecting an output result in the token with the minimum cost in all tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
Further, the input of the empty edge is empty.
Further, based on the posterior probability, searching in a WFST model to obtain the recognition result of the voice data and the end-of-word time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model; in the WFST optimization model, the end time at which a word in the voice data is recognized is inconsistent with the word's actual end time;
in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, the time of obtaining the speech recognition result is used as the time boundary of the end of word.
In this embodiment, voice data to be subjected to voice recognition is acquired; framing the voice data, and extracting acoustic features of each frame of voice; for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit; and searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result. Namely, the invention can realize that time boundary information is added to each word in the speech recognition process.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A word boundary estimation method, comprising:
acquiring voice data to be subjected to voice recognition;
framing the voice data, and extracting acoustic features of each frame of voice;
for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit;
searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result; the end-of-word time boundary is determined based on a null edge; the empty edge output is empty.
2. The method of claim 1, wherein searching in WFST model based on the posterior probability to obtain the recognition result of the speech data and the time boundary of the end of word in the recognition result comprises:
obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of words in the voice data is inconsistent with the actual ending time;
in the WFST optimization model searching process, storing the WFST output of the current word in a token; the token comprises: outputting words and time information of the words;
judging whether the time boundary of the word end of the current word is determined or not;
and if the time boundary of the word end of the current word is determined, updating the content stored in the token.
3. The method of estimating word boundaries as claimed in claim 2, wherein said determining the time boundaries of the end of word of the current word comprises:
acquiring a group of empty edges adjacent to the current word output;
and taking the tail time of the time information in the token corresponding to the last empty-output edge in the group of empty edges as the word-end time boundary of the current word.
4. The method of claim 3, wherein the step of searching in WFST model based on the posterior probability to obtain the recognition result of the speech data and the time boundary of the end of word of each word in the recognition result further comprises:
judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
if not, returning to the step of saving the WFST output of the current word in the token during the WFST optimization model search process.
5. The method of claim 4, wherein if the time boundary of the end of word of each word in the recognition result of the speech data is determined, the method further comprises:
selecting an output result in the token with the minimum cost in all tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
6. The word boundary estimation method according to any one of claims 1 to 5, wherein an input of the null edge is null.
7. The method of claim 1, wherein searching in WFST model based on the posterior probability to obtain the recognition result of the speech data and the time boundary of the end of word in the recognition result comprises:
obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of words in the voice data is inconsistent with the actual ending time;
in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, the time of obtaining the speech recognition result is used as the time boundary of the word end.
8. A word boundary estimating apparatus, characterized by comprising:
the data acquisition module is used for acquiring voice data to be subjected to voice recognition;
the feature extraction module is used for framing the voice data and extracting the acoustic features of each frame of voice;
the probability calculation module is used for calculating the posterior probability of the acoustic features on each acoustic modeling unit for each frame of voice;
the time determining module is used for searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail part of each word in the recognition result; the end-of-word time boundary is determined based on a null edge; the empty edge output is empty.
9. The word boundary estimation apparatus of claim 8, wherein the time determination module comprises:
the model obtaining submodule is used for obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of words in the voice data is inconsistent with the actual ending time;
the information storage submodule is used for storing WFST output of the current word in the token in the WFST optimization model searching process; the token comprises: outputting words and time information of the words;
the first judgment submodule is used for judging whether the time boundary of the end of word of the current word is determined or not;
and the updating submodule is used for updating the content stored in the token if the time boundary of the end of word of the current word is determined.
10. The word boundary estimation apparatus according to claim 9, wherein the first judgment sub-module includes:
the empty edge acquisition unit is used for acquiring a group of empty edges which are adjacent to the current word output;
and the time determining unit is used for taking the tail time of the time information in the token corresponding to the last output empty edge in the group of empty edges as the word end time boundary of the current word.
11. The word boundary estimation apparatus of claim 10, wherein the time determination module further comprises:
the second judgment submodule is used for judging whether the time boundary of the end of word of each word in the recognition result is determined or not;
and the information storage sub-module is further configured to store the WFST output of the current word in the token in the WFST optimization model search process if the second judgment sub-module judges that the time boundary of the end of word of each word in the recognition result is not determined.
12. The word boundary estimation apparatus of claim 11, wherein the time determination module further comprises:
the result determining submodule is used for selecting an output result in the token with the minimum cost in all the tokens at the current moment as a voice recognition result of the voice data; the speech recognition result includes: and the recognition result of the voice data and the time boundary of the tail of each word in the recognition result.
13. The apparatus according to claim 8, wherein the time determination module is configured to, when searching in a WFST model based on the posterior probability to obtain the recognition result of the speech data and the end-of-word time boundary of each word in the recognition result, specifically:
acquiring a WFST optimization model in the WFST model, and in the WFST optimization model searching process, if a speech recognition result of a word in speech data is recognized, taking the time of the obtained speech recognition result as a word tail time boundary of the word; the WFST optimization model identifies that the end times of words in the speech data are not consistent with the actual end times.
14. An electronic device, comprising: a memory and a processor;
wherein the memory is used for storing programs;
the processor calls a program and is used to:
acquiring voice data to be subjected to voice recognition;
framing the voice data, and extracting acoustic features of each frame of voice;
for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit;
searching in a WFST model based on the posterior probability to obtain the recognition result of the voice data and the time boundary of the tail of each word in the recognition result; the end-of-word time boundary is determined based on a null edge;
the empty edge output is empty.
CN201910832104.1A 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment Active CN112447169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910832104.1A CN112447169B (en) 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112447169A true CN112447169A (en) 2021-03-05
CN112447169B CN112447169B (en) 2024-04-19


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110099013A1 (en) * 2009-10-23 2011-04-28 At&T Intellectual Property I, L.P. System and method for improving speech recognition accuracy using textual context
US20140379345A1 (en) * 2013-06-20 2014-12-25 Electronic And Telecommunications Research Institute Method and apparatus for detecting speech endpoint using weighted finite state transducer
CN105825849A (en) * 2016-04-06 2016-08-03 普强信息技术(北京)有限公司 Time position keyword hit analysis method based on identification result time boundary
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN106448660A (en) * 2016-10-31 2017-02-22 闽江学院 Natural language fuzzy boundary determining method with introduction of big data analysis
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined


Non-Patent Citations (1)

Title
丁佳伟等: "WFST解码器词图生成算法...的非活跃节点检测与内存优化", 中国科学院大学学报, vol. 36, no. 1, 31 January 2019 (2019-01-31), pages 109 *


Similar Documents

Publication Publication Date Title
CN108831439B (en) Voice recognition method, device, equipment and system
CN109036391B (en) Voice recognition method, device and system
CN108899013B (en) Voice search method and device and voice recognition system
US6178401B1 (en) Method for reducing search complexity in a speech recognition system
CN108735201B (en) Continuous speech recognition method, device, equipment and storage medium
CN112420026B (en) Optimizing keyword retrieval system
CN106875936B (en) Voice recognition method and device
US10152298B1 (en) Confidence estimation based on frequency
CN111402891A (en) Speech recognition method, apparatus, device and storage medium
CN111445898B (en) Language identification method and device, electronic equipment and storage medium
EP3739583A1 (en) Dialog device, dialog method, and dialog computer program
CN112242144A (en) Voice recognition decoding method, device and equipment based on streaming attention model and computer readable storage medium
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
EP4392972A1 (en) Speaker-turn-based online speaker diarization with constrained spectral clustering
CN111489754A (en) Telephone traffic data analysis method based on intelligent voice technology
CN113793599B (en) Training method of voice recognition model, voice recognition method and device
KR20190032868A (en) Method and apparatus for voice recognition
CN111640423B (en) Word boundary estimation method and device and electronic equipment
KR20240065125A (en) Large-scale language model data selection for rare word speech recognition.
CN112151020A (en) Voice recognition method and device, electronic equipment and storage medium
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN113571045A (en) Minnan language voice recognition method, system, equipment and medium
CN110809796B (en) Speech recognition system and method with decoupled wake phrases
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN112447169B (en) Word boundary estimation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant