CN112447169B - Word boundary estimation method and device and electronic equipment - Google Patents


Publication number: CN112447169B
Authority: CN (China)
Prior art keywords: word, wfst, boundary, recognition result, time
Legal status: Active
Application number: CN201910832104.1A
Other languages: Chinese (zh)
Other versions: CN112447169A
Inventors: 陈孝良 (Chen Xiaoliang), 王江 (Wang Jiang), 冯大航 (Feng Dahang), 常乐 (Chang Le)
Current Assignee: Beijing SoundAI Technology Co Ltd
Original Assignee: Beijing SoundAI Technology Co Ltd
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN201910832104.1A
Publication of CN112447169A
Application granted
Publication of CN112447169B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search


Abstract

The invention provides a word boundary estimation method, a word boundary estimation device, and electronic equipment. Voice data to be subjected to voice recognition is acquired; the voice data is divided into frames and the acoustic features of each frame of voice are extracted; for each frame of voice, the posterior probability of its acoustic features on each acoustic modeling unit is calculated; a search is then performed in a WFST model based on the posterior probabilities to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result. In this way, the method and device can attach time boundary information to each word during the voice recognition process.

Description

Word boundary estimation method and device and electronic equipment
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a word boundary estimation method, apparatus, and electronic device.
Background
Word boundary estimation belongs to the technical field of speech recognition. For a given piece of speech signal, the corresponding text information can be obtained by voice recognition technology.
In some specific scenarios, however, accurate time boundary information needs to be attached to each word during recognition. For example, in a customer service scenario, if a customer service agent is found to have used inappropriate language in the text transcribed by voice recognition technology, the position of the corresponding words in the recording can be quickly located from the time boundary information attached to those words.
Disclosure of Invention
In view of the above, the present invention provides a word boundary estimation method, apparatus and electronic device, so as to solve the problem of adding time boundary information to each word in the voice recognition process.
In order to solve the technical problems, the invention adopts the following technical scheme:
a word boundary estimation method, comprising:
Acquiring voice data to be subjected to voice recognition;
Framing the voice data and extracting the acoustic characteristics of each frame of voice;
for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit;
Searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges, and the output of a blank edge is blank.
Optionally, searching in the WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model, wherein the ending time of a word in the voice data identified by the WFST optimization model is inconsistent with the actual ending time;
During the WFST optimization model searching process, storing WFST output of the current word in a token; the token comprises: outputting words and time information of the output words;
judging whether to determine the word tail time boundary of the current word;
and if the word tail time boundary of the current word is determined, updating the content stored in the token.
Optionally, the determining the word ending time boundary of the current word includes:
acquiring a group of blank edges immediately following the output of the current word;
and taking the end time of the time information in the token corresponding to the last blank-output edge in the group as the word tail time boundary of the current word.
Optionally, based on the posterior probability, searching in a WFST optimization model to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, and further including:
judging whether the word tail time boundary of each word in the recognition result is determined or not;
If not, returning to the step of storing the WFST output of the current word in the token in the WFST optimization model searching process.
Optionally, after determining that the end time boundary of each word in the recognition result of the voice data has been determined, the method further includes:
Selecting the output result in the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
Optionally, the input of the blank edge is blank.
Optionally, searching in the WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, including:
obtaining a WFST optimization model in the WFST model, wherein the ending time of a word in the voice data identified by the WFST optimization model is inconsistent with the actual ending time;
In the WFST optimization model searching process, if the voice recognition result of the word in the voice data is recognized, the time for obtaining the voice recognition result is taken as the word tail time boundary of the word.
A word boundary estimation apparatus comprising:
the data acquisition module is used for acquiring voice data to be subjected to voice recognition;
The feature extraction module is used for framing the voice data and extracting the acoustic feature of each frame of voice;
the probability calculation module is used for calculating posterior probability of the acoustic features on each acoustic modeling unit for each frame of voice;
the time determining module is used for searching in the WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges, and the output of a blank edge is blank.
Optionally, the time determining module includes:
the model acquisition sub-module is used for acquiring a WFST optimization model in the WFST model, wherein the ending time of a word in the voice data identified by the WFST optimization model is inconsistent with the actual ending time;
the information storage sub-module is used for storing the WFST output of the current word in the token during the WFST optimization model search; the token comprises the output word and the time information of the output word;
The first judging submodule is used for judging whether the word tail time boundary of the current word is determined or not;
And the updating sub-module is used for updating the content stored in the token if the word tail time boundary of the current word is determined.
Optionally, the first judging submodule includes:
the blank edge obtaining unit is used for obtaining a group of blank edges immediately following the output of the current word;
and the time determining unit is used for taking the end time of the time information in the token corresponding to the last blank-output edge in the group as the word tail time boundary of the current word.
Optionally, the time determining module further includes:
The second judging sub-module is used for judging whether the word tail time boundary of each word in the recognition result is determined or not;
and the information storage sub-module is further used for storing the WFST output of the current word in the token during the WFST optimization model search if the second judging sub-module judges that the word tail time boundary of each word in the recognition result has not been determined.
Optionally, the time determining module further includes:
The result determination submodule is used for selecting the output result in the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
Optionally, the time determining module is configured to search in a WFST model based on the posterior probability, and when obtaining the recognition result of the speech data and the word tail time boundary of each word in the recognition result, specifically configured to:
acquiring a WFST optimization model in the WFST model, and, during the WFST optimization model search, if a voice recognition result of a word in the voice data is recognized, taking the time at which the voice recognition result is obtained as the word tail time boundary of the word; the ending time of a word in the voice data identified by the WFST optimization model is inconsistent with the actual ending time.
An electronic device, comprising: a memory and a processor;
Wherein the memory is used for storing programs;
The processor invokes the program and is configured to:
Acquiring voice data to be subjected to voice recognition;
Framing the voice data and extracting the acoustic characteristics of each frame of voice;
for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit;
Searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges, and the output of a blank edge is blank.
Compared with the prior art, the invention has the following beneficial effects:
The invention provides a word boundary estimation method, a word boundary estimation device, and electronic equipment. Voice data to be subjected to voice recognition is acquired; the voice data is divided into frames and the acoustic features of each frame of voice are extracted; for each frame of voice, the posterior probability of its acoustic features on each acoustic modeling unit is calculated; a search is then performed in a WFST model based on the posterior probabilities to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result. In this way, the method and device can attach time boundary information to each word during the voice recognition process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or of the prior art are briefly introduced below. The drawings in the following description show only embodiments of the present invention; other drawings can be obtained from the provided drawings by a person skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of the internal structure of a WFST model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a word boundary estimation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another word boundary estimation method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a word boundary estimating device according to an embodiment of the present invention.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The embodiment of the invention provides a word boundary estimation method that mainly depends on a weighted finite-state transducer (WFST) model; the WFST model is explained first.
The WFST model is typically composed of several basic blocks:
1. An acoustic model. Speech recognition systems are often modeled using first-order hidden Markov models (HMMs). The acoustic model defines generalized acoustic modeling units; in general, an HMM is composed of a number of states, which are the smallest modeling units of the acoustic model.
2. A pronunciation dictionary. The pronunciation dictionary contains the vocabulary that the speech recognition system can process, together with its pronunciations; it effectively provides the mapping between the acoustic model and the language model.
3. A language model. The language model models the language targeted by the voice recognition system and establishes correlations between words. In general, either a rule-based language model or a statistical language model can be used as the speech recognition language model. In practical applications, resource-limited offline command-word recognition systems are based on rule-based language models, while large-vocabulary continuous speech recognition systems are based on statistical language models, including but not limited to N-gram models and recurrent neural network models.
4. A decoder. The decoder is one of the cores of a speech recognition system; its task is to find, according to the acoustic model, the language model and the dictionary, the word string that explains the input signal with the highest probability. The relationship between the above modules can be understood more clearly from a mathematical perspective.
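That mathematical perspective is the standard decoding decomposition (a textbook sketch, not quoted from the patent):

```latex
\hat{W} \;=\; \operatorname*{arg\,max}_{W}\, P(W \mid O)
        \;=\; \operatorname*{arg\,max}_{W}\, P(O \mid W)\, P(W)
```

Here $O$ is the acoustic observation sequence: the acoustic model (through the pronunciation dictionary) supplies $P(O \mid W)$, the language model supplies $P(W)$, and the decoder carries out the $\arg\max$ search.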
In an embodiment of the present invention, classification models of the modeling units in acoustic models built with a GMM (Gaussian mixture model) or a DNN (deep neural network) may be used.
Since HMMs (hidden Markov models) describe the time variability and short-time stationarity of speech well, they have been widely used in acoustic modeling for large-vocabulary continuous speech recognition systems.
The present invention further improves upon the existing WFST to enable it to identify the word tail time boundary of each word in the voice data.
Referring to fig. 1, a WFST is a weighted finite-state transducer used in large-scale speech recognition; each transition is labeled with an input symbol and an output symbol. The constructed network (WFST) thus generates a mapping from an input symbol sequence, or string, to an output string. In addition to input and output symbols, a WFST assigns weights to its state transitions. The weight may be an encoding probability, a duration, or any other quantity accumulated along a path (such as the 0.5 in fig. 1) to compute the overall weight of mapping the input string to the output string. When used for speech recognition, the WFST typically represents the various possible paths, and their corresponding probabilities, by which a recognition result can be output after a speech signal is input.
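The structure just described can be sketched as a small data type; the names (`Arc`, `path_weight`) are illustrative assumptions, not the patent's implementation:

```python
from dataclasses import dataclass

@dataclass
class Arc:
    """One weighted transition of a WFST."""
    src: int        # source state
    dst: int        # destination state
    ilabel: str     # input symbol; "" denotes a blank (epsilon) input
    olabel: str     # output symbol; "" denotes a blank output
    weight: float   # e.g. -log probability, accumulated along the path

def path_weight(arcs):
    # Weights accumulate along a path (addition in the -log domain).
    return sum(a.weight for a in arcs)

def path_output(arcs):
    # The output string produced by a path, skipping blank outputs.
    return [a.olabel for a in arcs if a.olabel]
```

A path mapping an input string to an output string then carries the overall weight returned by `path_weight`.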
Referring to fig. 2, the word boundary estimation method may include:
s11, acquiring acoustic characteristics of voice data to be subjected to voice recognition.
In a specific implementation, the user may input voice data through an electronic device configured with a sound card device such as a microphone.
The electronic device may be a mobile device, such as a mobile phone, a tablet computer, a personal digital assistant, a wearable device (such as glasses, a watch, etc.), or a fixed device, such as a personal computer, an intelligent television, an intelligent home/appliance (such as an air conditioner, an electric cooker, etc.), which is not limited in the embodiments of the present invention.
S12, framing the voice data, and extracting acoustic characteristics of each frame of voice.
After the voice data is acquired, it is divided into frames, and the acoustic features of each frame of speech are extracted. The acoustic features may include MFCC, Fbank, and the like.
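Framing in S12 can be sketched as follows; the frame and hop lengths are typical 25 ms / 10 ms values at 16 kHz, assumed here and not specified by the patent:

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a sample sequence into overlapping fixed-length frames.
    A real front end would also apply pre-emphasis, a window function,
    and padding before computing MFCC or Fbank features."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```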
S13, for each frame of voice, calculating posterior probability of the acoustic features on each acoustic modeling unit.
In this embodiment, the posterior probability of each frame of speech on each acoustic modeling unit is estimated using a deep neural network (DNN). The DNN is obtained through training on extensive data; its input is the acoustic features and its output is the posterior probabilities. The posterior probabilities serve as the weight values on the WFST edges when searching for the optimal path.
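The per-frame posteriors produced by a DNN's output layer are conventionally a softmax over the acoustic modeling units; a minimal sketch (the network itself is omitted):

```python
import math

def softmax(logits):
    """Turn one frame's DNN output scores into posterior probabilities
    over the acoustic modeling units (numerically stabilized)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```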
And S14, searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result.
The WFST model in this embodiment is the WFST model described above. The word tail time boundary is determined based on blank edges; the end time corresponding to the last blank edge is the word tail time boundary of the current word. The output of a blank edge is blank. Its input, however, is blank only when there is a pause: if the word tail is directly followed by more speech with no pause, as in continuous words such as 'eating' followed immediately by 'meal', the blank edge's input is not blank; if there is a pause between 'eating' and 'meal', say one second, the blank edge's input is blank as well.
In the embodiment of the disclosure, when the condition that the output is empty occurs in the searching process in the WFST model:
When the input is empty, there is a high likelihood of a pause in the speech; when the input is not blank, the word tail may be determined from meaningless speech (similar to a pause) or from redundant trailing sound after the valid information has been recognized. Typically, if a pause is recognized and a complete word was recognized by the WFST model before the pause, the pause marks that word's tail. If redundant trailing sound after valid information is recognized, the speech information has likely been recognized ahead of time by the WFST optimization algorithm, i.e. the input is not null; this is common with WFST optimization, and the specific method for identifying the tail time is shown in the scheme described later.
In this embodiment, voice data to be subjected to voice recognition is acquired; framing the voice data and extracting the acoustic characteristics of each frame of voice; for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit; searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result. Namely, by the method and the device, the time boundary information can be added to each word in the voice recognition process.
In addition, for word time boundary estimation during recognition, an HCLG decoding graph may be constructed, the word-lattice position information at each moment stored during decoding, and, after decoding is completed, the positions traced back to obtain the corresponding recognition result and its time boundary information. Specifically, searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result comprises the following steps:
A WFST optimization model in the WFST model is acquired; during the search in the WFST optimization model, if the voice recognition result of a word in the voice data is recognized, the time at which that result is obtained is taken as the word tail time boundary of the word. The WFST model comprises the WFST optimization model. The WFST optimization model does not need to run a complete Viterbi search over every frame of speech; that is, the final result can be obtained without searching to the actual word tail. Voice recognition is thus achieved, but the word tail information marked by the recognition is inconsistent with the actual word tail information: the ending time of the identified word is generally earlier than the actual ending time. The optimization algorithm used by the WFST optimization model may be an output-pushing or weight-pushing algorithm. Optimization operations on a WFST include epsilon removal, determinization, weight pushing, and minimization. The time shift in this embodiment results from weight pushing.
For example, assume the speech is the word "we" (Chinese 我们, wo-men). During recognition, when the "e" in "men" is recognized, the recognizer can already output the word "we", and that moment is taken as the word tail time boundary of "we". In reality, however, the speech "we" has not yet ended; the output on the WFST edge is advanced, i.e. the marked word tail time is not the actual word tail time.
In addition, in such a decoding mode, for a long recognition task the size of the generated lattice grows with time, the memory consumed by the system increases, and the time spent reclaiming dead paths in the lattice gradually increases. In the present invention, lattice generation is removed from the decoding process and the decoding result is stored directly during token passing: when a word is output, the current time is marked as that word's tail time; since the moment a word appears is not necessarily its tail time, the time boundary information of the current word is updated over the frames that follow. If the current output is empty, the current time is updated as the time boundary of the current word, until the next word is encountered.
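The boundary-update rule described above — mark the current time when a word is output, then keep pushing that word's boundary forward while the output stays empty, until the next word appears — can be sketched like this (the `Token` structure is hypothetical, not the patent's code):

```python
class Token:
    """Holds the decoded words and a tentative end time (frame index) per word."""
    def __init__(self):
        self.words = []
        self.boundaries = []

    def observe(self, frame_idx, output):
        if output:                 # a word is output: mark the current time
            self.words.append(output)
            self.boundaries.append(frame_idx)
        elif self.boundaries:      # empty output: push the current word's boundary
            self.boundaries[-1] = frame_idx

tok = Token()
for t, out in enumerate(["we", "", "", "eat", ""]):
    tok.observe(t, out)
# tok.words == ["we", "eat"]; "we" ends at frame 2, "eat" at frame 4
```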
In another embodiment of the present invention, step S14 "searching in WFST model based on the posterior probability, to obtain the recognition result of the speech data and the word tail time boundary of each word in the recognition result" is described in detail, referring to fig. 3, and specifically includes:
S24, obtaining a WFST optimization model in the WFST model; the WFST optimization model identifies that the ending time of a word in the speech data does not coincide with the actual ending time.
S25, storing WFST output of the current word in the token in the WFST optimization model searching process.
The WFST model includes the WFST optimization model, in which the identified ending time of a word in the speech data may differ from the actual ending time; this is the WFST optimization model described above.
The token comprises the output word and the time information of the output word. That is, this embodiment adopts token passing: the recognition result of the currently recognized word and its recognition time are stored in the token.
There may be multiple tokens: when searching in the WFST, one token is configured for each WFST search path to hold the time information, i.e. the number of tokens equals the number of search paths. When one WFST step is completed, the one or more highest-probability paths are retained from all possibilities, so the number of tokens changes dynamically. At the final output, the token corresponding to the highest-probability (i.e. lowest-cost) edge is selected, and the information in it is taken out as the final recognition result and the corresponding word tail information.
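Retaining the highest-probability paths and selecting the lowest-cost token at the end can be sketched as follows (function names assumed for illustration):

```python
def prune_tokens(tokens, beam=2):
    """Keep the `beam` lowest-cost tokens after a WFST step.
    Each token is a (cost, hypothesis) pair; lower cost = higher probability."""
    return sorted(tokens, key=lambda t: t[0])[:beam]

def best_result(tokens):
    """Final output: the hypothesis carried by the minimum-cost token."""
    return min(tokens, key=lambda t: t[0])[1]
```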
Token passing, also known as "token transmission", is a control method for sending data on a local ring network.
A token consists of a dedicated block of information; a typical token is a run of eight consecutive "1" bits. When all nodes of the network are idle, the token is passed from one node to the next. When a node needs to send information, it must first acquire the token and remove it from the network before sending. Once the data has been transferred, the token is passed to the next node; each node is equipped to transmit and receive the token. With this method no collision occurs, since only one node can transmit data at a time. Its biggest weakness is that if the token is lost or corrupted in transit, the nodes cannot find it and cannot transmit information.
During the search, a Viterbi algorithm may be employed to search in the WFST optimization model.
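A plain Viterbi search of the kind mentioned here, over a toy HMM in the log domain (a generic textbook implementation, not the patent's optimized search):

```python
def viterbi(obs_logprobs, trans_logprobs, init_logprobs):
    """obs_logprobs[t][s]: log score of state s at frame t;
    trans_logprobs[p][s]: log transition score p -> s.
    Returns the best state sequence."""
    n_states = len(init_logprobs)
    # delta[s]: best log score of any path ending in state s
    delta = [init_logprobs[s] + obs_logprobs[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(obs_logprobs)):
        new_delta, back = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + trans_logprobs[p][s])
            back.append(best_prev)
            new_delta.append(delta[best_prev] + trans_logprobs[best_prev][s]
                             + obs_logprobs[t][s])
        delta = new_delta
        backpointers.append(back)
    # backtrace from the best final state
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for back in reversed(backpointers):
        state = back[state]
        path.append(state)
    return list(reversed(path))
```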
S26, judging whether the word tail time boundary of the current word is determined; if yes, go to step S27.
In practical application, the process of determining the word end time boundary of the current word may be:
A group of blank edges immediately following the output of the current word is acquired, and the end time of the time information in the token corresponding to the last blank-output edge in the group is taken as the word tail time boundary of the current word.
When speaking, a user habitually pauses after a word, producing invalid speech; blank edges in the WFST optimization model are used to represent this invalid speech. For example, if the user says "we eat", there may be a pause between "we" and "eat", and the pause time can serve as the ending time. In the WFST optimization model, for each word, the group of blank-output edges immediately following the word is searched for (there may be only one); among these edges, the last one whose input is also blank is obtained, and its end time is output as the word tail time boundary. For example, suppose "we" is followed by blank edges that persist for a period of time until the next word appears; the last moment before the next word appears, i.e. the end time of the blank edges, is taken as the word tail time boundary of "we".
That is, to confirm that a blank edge obtained in the WFST model marks the end of a word, the positions of the blank-output edges must be examined: among the several blank-output edges immediately following the current word, the last edge whose input is blank marks the word tail.
Finally, the last of the blank-input edges is selected as the word tail, and the time information of that blank edge is stored in the token as the word tail time.
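The selection rule above — among the blank-output edges immediately following a word, take the last one whose input is also blank and use its end time — can be sketched as follows (the edge representation is assumed for illustration):

```python
def word_tail_time(blank_edges):
    """`blank_edges`: the group of blank-output edges that immediately
    follow a word, as (input_label, start_time, end_time) triples,
    where "" denotes a blank input. Returns the end time of the last
    blank-input edge, or None if the group contains none."""
    tail = None
    for ilabel, _start, end in blank_edges:
        if ilabel == "":
            tail = end     # keep overwriting: the last blank-input edge wins
    return tail
```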
S27, updating the content stored in the token.
If the word tail time boundary of the current word has been determined, the token is released, and the content in the token is updated to the next word to be recognized and the time information of that word.
S28, judging whether the word tail time boundary of each word in the voice data is determined; if yes, go to step S29, if not, go back to step S25.
S29, selecting the WFST output in the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data.
The voice recognition result comprises: and the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
When the word tail time boundary of every word in the voice data has been determined, the WFST output in the token with the minimum cost among all tokens at the current moment is selected as the voice recognition result of the voice data; when the word tail time boundary of some word has not yet been determined, recognition continues with the next speech.
In the embodiment, the word boundary information with relatively high accuracy can be obtained, and the problem of inaccurate word boundary record is solved.
It should be noted that, for the specific implementation process of steps S21-23 in this embodiment, please refer to the corresponding description in the above embodiment, and the detailed description is omitted here.
Optionally, on the basis of the above embodiment of the word boundary estimation method, another embodiment of the present invention provides a word boundary estimation device, referring to fig. 4, which may include:
A data acquisition module 101, configured to acquire voice data to be subjected to voice recognition;
the feature extraction module 102 is configured to frame the voice data and extract an acoustic feature of each frame of voice;
A probability calculation module 103, configured to calculate, for each frame of speech, a posterior probability of the acoustic feature on each acoustic modeling unit;
the time determining module 104 is configured to search in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges, and the output of a blank edge is blank. In addition, the input of a blank edge may also be blank.
In searching in the WFST model based on the posterior probability to obtain the recognition result of the voice data and the word tail time boundary of each word in the recognition result, the time determining module is specifically configured to:
acquire a WFST optimization model from the WFST model, and, in the process of searching the WFST optimization model, if the voice recognition result of a word in the voice data is recognized, take the time at which the voice recognition result is obtained as the word tail time boundary of the word; in the WFST optimization model, the time at which a word in the voice data is recognized does not coincide with the word's actual ending time.
In this embodiment, voice data to be subjected to voice recognition is acquired; the voice data is framed and the acoustic features of each frame of voice are extracted; for each frame of voice, the posterior probability of the acoustic features on each acoustic modeling unit is calculated; and a search is performed in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result. That is, with the method and device, time boundary information can be added to each word during voice recognition.
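The front-end that the modules above describe — framing the voice data, extracting a feature per frame, and turning acoustic scores into posterior probabilities over the modeling units — can be sketched as below. The 25 ms/10 ms window and hop sizes and the stand-in score function are illustrative assumptions; the patent does not specify the acoustic model.

```python
import math

def frame_signal(samples, rate, win_ms=25, hop_ms=10):
    """Split a waveform into overlapping frames (assumed 25 ms window, 10 ms hop)."""
    win = int(rate * win_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]

def softmax(scores):
    """Normalize raw acoustic scores into a posterior distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def frame_posteriors(frame, score_fn, n_units):
    """Posterior probability of one frame over each acoustic modeling unit.
    score_fn stands in for the acoustic model (a hypothetical callable)."""
    return softmax([score_fn(frame, u) for u in range(n_units)])
```

The resulting per-frame posterior vectors are what the time determining module consumes when searching the WFST model.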
It should be noted that, in the working process of each module in this embodiment, please refer to the corresponding description in the above embodiment, and no further description is given here.
In another embodiment of the present invention, the time determining module includes:
the model acquisition sub-module is used for acquiring a WFST optimization model from the WFST model; in the WFST optimization model, the time at which a word in the voice data is recognized is inconsistent with the word's actual ending time;
the information storage sub-module is used for storing the WFST output of the current word in the token in the WFST optimization model searching process; the token comprises: outputting words and time information of the output words;
The first judging submodule is used for judging whether the word tail time boundary of the current word is determined or not;
And the updating sub-module is used for updating the content stored in the token if the word tail time boundary of the current word is determined.
Further, the first judging submodule includes:
The blank edge obtaining unit is used for obtaining the group of blank edges adjacent to the output of the current word;
And the time determining unit is used for taking the end time, in the token's time information, corresponding to the last of these blank edges as the word tail time boundary of the current word.
The time determination module further includes:
The second judging sub-module is used for judging whether the word tail time boundary of each word in the recognition result is determined or not;
and the information storage sub-module is further used for storing the WFST output of the current word in the token in the WFST optimization model searching process if the second judging sub-module judges that the word tail time boundary of each word in the recognition result has not been determined.
The time determination module further includes:
The result determination submodule is used for selecting the output result of the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
In this embodiment, word boundary information with relatively high accuracy can be obtained, which solves the problem of inaccurate word boundary records.
It should be noted that, in the working process of each module and sub-module in this embodiment, please refer to the corresponding description in the above embodiment, and the description is omitted here.
Optionally, on the basis of the above embodiment of the word boundary estimation method, another embodiment of the present invention provides an electronic device, including: a memory and a processor;
Wherein the memory is used for storing programs;
The processor invokes the program and is configured to:
Acquiring voice data to be subjected to voice recognition;
Framing the voice data and extracting the acoustic characteristics of each frame of voice;
for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit;
Searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges; and the output of a blank edge is blank.
Further, searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, wherein the method comprises the following steps:
obtaining a WFST optimization model from the WFST model; in the WFST optimization model, the time at which a word in the voice data is recognized is inconsistent with the word's actual ending time;
During the WFST optimization model searching process, storing WFST output of the current word in a token; the token comprises: outputting words and time information of the output words;
judging whether the word tail time boundary of the current word has been determined;
and if the word tail time boundary of the current word is determined, updating the content stored in the token.
Further, determining the word tail time boundary of the current word includes:
acquiring the group of blank edges adjacent to the output of the current word;
and taking the end time, in the token's time information, corresponding to the last blank-output edge in the group as the word tail time boundary of the current word.
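The rule in the two steps above can be written as a small helper: given the edges that follow the current word's output, the end time of the last blank-output edge is the word tail time boundary. The (out_label, end_time) tuple representation is an assumption made for this sketch.

```python
def word_tail_time(trailing_edges, blank="<blank>"):
    """Return the word tail time boundary of the current word: the end time
    of the last blank-output edge among the edges following the word's output,
    or None if no blank edge has been seen yet (boundary undetermined)."""
    blank_times = [t for (out_label, t) in trailing_edges if out_label == blank]
    return blank_times[-1] if blank_times else None
```

A `None` result corresponds to the case where decoding must continue before the boundary can be recorded in the token.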
Further, searching in a WFST optimization model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, and further comprising:
judging whether the word tail time boundary of each word in the recognition result is determined or not;
If not, returning to the step of storing the WFST output of the current word in the token in the WFST optimization model searching process.
Further, after the word tail time boundary of each word in the recognition result of the voice data has been determined, the method further includes:
selecting the output result of the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
Further, the input of the blank edge is blank.
Further, searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result, wherein the method comprises the following steps:
obtaining a WFST optimization model from the WFST model; in the WFST optimization model, the time at which a word in the voice data is recognized is inconsistent with the word's actual ending time;
In the WFST optimization model searching process, if the voice recognition result of the word in the voice data is recognized, the time for obtaining the voice recognition result is taken as the word tail time boundary of the word.
In this embodiment, voice data to be subjected to voice recognition is acquired; the voice data is framed and the acoustic features of each frame of voice are extracted; for each frame of voice, the posterior probability of the acoustic features on each acoustic modeling unit is calculated; and a search is performed in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result. That is, with the method and device, time boundary information can be added to each word during voice recognition.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A word boundary estimation method, comprising:
Acquiring voice data to be subjected to voice recognition;
Framing the voice data and extracting the acoustic characteristics of each frame of voice;
for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit;
Searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges; and the output of a blank edge is blank.
2. The word boundary estimation method according to claim 1, wherein searching in a WFST model based on the posterior probability to obtain the recognition result of the speech data and the word end time boundary of each word in the recognition result comprises:
obtaining a WFST optimization model from the WFST model; in the WFST optimization model, the time at which a word in the voice data is recognized is inconsistent with the word's actual ending time;
storing WFST output of the current word in a token in the WFST optimization model searching process; the token comprises: outputting words and time information of the output words;
judging whether the word tail time boundary of the current word has been determined;
and if the word tail time boundary of the current word is determined, updating the content stored in the token.
3. The word boundary estimation method according to claim 2, wherein determining the word tail time boundary of the current word includes:
acquiring the group of blank edges adjacent to the output of the current word;
and taking the end time, in the token's time information, corresponding to the last blank-output edge in the group as the word tail time boundary of the current word.
4. The word boundary estimation method according to claim 3, wherein searching in a WFST model based on the posterior probability obtains a recognition result of the speech data and a word end time boundary of each word in the recognition result, further comprising:
judging whether the word tail time boundary of each word in the recognition result is determined or not;
If not, returning to the step of storing the WFST output of the current word in the token in the WFST optimization model searching process.
5. The word boundary estimation method according to claim 4, further comprising, after the word tail time boundary of each word in the recognition result of the voice data has been determined:
selecting the output result of the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
6. The word boundary estimation method of any one of claims 1-5, wherein the input of the blank edge is blank.
7. A word boundary estimating apparatus, comprising:
the data acquisition module is used for acquiring voice data to be subjected to voice recognition;
The feature extraction module is used for framing the voice data and extracting the acoustic feature of each frame of voice;
the probability calculation module is used for calculating posterior probability of the acoustic features on each acoustic modeling unit for each frame of voice;
the time determining module is used for searching in the WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges; and the output of a blank edge is blank.
8. The word boundary estimation device of claim 7, wherein the time determination module comprises:
the model acquisition sub-module is used for acquiring a WFST optimization model from the WFST model; in the WFST optimization model, the time at which a word in the voice data is recognized is inconsistent with the word's actual ending time;
the information storage sub-module is used for storing the WFST output of the current word in the token in the WFST optimization model searching process; the token comprises: outputting words and time information of the output words;
The first judging submodule is used for judging whether the word tail time boundary of the current word is determined or not;
And the updating sub-module is used for updating the content stored in the token if the word tail time boundary of the current word is determined.
9. The word boundary estimation device of claim 8, wherein the first judgment sub-module comprises:
The blank edge obtaining unit is used for obtaining the group of blank edges adjacent to the output of the current word;
And the time determining unit is used for taking the end time, in the token's time information, corresponding to the last of these blank edges as the word tail time boundary of the current word.
10. The word boundary estimation device of claim 9, wherein the time determination module further comprises:
The second judging sub-module is used for judging whether the word tail time boundary of each word in the recognition result is determined or not;
and the information storage sub-module is further used for storing the WFST output of the current word in the token in the WFST optimization model searching process if the second judging sub-module judges that the word tail time boundary of each word in the recognition result has not been determined.
11. The word boundary estimation device of claim 10, wherein the time determination module further comprises:
The result determination submodule is used for selecting the output result of the token with the minimum cost among all tokens at the current moment as the voice recognition result of the voice data; the voice recognition result comprises: the recognition result of the voice data and the word tail time boundary of each word in the recognition result.
12. An electronic device, comprising: a memory and a processor;
Wherein the memory is used for storing programs;
The processor invokes the program and is configured to:
Acquiring voice data to be subjected to voice recognition;
Framing the voice data and extracting the acoustic characteristics of each frame of voice;
for each frame of speech, calculating a posterior probability of the acoustic features on each acoustic modeling unit;
Searching in a WFST model based on the posterior probability to obtain a recognition result of the voice data and a word tail time boundary of each word in the recognition result; the word tail time boundary is determined based on blank edges; and the output of a blank edge is blank.
CN201910832104.1A 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment Active CN112447169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910832104.1A CN112447169B (en) 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910832104.1A CN112447169B (en) 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112447169A CN112447169A (en) 2021-03-05
CN112447169B true CN112447169B (en) 2024-04-19

Family

ID=74734710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910832104.1A Active CN112447169B (en) 2019-09-04 2019-09-04 Word boundary estimation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112447169B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825849A (en) * 2016-04-06 2016-08-03 普强信息技术(北京)有限公司 Time position keyword hit analysis method based on identification result time boundary
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN106448660A (en) * 2016-10-31 2017-02-22 闽江学院 Natural language fuzzy boundary determining method with introduction of big data analysis
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8571866B2 (en) * 2009-10-23 2013-10-29 At&T Intellectual Property I, L.P. System and method for improving speech recognition accuracy using textual context
KR20140147587A (en) * 2013-06-20 2014-12-30 한국전자통신연구원 A method and apparatus to detect speech endpoint using weighted finite state transducer

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157953A (en) * 2015-04-16 2016-11-23 科大讯飞股份有限公司 continuous speech recognition method and system
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined
CN105869628A (en) * 2016-03-30 2016-08-17 乐视控股(北京)有限公司 Voice endpoint detection method and device
CN105825849A (en) * 2016-04-06 2016-08-03 普强信息技术(北京)有限公司 Time position keyword hit analysis method based on identification result time boundary
CN106448660A (en) * 2016-10-31 2017-02-22 闽江学院 Natural language fuzzy boundary determining method with introduction of big data analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Inactive-node detection and memory optimization of the WFST decoder word-lattice generation algorithm…; Ding Jiawei et al.; Journal of University of Chinese Academy of Sciences; 2019-01-31; Vol. 36 (No. 1); pp. 109-114 *

Also Published As

Publication number Publication date
CN112447169A (en) 2021-03-05

Similar Documents

Publication Publication Date Title
US11514891B2 (en) Named entity recognition method, named entity recognition equipment and medium
CN108766414B (en) Method, apparatus, device and computer-readable storage medium for speech translation
EP2700071B1 (en) Speech recognition using multiple language models
CN108899013B (en) Voice search method and device and voice recognition system
CN108694949B (en) Speaker identification method and device based on reordering supervectors and residual error network
CN112420026B (en) Optimizing keyword retrieval system
CN106875936B (en) Voice recognition method and device
CN110675855A (en) Voice recognition method, electronic equipment and computer readable storage medium
CN111028842B (en) Method and equipment for triggering voice interaction response
CN111402891A (en) Speech recognition method, apparatus, device and storage medium
EP3739583A1 (en) Dialog device, dialog method, and dialog computer program
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
CN104199825A (en) Information inquiry method and system
CN117980991A (en) On-line speaker logging based on speaker transformations using constrained spectral clustering
CN117099157A (en) Multitasking learning for end-to-end automatic speech recognition confidence and erasure estimation
CN111640423B (en) Word boundary estimation method and device and electronic equipment
KR20190032868A (en) Method and apparatus for voice recognition
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN113571045A (en) Minnan language voice recognition method, system, equipment and medium
CN112199498A (en) Man-machine conversation method, device, medium and electronic equipment for endowment service
CN112447169B (en) Word boundary estimation method and device and electronic equipment
CN110809796B (en) Speech recognition system and method with decoupled wake phrases
CN113724698B (en) Training method, device, equipment and storage medium of voice recognition model
CN114974310A (en) Emotion recognition method and device based on artificial intelligence, computer equipment and medium
CN114299930A (en) End-to-end speech recognition model processing method, speech recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant