CN112259082B - Real-time voice recognition method and system - Google Patents


Info

Publication number
CN112259082B
CN112259082B (application CN202011207353.0A)
Authority
CN
China
Prior art keywords
frame
token
path
recognition result
tokens
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011207353.0A
Other languages
Chinese (zh)
Other versions
CN112259082A (en)
Inventor
蒋子缘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202011207353.0A priority Critical patent/CN112259082B/en
Publication of CN112259082A publication Critical patent/CN112259082A/en
Application granted granted Critical
Publication of CN112259082B publication Critical patent/CN112259082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/081 Search algorithms, e.g. Baum-Welch or Viterbi
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts
    • G10L2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a real-time speech recognition method comprising the following steps: in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech; determining the path of the current best recognition result, which is formed by connecting at least N tokens from the first frame to the Nth frame; selecting, in that path, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame; and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice. An embodiment of the invention also provides a real-time speech recognition system. By limiting the history of the current tokens with the truncation token, the recognition result is fixed in advance, which avoids the situation in which recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.

Description

Real-time voice recognition method and system
Technical Field
The invention relates to the field of voice recognition, in particular to a real-time voice recognition method and a real-time voice recognition system.
Background
Real-time speech recognition systems are usually applied in intelligent devices with a real-time dialogue function or in real-time transcription devices, and such devices have strong real-time display requirements. This is achieved by taking the current best recognition result at intervals while audio is continuously fed in. For longer audio, the system relies on a voice-activity detection module: when silence is found, recognition is ended in time so that the recognition result becomes fixed.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
under continuous speech input, the voice-activity detection module may fail to detect silence for a long time, so recognition cannot be cut off and the result cannot be fixed. Using a more sensitive voice-activity detection module makes recognition truncation more frequent, but it often truncates where the recognition context is unreasonable, which greatly reduces recognition accuracy. In addition, while audio is continuously fed in, the best result of each frame does not merely grow incrementally with the input; results produced earlier may also change. Depending on the language model and the acoustic model, even quite early results may vary. In some application scenarios, such as real-time speech transcription, this leaves the user with a bad impression.
Disclosure of Invention
The embodiments of the invention aim to solve the problem in the prior art that, when silence is not detected for a long time, recognition cannot be cut off and the result cannot be fixed, so that recognition results produced earlier may still change: content recognized early is affected by content that arrives later and may, in some cases, change incorrectly.
In a first aspect, an embodiment of the present invention provides a real-time speech recognition method, including:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
In a second aspect, an embodiment of the present invention provides a real-time speech recognition system, including:
a token determination program module, configured to determine, in the token-passing process, at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
a best-path determination program module, configured to determine the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
a truncation program module, configured to select, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and to extract a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and a recognition program module, configured to extract the path of the best recognition result from the first frame to the i-th frame from the lattice.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the real-time speech recognition method of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the real-time speech recognition method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effect: because the part of the recognition result many frames before the current frame in the best path tends to change little during token passing, a truncation token is selected there. The truncation token limits the history of the current tokens and fixes the recognition result in advance, avoiding the problem that recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.
Drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for real-time speech recognition according to an embodiment of the present invention;
FIG. 2 is a process diagram of a fixed recognition result of a real-time speech recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a real-time speech recognition system according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a real-time speech recognition method according to an embodiment of the present invention, which includes the following steps:
s11: in the token passing process, determining at least one token of each frame from a first frame to an Nth frame in the collected real-time voice, wherein the initial token in the token passing process is an initial token;
s12: determining a path of a current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from a first frame to an Nth frame;
s13: selecting tokens which have direct connection relation with the (i + 1) th frame in the ith frame in the path of the current best recognition result as truncation tokens, and extracting a grid formed by paths of a plurality of recognition results from a historical token group formed by the truncation tokens from the initial token to the ith frame;
s14: and extracting a path of the best recognition result from the first frame to the ith frame from the grid.
In this embodiment, speech recognition typically uses token passing to find the most likely sequence of recognition results. Token passing is an implementation of the Viterbi algorithm.
For step S11, audio is captured in real time, for example the voice of a user speaking during a smart interaction or speech coming from a television. The audio is processed by token passing. During token passing, the progress of each search path at each state is recorded frame by frame using tokens: each token records the probability information of a certain frame in a certain state, together with the history of the searched path. In this way, at least one token can be determined for each frame from frame 1 to frame N of the captured real-time speech, as shown in Fig. 2. The figure contains 5 frames: for example, the first frame has 1 token, the second frame has 3 tokens, and the third frame has 4 tokens. Token number 1 in the first frame is the start token.
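As an illustrative sketch (the field names are hypothetical, not taken from the patent), a token can be modeled as a record carrying its state, its accumulated probability, and a back-pointer encoding the history of the searched path:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    token_id: int                    # e.g. token number 1 is the start token
    frame: int                       # frame index the token belongs to
    state: int                       # decoding-graph state the token occupies
    log_prob: float                  # accumulated log-probability of the path
    prev: Optional["Token"] = None   # back-pointer: history of the searched path

# Frame 1 holds only the start token; later frames hold several tokens each.
start = Token(token_id=1, frame=1, state=0, log_prob=0.0)
frames = {1: [start]}
```

Following the start token's `prev` chain backwards recovers the full path history of any later token.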
For step S12, during the search, a token is copied to every state that has a connection to its current state, and the incremental probability and the history are recorded. As tokens advance, only some of the tokens with the highest probability are retained. Because the algorithm keeps tokens at multiple states per frame, the path represented by any one of these tokens may turn out to be the most likely path at the last frame. Once multiple paths exist, the historical tokens are pruned at some extra cost, and the path of the current best recognition result is then determined; this path is formed by connecting at least N tokens from the first frame to the Nth frame. In Fig. 2, the path from frame 1 to frame 5 is the path of the current best recognition result; for example, there are 6 tokens in this best path from frame 1 to frame 5.
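The copy-and-prune step can be sketched as follows; the transition graph, uniform emission scores, and beam width below are invented for illustration and are not taken from the patent:

```python
def advance_frame(tokens, transitions, emit_logp, beam=10.0):
    """One step of token passing: each surviving token is copied to every
    state connected to its current state, the incremental log-probability
    and path history are recorded, and only tokens whose score lies
    within `beam` of the best one are kept (beam pruning)."""
    best = {}  # state -> (log_prob, history)
    for logp, hist in tokens:
        for nxt in transitions.get(hist[-1], []):
            cand = logp + emit_logp(nxt)
            if nxt not in best or cand > best[nxt][0]:
                best[nxt] = (cand, hist + [nxt])
    if not best:
        return []
    top = max(score for score, _ in best.values())
    return [tok for tok in best.values() if tok[0] >= top - beam]

# Tiny graph: state 0 fans out to states 1 and 2, which both reach 3.
transitions = {0: [1, 2], 1: [3], 2: [3]}
frame1 = [(0.0, [0])]
frame2 = advance_frame(frame1, transitions, lambda s: -1.0)
# Two tokens survive at frame 2: one in state 1 and one in state 2.
```

Each retained token carries its full history, which is what the later truncation step walks backwards over.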
For step S13, in the path of the current best recognition result, the token that has the direct connection between frame 3 and frame 4 (token number 6 of frame 3) is selected as the truncation token. Due to the nature of token passing, the part of the recognition result in the best path many frames before the current frame tends to change little; based on this phenomenon, a token with such a direct connection is chosen. Walking backwards from token 6, a lattice is extracted from the historical tokens. The paths in the lattice are: 1-2-5-6; 1-2-6; 1-3-6.
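Under the assumption that each token keeps back-pointers to its predecessors, the lattice of Fig. 2 can be enumerated with a simple backward walk (the predecessor encoding below is hypothetical):

```python
def extract_lattice(preds, start_id, trunc_id):
    """Enumerate every path from the start token to the truncation token
    using the recorded predecessor links. This reproduces the small
    example of Fig. 2, where the lattice from token 1 to token 6
    contains the paths 1-2-5-6, 1-2-6, and 1-3-6."""
    paths = []
    def walk(tok, suffix):
        if tok == start_id:
            paths.append([start_id] + suffix)
            return
        for p in preds.get(tok, []):
            walk(p, [tok] + suffix)
    walk(trunc_id, [])
    return sorted(paths)

# Predecessor links for the Fig. 2 example (hypothetical encoding).
preds = {6: [5, 2, 3], 5: [2], 2: [1], 3: [1]}
lattice = extract_lattice(preds, 1, 6)
# lattice == [[1, 2, 5, 6], [1, 2, 6], [1, 3, 6]]
```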
In one embodiment, i is chosen relative to the number of intermediate frames between the first frame and the Nth frame. A token with a direct connection between frame 1 and frame 2 might otherwise be selected, which would truncate far too early, so the choice of i is restricted. For example, with 8 frames in total from the first frame to the Nth frame, i may be taken at the 4th or 5th frame.
For step S14, the path of the best recognition result is selected from the lattice according to the state probabilities; for example, path 1-2-5-6 is selected as the best recognition result within the truncated range from frame 1 to frame 3. This further limits the history of the current tokens and fixes the recognition result in advance.
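Selecting the best path from the extracted lattice then reduces to a maximum over path scores; the scores below are invented so that path 1-2-5-6 wins, matching the example:

```python
def best_path(lattice_paths, path_logp):
    """Pick the highest-scoring path from the extracted lattice.
    `path_logp` maps a path (as a tuple of token ids) to its accumulated
    log-probability; the names and scores here are hypothetical."""
    return max(lattice_paths, key=lambda p: path_logp[tuple(p)])

# Illustrative scores only: chosen so the Fig. 2 path 1-2-5-6 wins.
paths = [[1, 2, 5, 6], [1, 2, 6], [1, 3, 6]]
scores = {(1, 2, 5, 6): -3.0, (1, 2, 6): -4.5, (1, 3, 6): -5.0}
# best_path(paths, scores) == [1, 2, 5, 6]
```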
As this embodiment shows, because the part of the recognition result many frames before the current frame in the best path tends to change little during token passing, a truncation token is selected. The truncation token limits the history of the current tokens and fixes the recognition result in advance, avoiding the problem that recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.
As an implementation manner, in this embodiment, after the path of the best recognition result from the first frame to the i-th frame is extracted from the lattice, the method further comprises:
clipping the tokens of each frame from the (i + 1)-th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the i-th frame.
In this embodiment, since the path of the best recognition result from the first frame to the i-th frame is now fixed, the tokens of each frame from the (i + 1)-th frame to the Nth frame are clipped anew. For example, in the steps above, the best recognition result from frame 1 to frame 3 is fixed; subsequently, the tokens of frame 4 and frame 5 are re-clipped.
As an embodiment, the clipping includes:
updating the probability of each token for each frame from the (i + 1)-th frame to the Nth frame; and/or
removing, from the historical token group, the paths that do not include the truncation token.
Each frame from the (i + 1)-th frame to the current frame (the Nth frame) is traversed in breadth-first order; the probability of each token is updated, and the paths that do not contain token 6 are removed from the histories of the tokens that can reach token 6. Starting from frame i + 2, the probability of any token that cannot be reached from token 6 is marked as infinitely small. Because the historical tokens have already been cut, far fewer tokens are traversed in this step than during token passing, so it causes little overhead. The search then continues with the next frame, and beam pruning removes the tokens with infinitely small probability during the search, thereby realizing the subsequent truncation and fixing the recognition result in advance. The clipping is performed frame by frame, and in the subsequent token-passing process, at least one token of each frame is a clipped token.
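The breadth-first clipping step can be sketched as follows; the successor links mirror the Fig. 2 example, but the exact encoding is an assumption:

```python
import math
from collections import deque

def clip_after_truncation(succ, trunc_id, later_tokens):
    """Breadth-first traversal rooted at the truncation token: every later
    token reachable from it keeps its score, while every unreachable token
    is marked -inf so that beam pruning removes it on the next frame."""
    reachable = {trunc_id}
    queue = deque([trunc_id])
    while queue:
        for nxt in succ.get(queue.popleft(), []):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    return {t: (0.0 if t in reachable else -math.inf) for t in later_tokens}

# Fig. 2 example (hypothetical links): token 6 reaches token 7; tokens 6
# and 7 reach 9, 10, and 11 on the next frame; token 12 is unreachable.
succ = {6: [7, 9], 7: [9, 10, 11]}
marks = clip_after_truncation(succ, 6, {7, 9, 10, 11, 12})
# marks[12] == -inf; every other token keeps a finite score.
```

Because only tokens reachable from the truncation token are visited, this traversal touches far fewer tokens than the full token-passing search, matching the low-overhead claim above.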
As this embodiment shows, the clipping prepares for the subsequent truncation process: it shrinks the subsequent search and improves efficiency.
A complete, detailed description is given based on Fig. 2, in which (i) denotes the red path, (ii) the yellow path, (iii) the blue path, and (iv) the black path. A lattice is extracted from the paths from frame 1 to frame 3: the lattice node corresponding to token 1 is set as the start node and the lattice node corresponding to token 6 as the end node, via the FST connect operation (for the specific operation, see the fst/connect documentation on the OpenFst website). The lattice arcs corresponding to the red and yellow paths in frames 1 to 3 of Fig. 2 are retained, and the black path is deleted. The best result in the lattice corresponds to the red path in the figure. If the lattice is rescored later, the result may change to the one corresponding to the yellow path; but from this point on, the result for frames 1 to 3 is fixed.
Then, starting from frame 3 and proceeding frame by frame, a breadth-first traversal rooted at token number 6 is performed, and the tokens reachable from it are recorded; afterwards, the links into tokens that do not come from this reachable set are cut. For example, token 6 can only reach token 7 in frame 3, so the probability of token 7 is updated according to the probability on the path from token 6 to token 7; when token 7 is processed, the link from token 3 to token 7 is deleted. Next, tokens 9, 10, and 11 of frame 4, which are connected to tokens 6 and 7, are processed, and their probabilities are recalculated from the new probabilities of tokens 6 and 7. The connection from token 8 to token 12 is broken, and the score of token 12, which is not traversed, is marked as infinitely small; and so on. When frame 6 is processed, token 15 is clipped because its probability has been marked as infinitely small, while tokens 13 and 14 continue to pass down according to the token-passing algorithm.
Fig. 3 is a schematic structural diagram of a real-time speech recognition system according to an embodiment of the present invention, which can execute the real-time speech recognition method according to any of the above embodiments and is configured in a terminal.
The real-time speech recognition system provided by this embodiment comprises: a token determination program module 11, a best-path determination program module 12, a truncation program module 13, and a recognition program module 14.
The token determination program module 11 is configured to determine, in the token-passing process, at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token; the best-path determination program module 12 is configured to determine the path of the current best recognition result based on the state probabilities of the tokens of each frame, the path being formed by connecting at least N tokens from the first frame to the Nth frame; the truncation program module 13 is configured to select, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and to extract a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame; and the recognition program module 14 is configured to extract the path of the best recognition result from the first frame to the i-th frame from the lattice.
Further, the system is also configured to:
and clipping the token of each frame from the (i + 1) th frame to the N (N) th frame based on the determined path of the best recognition result from the first frame to the ith frame.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the real-time voice recognition method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the real-time speech recognition method of any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the real-time speech recognition method of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and primarily aim to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also provide mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A real-time speech recognition method, comprising:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
2. The method of claim 1, wherein i is related to the number of intermediate frames between the first frame and the Nth frame.
3. The method of claim 1, wherein after the path of the best recognition result from the first frame to the i-th frame is extracted from the lattice, the method further comprises:
clipping the tokens of each frame from the (i + 1)-th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the i-th frame.
4. The method of claim 3, wherein the clipping comprises:
updating the probability of each token for each frame from the (i + 1)-th frame to the Nth frame; and/or
removing, from the historical token group, the paths that do not include the truncation token.
5. The method of claim 3, wherein the clipping is performed frame by frame.
6. The method of claim 1, wherein, in the subsequent token-passing process, at least one token of each frame of the captured real-time speech is a clipped token.
7. A real-time speech recognition system comprising:
a token determining program module, configured to determine at least one token of each frame from a first frame to an Nth frame in collected real-time voice in the process of token passing, wherein the starting token of the token passing process is an initial token;
a best path determining program module, configured to determine a path of a current best recognition result based on the state probabilities of the tokens of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
a truncation program module, configured to select, as a truncation token, the token in the ith frame of the path of the current best recognition result which has a direct connection relation with the (i+1)th frame, and to extract a lattice formed by paths of a plurality of recognition results from a historical token group formed from the initial token to the truncation token of the ith frame;
and a recognition program module, configured to extract, from the lattice, the path of the best recognition result from the first frame to the ith frame.
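The four program modules of claim 7 can be read as a pipeline over the captured frames. The sketch below wires four caller-supplied callables in that order; every name here is illustrative, not from the patent.

```python
def recognize_partial(frames, token_module, path_module,
                      trunc_module, result_module):
    """Compose the four claim-7 modules into one partial-recognition step."""
    history = token_module(frames)            # module 1: tokens per frame
    best = path_module(history)               # module 2: current best path
    lattice, i = trunc_module(history, best)  # module 3: truncation + lattice
    return result_module(lattice, i)          # module 4: best partial result
```

Keeping the modules as separate callables mirrors the claim's decomposition and lets each stage be tested or replaced independently.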
8. The system of claim 7, wherein the system is further configured to:
and clipping the tokens of each frame from the (i+1)th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the ith frame.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011207353.0A 2020-11-03 2020-11-03 Real-time voice recognition method and system Active CN112259082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011207353.0A CN112259082B (en) 2020-11-03 2020-11-03 Real-time voice recognition method and system

Publications (2)

Publication Number Publication Date
CN112259082A CN112259082A (en) 2021-01-22
CN112259082B CN112259082B (en) 2022-04-01

Family

ID=74268264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011207353.0A Active CN112259082B (en) 2020-11-03 2020-11-03 Real-time voice recognition method and system

Country Status (1)

Country Link
CN (1) CN112259082B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08508583A (en) * 1993-03-31 1996-09-10 ブリテイッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー Connection speech recognition
US6272455B1 (en) * 1997-10-22 2001-08-07 Lucent Technologies, Inc. Method and apparatus for understanding natural language
CN1201284C (en) * 2002-11-15 2005-05-11 中国科学院声学研究所 Rapid decoding method for voice identifying system
CN104157285B (en) * 2013-05-14 2016-01-20 腾讯科技(深圳)有限公司 Audio recognition method, device and electronic equipment
US9530404B2 (en) * 2014-10-06 2016-12-27 Intel Corporation System and method of automatic speech recognition using on-the-fly word lattice generation with word histories
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
CN110047477B (en) * 2019-04-04 2021-04-09 北京清微智能科技有限公司 Optimization method, equipment and system of weighted finite state converter
CN111862943B (en) * 2019-04-30 2023-07-25 北京地平线机器人技术研发有限公司 Speech recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112259082A (en) 2021-01-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant