CN112259082B - Real-time voice recognition method and system - Google Patents


Info

Publication number
CN112259082B
CN112259082B (application CN202011207353.0A)
Authority
CN
China
Prior art keywords
frame
token
path
recognition result
tokens
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011207353.0A
Other languages
Chinese (zh)
Other versions
CN112259082A (en)
Inventor
蒋子缘
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202011207353.0A priority Critical patent/CN112259082B/en
Publication of CN112259082A publication Critical patent/CN112259082A/en
Application granted granted Critical
Publication of CN112259082B publication Critical patent/CN112259082B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L2015/081 Search algorithms, e.g. Baum-Welch or Viterbi
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/222 Barge in, i.e. overridable guidance for interrupting prompts
    • G10L2015/221 Announcement of recognition results

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention provides a real-time speech recognition method comprising the following steps: in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech; determining the path of the current best recognition result, which is formed by connecting at least N tokens from the first frame to the Nth frame; selecting, in that path, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame; and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice. An embodiment of the invention also provides a real-time speech recognition system. By limiting the history of the current tokens with the truncation token, the recognition result is fixed in advance, which avoids the situation in which recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.

Description

Real-time voice recognition method and system
Technical Field
The invention relates to the field of voice recognition, in particular to a real-time voice recognition method and a real-time voice recognition system.
Background
Real-time speech recognition systems are usually applied in intelligent devices with a real-time dialogue function or in real-time transcription devices, and such devices have strong real-time display requirements. This is achieved by taking the current best recognition result at intervals while audio is continuously fed in. For longer audio, the system relies on a voice-activity detection module: when silence is found, recognition is ended in time so that the recognition result becomes fixed.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
under continuous speech input, the voice-activity detection module may fail to detect silence for a long time, so recognition cannot be cut off and the result cannot be fixed. Using a more sensitive voice-activity detection module makes recognition truncation more frequent, but it often truncates where the recognition context is unreasonable, which greatly reduces recognition accuracy. In addition, while audio is continuously fed in, the best result of each frame does not merely grow incrementally with the input; results produced earlier may also change. Depending on the language model and the acoustic model, even quite early results may vary. In some application scenarios, such as real-time speech transcription, this leaves the user with a bad impression.
Disclosure of Invention
The embodiments of the invention aim to solve the problem in the prior art that, when silence is not detected for a long time, recognition cannot be cut off and the result cannot be fixed, so that recognition results produced earlier may still change: content recognized early is affected by content that arrives later and may, in some cases, change incorrectly.
In a first aspect, an embodiment of the present invention provides a real-time speech recognition method, including:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
In a second aspect, an embodiment of the present invention provides a real-time speech recognition system, including:
a token determination program module, configured to determine, in the token-passing process, at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
a best-path determination program module, configured to determine the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
a truncation program module, configured to select, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and to extract a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and a recognition program module, configured to extract the path of the best recognition result from the first frame to the i-th frame from the lattice.
In a third aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the real-time speech recognition method of any of the embodiments of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the real-time speech recognition method according to any embodiment of the present invention.
The embodiments of the invention have the following beneficial effect: because the part of the recognition result many frames before the current frame in the best path tends to change little during token passing, a truncation token is selected there. The truncation token limits the history of the current tokens and fixes the recognition result in advance, avoiding the problem that recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.
Drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a method for real-time speech recognition according to an embodiment of the present invention;
FIG. 2 is a process diagram of a fixed recognition result of a real-time speech recognition method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a real-time speech recognition system according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments derived by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a real-time speech recognition method according to an embodiment of the present invention, which includes the following steps:
s11: in the token passing process, determining at least one token of each frame from a first frame to an Nth frame in the collected real-time voice, wherein the initial token in the token passing process is an initial token;
s12: determining a path of a current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from a first frame to an Nth frame;
s13: selecting tokens which have direct connection relation with the (i + 1) th frame in the ith frame in the path of the current best recognition result as truncation tokens, and extracting a grid formed by paths of a plurality of recognition results from a historical token group formed by the truncation tokens from the initial token to the ith frame;
s14: and extracting a path of the best recognition result from the first frame to the ith frame from the grid.
In this embodiment, speech recognition typically uses token passing to find the most likely sequence of recognition results. Token passing is an implementation of the Viterbi algorithm.
For step S11, audio is captured in real time, for example the voice of a user speaking during a smart interaction or speech coming from a television. The audio is processed by token passing. During token passing, the progress of each search path at each state is recorded frame by frame using tokens: each token records the probability information of a certain frame in a certain state, together with the history of the searched path. In this way, at least one token can be determined for each frame from frame 1 to frame N of the captured real-time speech, as shown in Fig. 2. The figure contains 5 frames: for example, the first frame has 1 token, the second frame has 3 tokens, and the third frame has 4 tokens. Token number 1 in the first frame is the start token.
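As an illustrative sketch (the field names are hypothetical, not taken from the patent), a token can be modeled as a record carrying its state, its accumulated probability, and a back-pointer encoding the history of the searched path:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    token_id: int                    # e.g. token number 1 is the start token
    frame: int                       # frame index the token belongs to
    state: int                       # decoding-graph state the token occupies
    log_prob: float                  # accumulated log-probability of the path
    prev: Optional["Token"] = None   # back-pointer: history of the searched path

# Frame 1 holds only the start token; later frames hold several tokens each.
start = Token(token_id=1, frame=1, state=0, log_prob=0.0)
frames = {1: [start]}
```

Following the start token's `prev` chain backwards recovers the full path history of any later token.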
For step S12, during the search, a token is copied to every state that has a connection to its current state, and the incremental probability and the history are recorded. As tokens advance, only some of the tokens with the highest probability are retained. Because the algorithm keeps tokens at multiple states per frame, the path represented by any one of these tokens may turn out to be the most likely path at the last frame. Once multiple paths exist, the historical tokens are pruned at some extra cost, and the path of the current best recognition result is then determined; this path is formed by connecting at least N tokens from the first frame to the Nth frame. In Fig. 2, the path from frame 1 to frame 5 is the path of the current best recognition result; for example, there are 6 tokens in this best path from frame 1 to frame 5.
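The copy-and-prune step can be sketched as follows; the transition graph, uniform emission scores, and beam width below are invented for illustration and are not taken from the patent:

```python
def advance_frame(tokens, transitions, emit_logp, beam=10.0):
    """One step of token passing: each surviving token is copied to every
    state connected to its current state, the incremental log-probability
    and path history are recorded, and only tokens whose score lies
    within `beam` of the best one are kept (beam pruning)."""
    best = {}  # state -> (log_prob, history)
    for logp, hist in tokens:
        for nxt in transitions.get(hist[-1], []):
            cand = logp + emit_logp(nxt)
            if nxt not in best or cand > best[nxt][0]:
                best[nxt] = (cand, hist + [nxt])
    if not best:
        return []
    top = max(score for score, _ in best.values())
    return [tok for tok in best.values() if tok[0] >= top - beam]

# Tiny graph: state 0 fans out to states 1 and 2, which both reach 3.
transitions = {0: [1, 2], 1: [3], 2: [3]}
frame1 = [(0.0, [0])]
frame2 = advance_frame(frame1, transitions, lambda s: -1.0)
# Two tokens survive at frame 2: one in state 1 and one in state 2.
```

Each retained token carries its full history, which is what the later truncation step walks backwards over.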
For step S13, in the path of the current best recognition result, the token that has the direct connection between frame 3 and frame 4 (token number 6 of frame 3) is selected as the truncation token. Due to the nature of token passing, the part of the recognition result in the best path many frames before the current frame tends to change little; based on this phenomenon, a token with such a direct connection is chosen. Walking backwards from token 6, a lattice is extracted from the historical tokens. The paths in the lattice are: 1-2-5-6; 1-2-6; 1-3-6.
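Under the assumption that each token keeps back-pointers to its predecessors, the lattice of Fig. 2 can be enumerated with a simple backward walk (the predecessor encoding below is hypothetical):

```python
def extract_lattice(preds, start_id, trunc_id):
    """Enumerate every path from the start token to the truncation token
    using the recorded predecessor links. This reproduces the small
    example of Fig. 2, where the lattice from token 1 to token 6
    contains the paths 1-2-5-6, 1-2-6, and 1-3-6."""
    paths = []
    def walk(tok, suffix):
        if tok == start_id:
            paths.append([start_id] + suffix)
            return
        for p in preds.get(tok, []):
            walk(p, [tok] + suffix)
    walk(trunc_id, [])
    return sorted(paths)

# Predecessor links for the Fig. 2 example (hypothetical encoding).
preds = {6: [5, 2, 3], 5: [2], 2: [1], 3: [1]}
lattice = extract_lattice(preds, 1, 6)
# lattice == [[1, 2, 5, 6], [1, 2, 6], [1, 3, 6]]
```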
In one embodiment, i is chosen relative to the number of intermediate frames between the first frame and the Nth frame. A token with a direct connection between frame 1 and frame 2 might otherwise be selected, which would truncate far too early, so the choice of i is restricted. For example, with 8 frames in total from the first frame to the Nth frame, i may be taken at the 4th or 5th frame.
For step S14, the path of the best recognition result is selected from the lattice according to the state probabilities; for example, path 1-2-5-6 is selected as the best recognition result within the truncated range from frame 1 to frame 3. This further limits the history of the current tokens and fixes the recognition result in advance.
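Selecting the best path from the extracted lattice then reduces to a maximum over path scores; the scores below are invented so that path 1-2-5-6 wins, matching the example:

```python
def best_path(lattice_paths, path_logp):
    """Pick the highest-scoring path from the extracted lattice.
    `path_logp` maps a path (as a tuple of token ids) to its accumulated
    log-probability; the names and scores here are hypothetical."""
    return max(lattice_paths, key=lambda p: path_logp[tuple(p)])

# Illustrative scores only: chosen so the Fig. 2 path 1-2-5-6 wins.
paths = [[1, 2, 5, 6], [1, 2, 6], [1, 3, 6]]
scores = {(1, 2, 5, 6): -3.0, (1, 2, 6): -4.5, (1, 3, 6): -5.0}
# best_path(paths, scores) == [1, 2, 5, 6]
```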
As this embodiment shows, because the part of the recognition result many frames before the current frame in the best path tends to change little during token passing, a truncation token is selected. The truncation token limits the history of the current tokens and fixes the recognition result in advance, avoiding the problem that recognition cannot be cut off and the result cannot be fixed because no silence is detected for a long time.
As an implementation manner, in this embodiment, after the path of the best recognition result from the first frame to the i-th frame is extracted from the lattice, the method further comprises:
clipping the tokens of each frame from the (i + 1)-th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the i-th frame.
In this embodiment, since the path of the best recognition result from the first frame to the i-th frame is now fixed, the tokens of each frame from the (i + 1)-th frame to the Nth frame are clipped anew. For example, in the steps above, the best recognition result from frame 1 to frame 3 is fixed; subsequently, the tokens of frame 4 and frame 5 are re-clipped.
As an embodiment, the clipping includes:
updating the probability of each token for each frame from the (i + 1)-th frame to the Nth frame; and/or
removing, from the historical token group, the paths that do not include the truncation token.
Each frame from the (i + 1)-th frame to the current frame (the Nth frame) is traversed in breadth-first order; the probability of each token is updated, and the paths that do not contain token 6 are removed from the histories of the tokens that can reach token 6. Starting from frame i + 2, the probability of any token that cannot be reached from token 6 is marked as infinitely small. Because the historical tokens have already been cut, far fewer tokens are traversed in this step than during token passing, so it causes little overhead. The search then continues with the next frame, and beam pruning removes the tokens with infinitely small probability during the search, thereby realizing the subsequent truncation and fixing the recognition result in advance. The clipping is performed frame by frame, and in the subsequent token-passing process, at least one token of each frame is a clipped token.
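The breadth-first clipping step can be sketched as follows; the successor links mirror the Fig. 2 example, but the exact encoding is an assumption:

```python
import math
from collections import deque

def clip_after_truncation(succ, trunc_id, later_tokens):
    """Breadth-first traversal rooted at the truncation token: every later
    token reachable from it keeps its score, while every unreachable token
    is marked -inf so that beam pruning removes it on the next frame."""
    reachable = {trunc_id}
    queue = deque([trunc_id])
    while queue:
        for nxt in succ.get(queue.popleft(), []):
            if nxt not in reachable:
                reachable.add(nxt)
                queue.append(nxt)
    return {t: (0.0 if t in reachable else -math.inf) for t in later_tokens}

# Fig. 2 example (hypothetical links): token 6 reaches token 7; tokens 6
# and 7 reach 9, 10, and 11 on the next frame; token 12 is unreachable.
succ = {6: [7, 9], 7: [9, 10, 11]}
marks = clip_after_truncation(succ, 6, {7, 9, 10, 11, 12})
# marks[12] == -inf; every other token keeps a finite score.
```

Because only tokens reachable from the truncation token are visited, this traversal touches far fewer tokens than the full token-passing search, matching the low-overhead claim above.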
As this embodiment shows, the clipping prepares for the subsequent truncation process: it shrinks the subsequent search and improves efficiency.
A complete, detailed description is given based on Fig. 2, in which (i) denotes the red path, (ii) the yellow path, (iii) the blue path, and (iv) the black path. A lattice is extracted from the paths from frame 1 to frame 3: the lattice node corresponding to token 1 is set as the start node and the lattice node corresponding to token 6 as the end node, via the FST connect operation (for the specific operation, see the fst/connect documentation on the OpenFst website). The lattice arcs corresponding to the red and yellow paths in frames 1 to 3 of Fig. 2 are retained, and the black path is deleted. The best result in the lattice corresponds to the red path in the figure. If the lattice is rescored later, the result may change to the one corresponding to the yellow path; but from this point on, the result for frames 1 to 3 is fixed.
Then, starting from frame 3 and proceeding frame by frame, a breadth-first traversal rooted at token number 6 is performed, and the tokens reachable from it are recorded; afterwards, the links into tokens that do not come from this reachable set are cut. For example, token 6 can only reach token 7 in frame 3, so the probability of token 7 is updated according to the probability on the path from token 6 to token 7; when token 7 is processed, the link from token 3 to token 7 is deleted. Next, tokens 9, 10, and 11 of frame 4, which are connected to tokens 6 and 7, are processed, and their probabilities are recalculated from the new probabilities of tokens 6 and 7. The connection from token 8 to token 12 is broken, and the score of token 12, which is not traversed, is marked as infinitely small; and so on. When frame 6 is processed, token 15 is clipped because its probability has been marked as infinitely small, while tokens 13 and 14 continue to pass down according to the token-passing algorithm.
Fig. 3 is a schematic structural diagram of a real-time speech recognition system according to an embodiment of the present invention, which can execute the real-time speech recognition method according to any of the above embodiments and is configured in a terminal.
The real-time speech recognition system provided by this embodiment comprises: a token determination program module 11, a best-path determination program module 12, a truncation program module 13, and a recognition program module 14.
The token determination program module 11 is configured to determine, in the token-passing process, at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token; the best-path determination program module 12 is configured to determine the path of the current best recognition result based on the state probabilities of the tokens of each frame, the path being formed by connecting at least N tokens from the first frame to the Nth frame; the truncation program module 13 is configured to select, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and to extract a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame; and the recognition program module 14 is configured to extract the path of the best recognition result from the first frame to the i-th frame from the lattice.
Further, the system is also configured to:
and clipping the token of each frame from the (i + 1) th frame to the N (N) th frame based on the determined path of the best recognition result from the first frame to the ith frame.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the real-time voice recognition method in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
As a non-volatile computer-readable storage medium, it may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the real-time speech recognition method of any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the real-time speech recognition method of any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication capability and primarily aim to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also provide mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, smart toys, and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises it.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A real-time speech recognition method, comprising:
in the token-passing process, determining at least one token for each frame from the first frame to the Nth frame of the captured real-time speech, wherein the initial token of the token-passing process is the start token;
determining the path of the current best recognition result based on the state probability of each token of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
selecting, in the path of the current best recognition result, the token of the i-th frame that has a direct connection to the (i + 1)-th frame as the truncation token, and extracting a lattice formed by the paths of multiple recognition results from the historical token group running from the start token to the truncation token of the i-th frame;
and extracting the path of the best recognition result from the first frame to the i-th frame from the lattice.
2. The method of claim 1, wherein i is related to the number of intermediate frames between the first frame and the Nth frame.
3. The method of claim 1, wherein after the path of the best recognition result from the first frame to the i-th frame is extracted from the lattice, the method further comprises:
clipping the tokens of each frame from the (i + 1)-th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the i-th frame.
4. The method of claim 3, wherein the clipping comprises:
updating the probability of each token for each frame from the (i + 1)-th frame to the Nth frame; and/or
removing, from the historical token group, the paths that do not include the truncation token.
5. The method of claim 3, wherein the clipping is performed frame by frame.
6. The method of claim 1, wherein, in the subsequent token-passing process, at least one token of each frame of the captured real-time speech is a clipped token.
7. A real-time speech recognition system comprising:
a token determining program module, configured to determine at least one token of each frame from a first frame to an Nth frame in collected real-time voice in the process of token passing, wherein the starting token of the token passing process is an initial token;
a best path determining program module, configured to determine a path of a current best recognition result based on the state probabilities of the tokens of each frame, wherein the path of the current best recognition result is formed by connecting at least N tokens from the first frame to the Nth frame;
a truncation program module, configured to select, as a truncation token, the token in the ith frame of the path of the current best recognition result which has a direct connection relation with the (i+1)th frame, and to extract a lattice formed by paths of a plurality of recognition results from a historical token group formed from the initial token to the truncation token of the ith frame;
and a recognition program module, configured to extract, from the lattice, the path of the best recognition result from the first frame to the ith frame.
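The four program modules of claim 7 can be read as a pipeline over the captured frames. The sketch below wires four caller-supplied callables in that order; every name here is illustrative, not from the patent.

```python
def recognize_partial(frames, token_module, path_module,
                      trunc_module, result_module):
    """Compose the four claim-7 modules into one partial-recognition step."""
    history = token_module(frames)            # module 1: tokens per frame
    best = path_module(history)               # module 2: current best path
    lattice, i = trunc_module(history, best)  # module 3: truncation + lattice
    return result_module(lattice, i)          # module 4: best partial result
```

Keeping the modules as separate callables mirrors the claim's decomposition and lets each stage be tested or replaced independently.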
8. The system of claim 7, wherein the system is further configured to:
and clipping the tokens of each frame from the (i+1)th frame to the Nth frame based on the determined path of the best recognition result from the first frame to the ith frame.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any of claims 1-6.
10. A storage medium on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202011207353.0A 2020-11-03 2020-11-03 Real-time voice recognition method and system Active CN112259082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011207353.0A CN112259082B (en) 2020-11-03 2020-11-03 Real-time voice recognition method and system

Publications (2)

Publication Number Publication Date
CN112259082A CN112259082A (en) 2021-01-22
CN112259082B CN112259082B (en) 2022-04-01

Family

ID=74268264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011207353.0A Active CN112259082B (en) 2020-11-03 2020-11-03 Real-time voice recognition method and system

Country Status (1)

Country Link
CN (1) CN112259082B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08508583A (en) * 1993-03-31 1996-09-10 ブリテイッシュ・テレコミュニケーションズ・パブリック・リミテッド・カンパニー Connection speech recognition
US6272455B1 (en) * 1997-10-22 2001-08-07 Lucent Technologies, Inc. Method and apparatus for understanding natural language
CN1201284C (en) * 2002-11-15 2005-05-11 中国科学院声学研究所 Rapid decoding method for voice identifying system
CN104157285B (en) * 2013-05-14 2016-01-20 腾讯科技(深圳)有限公司 Audio recognition method, device and electronic equipment
US9530404B2 (en) * 2014-10-06 2016-12-27 Intel Corporation System and method of automatic speech recognition using on-the-fly word lattice generation with word histories
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
CN110047477B (en) * 2019-04-04 2021-04-09 北京清微智能科技有限公司 Optimization method, equipment and system of weighted finite state converter
CN111862943B (en) * 2019-04-30 2023-07-25 北京地平线机器人技术研发有限公司 Speech recognition method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112259082A (en) 2021-01-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

GR01 Patent grant