CN111477217B - Command word recognition method and device

Command word recognition method and device

Info

Publication number
CN111477217B
CN111477217B
Authority
CN
China
Prior art keywords
state node
state
words
command
compound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010268839.9A
Other languages
Chinese (zh)
Other versions
CN111477217A (en)
Inventor
张猛
冯大航
陈孝良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd
Priority to CN202010268839.9A
Publication of CN111477217A
Application granted
Publication of CN111477217B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/083 Recognition networks
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a command word recognition method and device. The method includes: acquiring a speech frame to be recognized; and decoding the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize the command word corresponding to the speech frame. In each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word. The state node groups include compound state node groups; a compound state node group contains compound state nodes whose out-degree and/or in-degree is not less than 2, so that the multiple paths of the compound state node group correspond to multiple command words.

Description

Command word recognition method and device
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method and apparatus for recognizing command words.
Background
Speech recognition technology has been widely used across industries, and command word recognition is an important branch of its application. Command word recognition means recognizing a command that a person speaks to a device. For example, in a smart elevator scenario, a person says "go to the fifth floor" to the device, the device recognizes the command word "go to the fifth floor", and a program in the device executes the command. Command word recognition is generally based on a decoding network, to which a decoding algorithm is applied to finally recognize the command word; the decoding network consists of several groups of logically connected state nodes in a program, and each command word corresponds to one group of state nodes.
At present, although decoding algorithms are increasingly mature and command word recognition has been implemented on a variety of devices, storage space remains a scarce resource. Storing the state nodes of every command word of the decoding network has a noticeable impact on storage space, and as the number of command words grows, the storage space can hardly support storing all of their state nodes. This is a problem to be solved urgently.
Disclosure of Invention
The application provides a command word recognition method and device, which solve the prior-art problem that the storage space can hardly support storing the state nodes of the command words.
In a first aspect, the application provides a command word recognition method, including: acquiring a speech frame to be recognized; and decoding the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized. In each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word; the state node groups include compound state node groups; a compound state node group contains compound state nodes whose out-degree and/or in-degree is not less than 2, so that the multiple paths of the compound state node group correspond to multiple command words.
In the above method, since the state node groups include compound state node groups and the out-degree and/or in-degree of a compound state node is not less than 2, multiple paths of a compound state node group correspond to multiple command words. In other words, a compound state node is a state node multiplexed by several command words, which saves state nodes and therefore storage space.
Optionally, the compound state node is a first state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the first N words of the multiple command words are identical, N being a positive integer; the state nodes from the head state node of the compound state node group to the first state node correspond to the first N words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
In this manner, by multiplexing the state nodes from the head state node to the first state node, the first N words of the multiple command words are stored only once, and the judgment control condition at the first state node makes the decoding jump from the first state node to different subsequent state nodes corresponding to the different command words, thereby saving storage space.
Optionally, the compound state node is a second state node; the in-degree of the second state node is not less than 2, and its out-degree is 1; the last M words of the multiple command words are identical, M being a positive integer; and the state nodes from the second state node to the tail state node of the compound state node group correspond to the last M words of the multiple command words.
In this manner, by multiplexing the state nodes from the second state node to the tail state node, the last M words of the multiple command words are stored only once, thereby saving storage space.
Optionally, the compound state node group includes a first state node and a second state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the in-degree of the second state node is not less than 2, and its out-degree is 1; the second state node precedes the first state node; K consecutive words in the middle of the multiple command words are identical, K being an integer; the state nodes from the second state node to the first state node of the compound state node group correspond to these K words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
In this manner, by multiplexing the state nodes from the second state node to the first state node, the K consecutive middle words of the multiple command words are stored only once, and the judgment control condition at the first state node makes the decoding jump to different subsequent state nodes corresponding to the different command words, thereby saving storage space.
Optionally, the command word recognition method is executed by a terminal device, and the terminal device includes a single-chip microcomputer (MCU) device.
In this manner, because the storage space of a single-chip microcomputer device is small, applying the command word recognition method to such a device yields a larger relative reduction in storage usage, so the saving is more noticeable.
Optionally, decoding the speech frame to be recognized based on the decoding network and the preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized, includes: identifying, based on the decoding network and the preset decoding algorithm, the state scores on each path from the head state node to the tail state node in each state node group; and determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized.
In this manner, the state scores on each path are identified based on the decoding network and the preset decoding algorithm, and the command word corresponding to the path with the highest state score is determined, which provides a way of determining the command word from state scores.
Optionally, before determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized, the method further includes: determining that the state score of at least one of the paths is greater than a preset threshold. Determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized then includes: determining the command word corresponding to the path with the highest state score from the at least one path whose state score is greater than the preset threshold.
In this manner, determining that the state score of at least one of the paths is greater than the preset threshold ensures that the command words of the at least one path meet a certain accuracy, so the command word corresponding to the path with the highest state score can be determined directly from those command words, and the command word of the speech frame to be recognized is determined on the basis of that accuracy.
In a second aspect, the application provides a command word recognition apparatus, including: an acquisition module, configured to acquire a speech frame to be recognized; and a decoding module, configured to decode the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized. In each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word; the state node groups include compound state node groups; a compound state node group contains compound state nodes whose out-degree and/or in-degree is not less than 2, so that the multiple paths of the compound state node group correspond to multiple command words.
Optionally, the compound state node is a first state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the first N words of the multiple command words are identical, N being a positive integer; the state nodes from the head state node of the compound state node group to the first state node correspond to the first N words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
Optionally, the compound state node is a second state node; the in-degree of the second state node is not less than 2, and its out-degree is 1; the last M words of the multiple command words are identical, M being a positive integer; and the state nodes from the second state node to the tail state node of the compound state node group correspond to the last M words of the multiple command words.
Optionally, the compound state node group includes a first state node and a second state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the in-degree of the second state node is not less than 2, and its out-degree is 1; the second state node precedes the first state node; K consecutive words in the middle of the multiple command words are identical, K being an integer; the state nodes from the second state node to the first state node of the compound state node group correspond to these K words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
Optionally, the apparatus is a single-chip microcomputer device.
Optionally, the decoding module is specifically configured to: identify, based on the decoding network and the preset decoding algorithm, the state scores on each path from the head state node to the tail state node in each state node group; and determine the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized.
Optionally, the decoding module is further configured to: determine that the state score of at least one of the paths is greater than a preset threshold; and the decoding module is specifically configured to: determine the command word corresponding to the path with the highest state score from the at least one path whose state score is greater than the preset threshold.
For the advantages of the second aspect and of its optional apparatuses, reference may be made to the advantages of the first aspect and of its optional methods, which are not repeated here.
In a third aspect, the application provides a computer device comprising a program or instructions which, when executed, perform the method of the first aspect and its optional methods.
In a fourth aspect, the application provides a storage medium comprising a program or instructions which, when executed, perform the method of the first aspect and its optional methods.
Drawings
FIG. 1 is a flowchart illustrating steps of a command word recognition method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a command word recognition device according to an embodiment of the present application.
Detailed Description
For a better understanding of the above technical solutions, they are described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific features in the embodiments and examples of the application are detailed descriptions of the technical solutions of the application rather than limitations on them, and the technical features in the embodiments and examples may be combined with each other where there is no conflict.
Command word:
A command word is a command that a person speaks to a device; a program in the device recognizes the command and then performs the corresponding action. Command words include offline command words and online command words. After a set of command words is generated, not all of them are necessarily deployed on every device. For offline command words, the command word recognition function is not essential to the device: the dynamic library and model for these command words do not have to be present at system start-up, and their absence does not affect the normal use of other system functions. If the corresponding dynamic library and model exist, the system initializes the offline command word recognition function; otherwise it does not. By contrast, online command words, such as the command words of the wake-up function, are essential, since without them the system cannot work normally.
Decoding network:
Recognizing a command word requires decoding, and decoding is based on a decoding network to which a decoding algorithm is applied to finally recognize the command word. The decoding network consists of a number of logically connected state nodes in a program; the nodes form a graph, and decoding jumps between different state nodes.
At present, although the decoding algorithms for recognizing command words in speech frames are mature and command word recognition has been implemented on a variety of devices, as the number of command words increases the storage space can hardly support storing the state nodes of the command words, which is a problem to be solved.
To this end, as shown in fig. 1, the present application provides a command word recognition method.
Step 101: acquire a speech frame to be recognized.
Step 102: decode the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized.
Regarding steps 101 to 102, it should be noted that a speech segment to be recognized contains many speech frames; several frames correspond to one state, several states (for example, three) combine into a phoneme, and several phonemes combine into an English word (or a Chinese character or phrase). That is, the speech recognition result of a segment to be recognized can be obtained by matching each speech frame against the states. The decoding network is essentially a graph of state nodes, each of which is a phoneme of a word (or a state of a phoneme). The decoding network comprises several state node groups (in practice, paths of state nodes), and each group corresponds to a word (or a Chinese character or phrase), so the decoding network is the basis of speech recognition. The rule for judging, given the existing decoding network and the acquired speech frames, whether each frame matches a state is the decoding algorithm. Furthermore, the speech recognition process is essentially searching the decoding network, according to the decoding algorithm, for the best matching path (i.e. a matching set of state nodes) such that the probability that the speech frames correspond to that path is the greatest; this is called "decoding".
In each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word. Each state node group comprises several connected state nodes; the degree of a state node consists of an in-degree and an out-degree, and no state node has degree 0. The head state node has in-degree 0 but non-zero out-degree; the tail state node has out-degree 0 but non-zero in-degree. Each word in a command word may correspond to one or more state nodes, for example three state nodes per word.
A state node group may have only one path or multiple paths. If neither the in-degree nor the out-degree of any state node in a group is greater than 1, the group obviously has only one path (it may be called a single state node group). Correspondingly, the state node groups may also include compound state node groups: a compound state node group contains compound state nodes whose out-degree and/or in-degree is not less than 2, so that the multiple paths of the compound state node group correspond to multiple command words.
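As an illustration of this structure, the following is a minimal Python sketch (not the patented implementation; the class, field and node names are assumptions made only for illustration) that represents a decoding network as state nodes with directed edges and identifies the compound state nodes by their degrees. The edge lists built in the examples below could be loaded into such a graph via add_edge.

```python
from collections import defaultdict

class DecodingNetwork:
    """Directed graph of state nodes; an illustrative sketch only."""

    def __init__(self):
        self.succ = defaultdict(list)   # state node -> successor state nodes
        self.pred = defaultdict(list)   # state node -> predecessor state nodes
        self.nodes = set()

    def add_edge(self, src, dst):
        self.succ[src].append(dst)
        self.pred[dst].append(src)
        self.nodes.update((src, dst))

    def out_degree(self, node):
        return len(self.succ[node])

    def in_degree(self, node):
        return len(self.pred[node])

    def compound_nodes(self):
        # A compound state node has out-degree and/or in-degree >= 2,
        # i.e. it lies on the paths of several command words.
        return [n for n in self.nodes
                if self.out_degree(n) >= 2 or self.in_degree(n) >= 2]
```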
In an optional embodiment (hereinafter embodiment (1)) for the case where the out-degree of the compound state node is not less than 2, the compound state node is a first state node; the in-degree of the first state node is 1 and its out-degree is not less than 2; the first N words of the command words are identical, N being a positive integer; the state nodes from the head state node of the compound state node group to the first state node correspond to the first N words of the command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
For example, suppose each word of a command word occupies three state nodes when the decoding network is constructed. Take the two command words "call Zhang San" and "call Li Si": the first 2 words of these command words are the same, "call", i.e. N=2. The 6 state nodes formed by the two words of "call" can therefore be multiplexed: the last of these 6 state nodes (the 3rd state node of the second word of "call") is set as the first state node, whose out-degree is 2 and in-degree is 1. One jump direction of the first state node then leads to the 6 state nodes formed by "Zhang San", and the other to the 6 state nodes formed by "Li Si"; which jump is taken is decided by the judgment control condition at the first state node. With more command words, only the out-degree of the first state node needs to be increased, and the rest can be deduced in the same way, which is not repeated here.
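A rough sketch of this example, under the assumption of three state nodes per word; the helper names make_nodes and chain and the node labels are invented for illustration and are not identifiers from the disclosure. It also counts the nodes to show the saving.

```python
def make_nodes(label, n_words, states_per_word=3):
    # One chain of state nodes for a block of n_words words.
    return [f"{label}#{i}" for i in range(n_words * states_per_word)]

def chain(edges, nodes):
    for a, b in zip(nodes, nodes[1:]):
        edges.append((a, b))

edges = []
call  = make_nodes("call", 2)        # shared prefix: 2 words -> 6 state nodes
zhang = make_nodes("Zhang San", 2)   # 6 state nodes
li    = make_nodes("Li Si", 2)       # 6 state nodes
for block in (call, zhang, li):
    chain(edges, block)

# call[-1] is the first state node: in-degree 1, out-degree 2; the judgment
# control condition of the decoding algorithm decides which branch to take.
edges.append((call[-1], zhang[0]))
edges.append((call[-1], li[0]))

nodes = {n for edge in edges for n in edge}
print(len(nodes))   # 18 state nodes, versus 2 x 4 x 3 = 24 without multiplexing
```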
In an optional embodiment (hereinafter embodiment (2)) for the case where the in-degree of the compound state node is not less than 2, the compound state node is a second state node; the in-degree of the second state node is not less than 2 and its out-degree is 1; the last M words of the command words are identical, M being a positive integer; and the state nodes from the second state node to the tail state node of the compound state node group correspond to the last M words of the command words.
For example, suppose again that each word of a command word occupies three state nodes when the decoding network is constructed, and consider the two command words "read information" and "send information". The last 2 words of these command words are the same, "information", i.e. M=2, so the 6 state nodes formed by the two words of "information" can be multiplexed. The first of these 6 state nodes is taken as the second state node, whose in-degree is 2 and out-degree is 1. One jump into the second state node comes from the 6 state nodes formed by "read", the other from the 6 state nodes formed by "send", and the second state node does not need to judge from which direction the jump arrives, so no judgment control condition needs to be set on it. With more command words, only the in-degree of the second state node needs to be increased, and the rest can be deduced in the same way, which is not repeated here.
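Sketched as an explicit adjacency list (node names such as "info#0" are again illustrative assumptions), the key property of this embodiment is that the second state node has in-degree 2 but out-degree 1, so nothing needs to be decided there:

```python
# Partial adjacency list around the shared "information" block; only the
# edges relevant to the second state node are shown.
succ = {
    "read#5": ["info#0"],   # last state node of "read" -> shared block
    "send#5": ["info#0"],   # last state node of "send" -> shared block
    "info#0": ["info#1"],   # the shared nodes then continue as a single chain
}
in_degree  = sum("info#0" in dsts for dsts in succ.values())
out_degree = len(succ["info#0"])
print(in_degree, out_degree)   # 2 1 -> only one way forward, nothing to decide
```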
Embodiments (1) and (2) may apply at the same time and be combined with each other. For example, in an elevator scenario the command words are "go to floor XX", such as "go to the first floor" and "go to the second floor"; "go to" and "floor" recur in every command word, so their state nodes need only be constructed once in the decoding network. In this case N=1 and M=1; the first state node is the last state node of "go to", and the second state node is the first state node of "floor".
Further, words may also recur within "XX": for example, "ten" in "go to the tenth floor" and "ten" in "go to the eleventh floor" are the same, so the state nodes of "ten" need only be constructed once rather than repeatedly. Here the last state node of "ten" may be used as a first state node and the first state node of "one" as a second state node. The last state node of "ten" then has two possible jumps, for example to "floor" or to "one"; in this case only a judgment control condition in the program is needed, and no extra storage space is consumed.
In another optional embodiment (hereinafter embodiment (3)), the compound state node group includes a first state node and a second state node; the in-degree of the first state node is 1 and its out-degree is not less than 2; the in-degree of the second state node is not less than 2 and its out-degree is 1; the second state node precedes the first state node; K consecutive words in the middle of the command words are identical, K being an integer; the state nodes from the second state node to the first state node of the compound state node group correspond to these K words of the command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
For example, suppose each word of a command word occupies three state nodes, and consider the two command words "shut down after printing the file" and "light the screen before playing the file". The middle 2 words of these command words, "file", are identical, i.e. K=2, so the 6 state nodes formed by the two words of "file" can be multiplexed. The first of these 6 state nodes is taken as the second state node (in-degree 2, out-degree 1), and the last as the first state node. On the one hand, one jump into the second state node comes from the 6 state nodes formed by "print" and the other from the 6 state nodes formed by "play"; on the other hand, one jump out of the first state node leads to the 6 state nodes formed by "shut down" and the other to the 6 state nodes formed by "light screen", and which jump is taken is decided by the judgment control condition at the first state node. The second state node does not need to judge from which direction a jump arrives, so no judgment control condition needs to be set on it. With more command words, only the out-degree of the first state node and/or the in-degree of the second state node needs to be increased, and the rest can be deduced in the same way, which is not repeated here.
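The following sketch combines both node types for this example, again assuming three state nodes per word and using invented helper and label names; it builds the shared middle block "file" once and checks the resulting node count.

```python
def make_nodes(label, n_words, states_per_word=3):
    return [f"{label}#{i}" for i in range(n_words * states_per_word)]

def chain(edges, nodes):
    for a, b in zip(nodes, nodes[1:]):
        edges.append((a, b))

blocks = {name: make_nodes(name, 2)
          for name in ("print", "play", "file", "shut down", "light screen")}
edges = []
for nodes in blocks.values():
    chain(edges, nodes)

file_nodes = blocks["file"]
edges += [
    (blocks["print"][-1], file_nodes[0]),        # second state node: in-degree 2
    (blocks["play"][-1],  file_nodes[0]),
    (file_nodes[-1], blocks["shut down"][0]),    # first state node: out-degree 2,
    (file_nodes[-1], blocks["light screen"][0]), # decided by the judgment condition
]

all_nodes = {n for edge in edges for n in edge}
print(len(all_nodes))   # 30 state nodes, versus 2 x 6 x 3 = 36 without sharing "file"
```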
The command word recognition method of steps 101 to 102 may be performed by a terminal device. The terminal device may be a device with little storage space (e.g. little memory), a typical example being a single-chip microcomputer device. On such a device the effect of each state node on the overall storage space is significant, so after redundant state nodes are removed the relative reduction in storage usage is larger, the saving is more noticeable, and multiplexing compound state nodes is all the more meaningful.
Further, from a cost-control point of view, some low-end devices are cheap and configured with little memory, for example only 512 KB, so optimizing memory usage matters greatly. Low-end devices are typically shipped in very large volumes: if the memory is not optimized, 1 MB of memory may have to be provided in hardware, whereas 512 KB may suffice after optimization. Assuming each device thereby saves 1 yuan, with a huge shipment volume the total saving is considerable.
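A back-of-envelope calculation of the saving, using the elevator example above; the number of command words, the three-states-per-word assumption and especially the bytes per state node are illustrative assumptions, not figures from the disclosure.

```python
N_COMMANDS      = 10    # "go to floor 1" ... "go to floor 10" (assumed command set)
WORDS_PER_CMD   = 3     # "go to" + floor number + "floor", one word each
STATES_PER_WORD = 3
BYTES_PER_NODE  = 64    # purely an assumed per-node storage cost

unshared = N_COMMANDS * WORDS_PER_CMD * STATES_PER_WORD            # 90 nodes
# Sharing "go to" and "floor" across all commands leaves only the floor
# number unshared (first-level sharing only; the "ten"/"eleven" overlap is ignored):
shared = STATES_PER_WORD + N_COMMANDS * STATES_PER_WORD + STATES_PER_WORD  # 36 nodes

print(unshared * BYTES_PER_NODE, shared * BYTES_PER_NODE)   # 5760 vs 2304 bytes
# The relative saving grows with the number of command words, which is what
# makes the multiplexing worthwhile on low-memory devices.
```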
It should be noted that, in an optional embodiment, step 102 may be performed as follows:
Step (a): identify, based on the decoding network and the preset decoding algorithm, the state scores on each path from the head state node to the tail state node in each state node group.
Step (b): determine the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized.
The decoding network here may be any decoding network described in steps 101 to 102 and the optional embodiments above, or a combination thereof, and the preset decoding algorithm may be, for example, the Viterbi decoding algorithm.
Before step (b), it may first be determined that the state score of at least one of the paths is greater than a preset threshold; the state scores of this at least one path are then compared and the path with the highest state score is found. After determining that the state score of at least one of the paths is greater than the preset threshold, step (b) may be performed as follows: determine the command word corresponding to the path with the highest state score from the at least one path whose state score is greater than the preset threshold.
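A minimal sketch of steps (a) and (b) together with the threshold check; it assumes the per-path state scores have already been accumulated by the decoding algorithm (for example Viterbi), and the function name, scores, threshold and command words shown are illustrative values only.

```python
from typing import Dict, Optional

def pick_command(path_scores: Dict[str, float], threshold: float) -> Optional[str]:
    """path_scores maps each path's command word to its accumulated state score."""
    # Keep only the paths whose state score exceeds the preset threshold ...
    candidates = {word: score for word, score in path_scores.items()
                  if score > threshold}
    if not candidates:
        return None          # no path is accurate enough -> no command recognized
    # ... then take the command word of the highest-scoring remaining path.
    return max(candidates, key=candidates.get)

print(pick_command({"go to the fifth floor": 0.91,
                    "go to the first floor": 0.42},
                   threshold=0.6))          # -> go to the fifth floor
```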
As shown in Fig. 2, the application provides a command word recognition apparatus, including: an acquisition module 201, configured to acquire a speech frame to be recognized; and a decoding module 202, configured to decode the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized. In each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word; the state node groups include compound state node groups; a compound state node group contains compound state nodes whose out-degree and/or in-degree is not less than 2, so that the multiple paths of the compound state node group correspond to multiple command words.
Optionally, the compound state node is a first state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the first N words of the multiple command words are identical, N being a positive integer; the state nodes from the head state node of the compound state node group to the first state node correspond to the first N words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
Optionally, the compound state node is a second state node; the in-degree of the second state node is not less than 2, and its out-degree is 1; the last M words of the multiple command words are identical, M being a positive integer; and the state nodes from the second state node to the tail state node of the compound state node group correspond to the last M words of the multiple command words.
Optionally, the compound state node group includes a first state node and a second state node; the in-degree of the first state node is 1, and its out-degree is not less than 2; the in-degree of the second state node is not less than 2, and its out-degree is 1; the second state node precedes the first state node; K consecutive words in the middle of the multiple command words are identical, K being an integer; the state nodes from the second state node to the first state node of the compound state node group correspond to these K words of the multiple command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
Optionally, the apparatus is a single-chip microcomputer device.
Optionally, the decoding module 202 is specifically configured to: identify, based on the decoding network and the preset decoding algorithm, the state scores on each path from the head state node to the tail state node in each state node group; and determine the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized.
Optionally, the decoding module 202 is further configured to: determine that the state score of at least one of the paths is greater than a preset threshold; and the decoding module 202 is specifically configured to: determine the command word corresponding to the path with the highest state score from the at least one path whose state score is greater than the preset threshold.
An embodiment of the application provides a computer device including a program or instructions which, when executed, perform the command word recognition method and any of the optional methods provided by the embodiments of the application.
An embodiment of the application provides a storage medium including a program or instructions which, when executed, perform the command word recognition method and any of the optional methods provided by the embodiments of the application.
Finally, it should be noted that: it will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A method of command word recognition, the method comprising:
acquiring a speech frame to be recognized;
decoding the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize a command word corresponding to the speech frame to be recognized;
wherein in each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word; each state node group comprises a compound state node group; the compound state node group comprises a plurality of compound state nodes, and the out-degree and/or in-degree of the compound state nodes is not less than 2, so that a plurality of paths of the compound state node group correspond to a plurality of command words; and a plurality of nodes from the head state node of the compound state node group to a compound state node whose in-degree is 1 correspond to identical words at the front of the plurality of command words, a plurality of nodes from a compound state node whose out-degree is 1 to the tail state node of the compound state node group correspond to identical words at the back of the plurality of command words, and a plurality of nodes from the compound state node whose out-degree is 1 to the compound state node whose in-degree is 1 correspond to identical words in the middle of the plurality of command words.
2. The method of claim 1, wherein the compound state node is a first state node; the in-degree of the first state node is 1, and the out-degree of the first state node is not less than 2; the first N words of the plurality of command words are identical, N being a positive integer; a plurality of state nodes from the head state node of the compound state node group to the first state node correspond to the first N words of the plurality of command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
3. The method of claim 1, wherein the compound state node is a second state node; the in-degree of the second state node is not less than 2, and the out-degree of the second state node is 1; the last M words of the plurality of command words are identical, M being a positive integer; and a plurality of state nodes from the second state node to the tail state node of the compound state node group correspond to the last M words of the plurality of command words.
4. The method of claim 1, wherein the compound state node group comprises a first state node and a second state node; the in-degree of the first state node is 1, and the out-degree of the first state node is not less than 2; the in-degree of the second state node is not less than 2, and the out-degree of the second state node is 1; the second state node precedes the first state node; K consecutive words in the middle of the plurality of command words are identical, K being an integer; a plurality of state nodes from the second state node to the first state node of the compound state node group correspond to the K words of the plurality of command words; and the preset decoding algorithm sets a judgment control condition at the first state node.
5. The method according to any one of claims 1 to 4, wherein the command word recognition method is performed by a terminal device, the terminal device comprising a single chip device.
6. The method according to any one of claims 1 to 4, wherein decoding the speech frame to be recognized based on the decoding network and the preset decoding algorithm, so as to recognize the command word corresponding to the speech frame to be recognized, comprises:
identifying, based on the decoding network and the preset decoding algorithm, the state scores on each path from the head state node to the tail state node in each state node group; and
determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized.
7. The method of claim 6, wherein before determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized, the method further comprises:
determining that the state score of at least one of the paths is greater than a preset threshold;
and determining the command word corresponding to the path with the highest state score among the paths as the command word corresponding to the speech frame to be recognized comprises:
determining the command word corresponding to the path with the highest state score from the at least one path whose state score is greater than the preset threshold.
8. A command word recognition apparatus, comprising:
an acquisition module, configured to acquire a speech frame to be recognized; and
a decoding module, configured to decode the speech frame to be recognized based on a decoding network and a preset decoding algorithm, so as to recognize a command word corresponding to the speech frame to be recognized; wherein in each state node group of the decoding network, each path from the head state node to the tail state node uniquely corresponds to a command word; each state node group comprises a plurality of compound state node groups; the compound state node group comprises compound state nodes, and the out-degree and/or in-degree of the compound state nodes is not less than 2, so that a plurality of paths of the compound state node group correspond to a plurality of command words; and a plurality of nodes from the head state node of the compound state node group to a compound state node whose in-degree is 1 correspond to identical words at the front of the plurality of command words, a plurality of nodes from a compound state node whose out-degree is 1 to the tail state node of the compound state node group correspond to identical words at the back of the plurality of command words, and a plurality of nodes from the compound state node whose out-degree is 1 to the compound state node whose in-degree is 1 correspond to identical words in the middle of the plurality of command words.
9. A computer device comprising a program or instructions which, when executed, performs the method of any of claims 1 to 7.
10. A storage medium comprising a program or instructions which, when executed, perform the method of any one of claims 1 to 7.
CN202010268839.9A 2020-04-08 2020-04-08 Command word recognition method and device Active CN111477217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010268839.9A CN111477217B (en) 2020-04-08 2020-04-08 Command word recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010268839.9A CN111477217B (en) 2020-04-08 2020-04-08 Command word recognition method and device

Publications (2)

Publication Number Publication Date
CN111477217A CN111477217A (en) 2020-07-31
CN111477217B true CN111477217B (en) 2023-10-10

Family

ID=71750190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010268839.9A Active CN111477217B (en) 2020-04-08 2020-04-08 Command word recognition method and device

Country Status (1)

Country Link
CN (1) CN111477217B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103971686B (en) * 2013-01-30 2015-06-10 腾讯科技(深圳)有限公司 Method and system for automatically recognizing voice

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014101717A1 (en) * 2012-12-28 2014-07-03 安徽科大讯飞信息科技股份有限公司 Voice recognizing method and system for personalized user information
CN105321518A (en) * 2014-08-05 2016-02-10 中国科学院声学研究所 Rejection method for low-resource embedded speech recognition
CN110046276A (en) * 2019-04-19 2019-07-23 北京搜狗科技发展有限公司 The search method and device of keyword in a kind of voice
CN110322884A (en) * 2019-07-09 2019-10-11 科大讯飞股份有限公司 A kind of slotting word method, apparatus, equipment and the storage medium of decoding network
CN110827802A (en) * 2019-10-31 2020-02-21 苏州思必驰信息科技有限公司 Speech recognition training and decoding method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘加; 陈谐; 单煜翔; 史永哲. Construction of a compact dynamic network for a large-vocabulary continuous speech recognition engine. Journal of Tsinghua University (Science and Technology), 2012, No. 11, sections 1-3. *

Also Published As

Publication number Publication date
CN111477217A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN102971787B (en) Method and system for endpoint automatic detection of audio record
CN109769115B (en) Method, device and equipment for optimizing intelligent video analysis performance
US11302303B2 (en) Method and device for training an acoustic model
US7058575B2 (en) Integrating keyword spotting with graph decoder to improve the robustness of speech recognition
CN103594089A (en) Voice recognition method and electronic device
CN110767236A (en) Voice recognition method and device
CN111477217B (en) Command word recognition method and device
CN109375956A (en) A kind of method of reboot operation system, logical device and control equipment
CN110556102A (en) intention recognition and execution method, device, vehicle-mounted voice conversation system and computer storage medium
CN113192501B (en) Instruction word recognition method and device
CN112509557B (en) Speech recognition method and system based on non-deterministic word graph generation
CN113535366A (en) High-performance distributed combined multi-channel video real-time processing method
CN106775906A (en) Business flow processing method and device
CN113205809A (en) Voice wake-up method and device
CN116189677A (en) Method, system, equipment and storage medium for identifying multi-model voice command words
CN108847236A (en) Method and device for receiving voice information and method and device for analyzing voice information
CN111128172B (en) Voice recognition method, electronic equipment and storage medium
CN112269863A (en) Man-machine conversation data processing method and system of intelligent robot
US20030110032A1 (en) Fast search in speech recognition
Uzan et al. Greed is all you need: An evaluation of tokenizer inference methods
CN115831109A (en) Voice awakening method and device, storage medium and electronic equipment
CN112766470A (en) Feature data processing method, instruction sequence generation method, device and equipment
CN115294974A (en) Voice recognition method, device, equipment and storage medium
CN106469553A (en) Audio recognition method and device
CN111970311B (en) Session segmentation method, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant