CN115831100B - Voice command word recognition method, device, equipment and storage medium

Info

Publication number: CN115831100B
Application number: CN202310149046.9A
Authority: CN
Inventors: 李�杰; 王广新; 杨汉丹
Original assignee: Shenzhen Youjie Zhixin Technology Co ltd
Current assignee: Shenzhen Youjie Zhixin Technology Co ltd
Priority date: 2023-02-22
Filing date: 2023-02-22
Publication date: 2023-05-05
Anticipated expiration: 2043-02-22
Also published as: CN115831100A

Abstract

The application relates to the technical field of voice recognition and provides a voice command word recognition method, device, equipment and storage medium. The method comprises the following steps: monitoring a voice command word sent by a user; recognizing the voice command word sent by the user to obtain a plurality of voice command word recognition results to be determined and a score for each of them; judging whether the recognition result of the voice command word to be determined with the first (highest) score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word; if yes, determining the preset voice command word as the voice command word sent by the user; if not, determining the voice command word sent by the user in combination with a method of delaying the voice command word recognition time. The method and the device improve the recognition accuracy of voice command words with the same prefix.

Description

Voice command word recognition method, device, equipment and storage medium

Technical Field

The present disclosure relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for recognizing a speech command word.

Background

When a wake-up word and command word model is in use, the voice of the user is monitored in real time, and feedback is given when specific words are detected. To match people's everyday usage habits, some command words are set with the same prefix, which reduces how distinguishable they are and makes them easy to misrecognize. Volume up, volume down and volume medium are taken as an example below.

'volume up': 'V_AA_L_Y_UW_M_AH_P'

'volume down': 'V_AA_L_Y_UW_M_D_AW_N'

'volume medium': 'V_AA_L_Y_UW_M_M_IY_D_IY_AH_M'

The left side is the keyword and the right side is its corresponding phonemes, separated by underscores. In terms of phoneme proportion, the shares of up, down and medium in the whole keyword increase in that order. The misrecognition shows up as follows: once "volume" has been pronounced, the current recognition score may already exceed the judgment threshold of volume up, causing a misrecognition even though the user intended to say volume down or volume medium.

Disclosure of Invention

The present application aims to provide a method, a device, equipment and a storage medium for recognizing a voice command word, which aim to solve the technical problem that the voice command word with the same prefix is easy to be recognized by mistake.

In a first aspect, an embodiment of the present application provides a method for recognizing a voice command word, including:

monitoring a voice command word sent by a user;

recognizing the voice command words sent by the user to obtain a plurality of voice command word recognition results to be determined and scores of the voice command word recognition results to be determined;

judging whether the recognition result of the voice command word to be determined with the first score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word;

if yes, determining the preset voice command word as the voice command word sent by the user;

if not, determining the voice command word sent by the user by combining the method for delaying the voice command word recognition time.

Further, the step of determining the voice command word sent by the user by combining the method for delaying the voice command word recognition time comprises the following steps:

judging whether the score of the first voice command word recognition result to be determined is larger than a corresponding confidence threshold value or not;

if yes, determining the recognition result of the voice command word to be determined with the first score as the voice command word sent by the user;

if not, judging whether the score of the recognition result of the voice command word to be determined with the second score is larger than the corresponding judgment threshold value;

if not, re-identifying;

if yes, updating the recognition result of the voice command word to be determined with the first score by using the recognition result of the voice command word to be determined with the second score, and judging whether the updated recognition result of the voice command word to be determined with the first score is the preset voice command word;

if not, determining the voice command word sent by the user by delaying the voice command word recognition time.

Further, the determining the voice command word sent by the user by delaying the voice command word recognition time includes:

judging whether the recognized voice command word recognition result is the preset voice command word and whether the score of the recognized voice command word recognition result is larger than the corresponding judgment threshold value or not in the voice command word delay recognition time;

if not, storing and updating the score of the voice command word recognition result in real time in a mode of replacing the low score with the high score, and judging whether the score of the voice command word recognition result to be determined with the first score is larger than the corresponding confidence threshold value or not when the delay time is over;

if not, re-identifying.

Further, before the step of determining whether the recognized voice command word recognition result is the preset voice command word and whether the score of the recognized voice command word recognition result is greater than the corresponding determination threshold value, the method includes:

matching the updated voice command word recognition result to be determined with the first score against the voice command words in the voice command word library to obtain a matched voice command word;

searching delay recognition time corresponding to the matched voice command word in a voice command word-delay recognition time mapping library.

Further, the delay recognition time corresponding to each voice command word in the voice command word-delay recognition time mapping library is determined according to the pronunciation time of the voice command word itself and the pronunciation time of the voice command word that has the same prefix and the longest pronunciation.

Further, the plurality of voice command word recognition results to be determined comprise a plurality of voice command word recognition results with the same prefix.

Further, the voice command word recognition result to be determined includes any two or three of volume up, volume down and volume medium.

In a second aspect, an embodiment of the present application provides a voice command word recognition apparatus, including:

the monitoring module is used for monitoring voice command words sent by a user;

the recognition module is used for recognizing the voice command words sent by the user to obtain a plurality of voice command word recognition results to be determined and scores of the voice command word recognition results to be determined;

the judging module is used for judging whether the recognition result of the voice command word to be determined with the first score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word;

In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method for recognizing a speech command word according to any one of the above when executing the computer program.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech command word recognition method as described in any of the above.

According to the embodiments of the present application, the voice command word with the largest effective phoneme ratio is used as the preset voice command word, where effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word. The voice command word sent by the user is monitored and recognized to obtain a plurality of voice command word recognition results to be determined and a score for each of them. When the recognition result with the first score is found not to be the preset voice command word, the voice command word sent by the user is determined in combination with the method of delaying the voice command word recognition time. This solves the technical problem that voice command words with the same prefix are easily misrecognized, and improves the recognition accuracy of such command words. Among voice command words with the same prefix, the one whose non-prefix part accounts for a larger share has a larger influence on the overall score, so misrecognition can be greatly reduced. In addition, the processing logic is simple, no extra inference time is added, and the misrecognition problem is solved without retraining the model. The selection of command words also becomes more flexible. Previously, in such cases, command words with high distinguishability and large pronunciation differences were chosen deliberately, which helped recognition but departed from natural speaking habits. For example, the misrecognition problem could be solved by changing volume up and volume down into increment volume and turn down volume, but this does not match people's everyday habits.

Drawings

In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart illustrating a method for recognizing a voice command word according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for determining a voice command word issued by a user in combination with a method for delaying a voice command word recognition time according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of determining a voice command word issued by a user after a delay of voice command word recognition time according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a voice command word recognition device according to an embodiment of the present application;

FIG. 5 is a schematic block diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Embodiment one:

Referring to FIG. 1, an embodiment of the present application provides a voice command word recognition method, which includes steps S1-S5:

S1, monitoring a voice command word sent by a user.

S2, recognizing the voice command words sent by the user to obtain a plurality of voice command word recognition results to be determined and scores of the voice command word recognition results to be determined.

In the embodiment of the present application, it should be noted that recognizing the voice command word sent by the user yields a sequence of voice command word recognition results to be determined, that is, a plurality of such results, and each of them has a corresponding score. For example, as shown in the following table:

TABLE 1

Voice command word recognition result to be determined	Score
volume medium	90
volume down	80
volume up	70

S3, judging whether the recognition result of the voice command word to be determined with the first score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word.

In the present embodiment, it is assumed that volume medium, volume down and volume up are to be kept from being misrecognized, with the following phonemes:

'volume up': 'V_AA_L_Y_UW_M_AH_P'

'volume down': 'V_AA_L_Y_UW_M_D_AW_N'

'volume medium': 'V_AA_L_Y_UW_M_M_IY_D_IY_AH_M'

where the left side is the voice command word and the right side is its corresponding phonemes, separated by underscores, and the prefix of the three voice command words is volume. In terms of phoneme proportion, the shares of up, down and medium in the whole voice command word increase in that order. Because volume medium has the largest effective phoneme ratio among the three, volume medium is used as the preset voice command word.
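The ratio computation described above can be sketched in Python as follows. This is a minimal illustration, not the application's implementation: the function names are hypothetical, and treating the prefix as the longest run of leading phonemes shared by all listed command words is an assumption that happens to match "volume" here.

```python
from typing import Dict, List

# Phoneme strings exactly as listed in the description.
COMMAND_PHONEMES: Dict[str, str] = {
    "volume up": "V_AA_L_Y_UW_M_AH_P",
    "volume down": "V_AA_L_Y_UW_M_D_AW_N",
    "volume medium": "V_AA_L_Y_UW_M_M_IY_D_IY_AH_M",
}


def common_prefix_length(sequences: List[List[str]]) -> int:
    """Count the leading phonemes shared by every sequence (assumed prefix)."""
    length = 0
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        length += 1
    return length


def effective_phoneme_ratios(commands: Dict[str, str]) -> Dict[str, float]:
    """(all phonemes - prefix phonemes) / all phonemes for each command word."""
    split = {word: phonemes.split("_") for word, phonemes in commands.items()}
    prefix_len = common_prefix_length(list(split.values()))
    return {word: (len(p) - prefix_len) / len(p) for word, p in split.items()}


ratios = effective_phoneme_ratios(COMMAND_PHONEMES)
# The preset voice command word is the one with the largest ratio.
preset_word = max(ratios, key=ratios.get)
```

With the three phoneme strings above, the shared prefix has 6 phonemes, so the ratios are 2/8 for volume up, 3/9 for volume down and 6/12 for volume medium, and volume medium is selected as the preset word.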

S4, if yes, determining the preset voice command word as the voice command word sent by the user.

In this embodiment of the present application, if the recognition result of the voice command word to be determined with the first score is the preset voice command word, that is, the voice command word with the largest effective phoneme ratio, it may be directly taken as the voice command word sent by the user.

S5, if not, determining the voice command word sent by the user by combining a method for delaying the voice command word recognition time.

For ease of understanding, all embodiments described below take volume medium, volume down and volume up as the voice command words with the same prefix, and take volume medium as the preset voice command word.

For example, when a voice command word issued by a user is monitored, if the recognition result with the first score is volume medium, volume medium is confirmed as the voice command word actually issued by the user. If the recognition result with the first score is volume down or volume up, the voice command word issued by the user needs to be determined in combination with the method of delaying the voice command word recognition time.

In the embodiment of the application, if the first voice command word recognition result to be determined is not the preset voice command word, that is, is not the voice command word with the largest effective phoneme ratio, in order to avoid misrecognition, the voice command word sent by the user needs to be determined by combining a method for delaying the voice command word recognition time, so that the technical problem that the voice command word with the same prefix is easy to be misrecognized can be solved.
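Steps S3 to S5 can be sketched as a small first-stage check. The names `PRESET_WORD`, `top_scoring` and `first_stage` are hypothetical, and representing the recognition results as a word-to-score dictionary is an assumption made for illustration.

```python
from typing import Dict, Optional

# Assumed preset word: the one with the largest effective phoneme ratio.
PRESET_WORD = "volume medium"


def top_scoring(results: Dict[str, float]) -> str:
    """Return the recognition result with the first (highest) score."""
    return max(results, key=results.get)


def first_stage(results: Dict[str, float]) -> Optional[str]:
    """S3/S4: accept the preset word if it ranks first.

    Returns None when the first-ranked result is not the preset word,
    meaning the delay-based method of S5 is still needed.
    """
    best = top_scoring(results)
    return best if best == PRESET_WORD else None


# Scores from Table 1: volume medium ranks first, so it is accepted directly.
table_1 = {"volume medium": 90, "volume down": 80, "volume up": 70}
accepted = first_stage(table_1)

# A case where volume up ranks first: S5 must decide.
needs_delay = first_stage({"volume up": 85, "volume medium": 60})
```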

Referring to FIG. 2, in one embodiment, the step of determining the voice command word issued by the user in combination with the method of delaying the voice command word recognition time includes steps S51-S5322:

S51, judging whether the score of the voice command word recognition result to be determined with the first score is larger than a corresponding confidence threshold value.

In this embodiment of the present application, the confidence threshold is the score above which the recognition result of a voice command word is considered reliable; for example, a result scoring more than 90 points is considered reliable and can be determined directly. In addition, each voice command word has its own confidence threshold, e.g., the confidence threshold of volume down is T1 and the confidence threshold of volume up is T2.

S52, if so, determining the recognition result of the voice command word to be determined with the first score as the voice command word sent by the user.

In the embodiment of the present application, suppose the voice command words with the same prefix include volume medium, volume down and volume up. If the recognition result of the voice command word to be determined with the first score is volume down, and the score of volume down is greater than the corresponding confidence threshold T1, then volume down can be determined as the voice command word sent by the user. Likewise, if the recognition result of the voice command word to be determined with the first score is volume up, and the score of volume up is greater than the corresponding confidence threshold T2, then volume up may be determined as the voice command word sent by the user.

S53, if not, judging whether the score of the recognition result of the voice command word to be determined with the second score is larger than a corresponding judgment threshold value.

In this embodiment of the present application, each voice command word has a corresponding judgment threshold. If the score of a voice command word lies between its judgment threshold and its confidence threshold, the recognition result to be determined is uncertain and further judgment is needed. If the score of the voice command word is less than the judgment threshold, the voice command word is considered unreliable and can be skipped directly.

S531, if not, re-identifying.

S532, if yes, updating the recognition result of the voice command word to be determined with the first score by using the recognition result of the voice command word to be determined with the second score, and judging whether the updated recognition result of the voice command word to be determined with the first score is the preset voice command word.

In this embodiment of the present application, if the score of the recognition result with the second score is greater than the corresponding judgment threshold, the recognition result with the first score is updated with the recognition result with the second score, and it is judged whether the updated recognition result with the first score is the preset voice command word. For example, if the result with the first score is volume up and its score is not greater than its confidence threshold, while the result with the second score is volume medium and its score is greater than its judgment threshold, then volume medium replaces volume up, i.e., volume medium becomes the voice command word recognition result with the first score.

S5321, if yes, determining the preset voice command word as the voice command word sent by the user.

In this embodiment of the present application, taking the above example as an example, if the recognition result of the voice command word to be determined with the first updated score is volume medium, then volume medium is determined as the voice command word sent by the user.

S5322, if not, determining the voice command word sent by the user by delaying the voice command word recognition time.

In the embodiment of the present application, taking the above example as an example, if the updated voice command word is volume up or volume down, then it is necessary to determine the voice command word issued by the user by delaying the voice command word recognition time.
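The branch structure of S51 to S5322 can be sketched as follows. The threshold values (90 and 60 for every word), the tuple return convention and the handling of a single-candidate result are all assumptions made for illustration; the application only states that each command word has its own confidence threshold and judgment threshold.

```python
from typing import Dict, Optional, Tuple

PRESET_WORD = "volume medium"

# Hypothetical per-word thresholds; the source gives 90 only as an example
# of a confidence threshold and does not specify judgment threshold values.
CONFIDENCE = {"volume up": 90, "volume down": 90, "volume medium": 90}
JUDGMENT = {"volume up": 60, "volume down": 60, "volume medium": 60}


def second_stage(results: Dict[str, float]) -> Tuple[str, Optional[str]]:
    """Return ("accept", word), ("delay", word) or ("re-identify", None)."""
    ranked = sorted(results, key=results.get, reverse=True)
    first = ranked[0]
    # S51/S52: a first-ranked score above its confidence threshold is reliable.
    if results[first] > CONFIDENCE[first]:
        return ("accept", first)
    # S53/S531: otherwise the second-ranked result must clear its judgment
    # threshold (assumption: with no second candidate, re-identify).
    if len(ranked) < 2 or results[ranked[1]] <= JUDGMENT[ranked[1]]:
        return ("re-identify", None)
    # S532: the second-ranked result becomes the new first-ranked result.
    updated_first = ranked[1]
    if updated_first == PRESET_WORD:
        return ("accept", updated_first)  # S5321
    return ("delay", updated_first)  # S5322
```

For instance, `second_stage({"volume up": 80, "volume medium": 70})` follows the example in the text: volume up does not clear its confidence threshold, volume medium clears its judgment threshold, and volume medium is accepted.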

Referring to FIG. 3, in one embodiment, the determining the voice command word issued by the user by delaying the voice command word recognition time includes steps S53221-S532232:

S53221, judging whether the recognized voice command word recognition result is the preset voice command word or not and whether the score of the recognized voice command word recognition result is larger than the corresponding judgment threshold value or not in the voice command word delay recognition time.

S53222, if yes, determining the preset voice command word as the voice command word sent by the user.

In this embodiment of the present application, taking the above example as an example, if the recognized voice command word is volume medium, and the score of volume medium is greater than the corresponding judgment threshold, then volume medium may be determined as the voice command word issued by the user.

S53223, if not, storing and updating the score of the voice command word recognition result in real time in a mode of replacing the low score with the high score, and judging whether the score of the voice command word recognition result to be determined with the first score is larger than the corresponding confidence threshold value or not when the delay time is over.

In the embodiment of the present application, taking the above example as an example, if the recognized voice command word is volume up, volume up is saved, and its score is updated whenever a later recognition of volume up yields a higher score. For example, if the score of volume up is 70 points at time t1 and 80 points at time t2, the stored score is updated by replacing 70 points with 80 points.

S532231, if yes, determining the recognition result of the voice command word to be determined with the first score as the voice command word sent by the user.

S532232, if not, re-identifying.
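The delay-window logic of S53221 to S532232 can be sketched as a loop over recognition frames. The frame format `(elapsed seconds, word, score)`, the threshold values and the function name `resolve_with_delay` are hypothetical; frames are assumed to arrive in time order.

```python
from typing import Dict, Iterable, Optional, Tuple

PRESET_WORD = "volume medium"

# Hypothetical thresholds, as in the earlier sketches.
CONFIDENCE = {"volume up": 90, "volume down": 90, "volume medium": 90}
JUDGMENT = {"volume up": 60, "volume down": 60, "volume medium": 60}

# (seconds elapsed since the delay started, recognized word, score)
Frame = Tuple[float, str, float]


def resolve_with_delay(frames: Iterable[Frame], delay_s: float) -> Optional[str]:
    """Return the determined command word, or None to signal re-identification."""
    best: Dict[str, float] = {}
    for elapsed, word, score in frames:
        if elapsed > delay_s:
            break  # the delay recognition time is over
        # S53221/S53222: the preset word above its judgment threshold wins.
        if word == PRESET_WORD and score > JUDGMENT[word]:
            return word
        # S53223: keep only the highest score seen so far for each word.
        if score > best.get(word, float("-inf")):
            best[word] = score
    if not best:
        return None
    # At the end of the delay, compare the first-ranked score with its
    # confidence threshold (S532231 accept, S532232 re-identify).
    top = max(best, key=best.get)
    return top if best[top] > CONFIDENCE[top] else None
```

In the example from the text, volume up scoring 70 at t1 and 80 at t2 leaves a stored score of 80, which is below the assumed confidence threshold of 90, so the sketch returns `None` and recognition restarts.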

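In one embodiment, before the step of determining whether the recognized voice command word recognition result is the preset voice command word and whether the score of the recognized voice command word recognition result is greater than the corresponding determination threshold value within the voice command word delay recognition time, the method includes:

matching the updated voice command word recognition result to be determined with the first score against the voice command words in the voice command word library to obtain a matched voice command word;

searching the voice command word-delay recognition time mapping library for the delay recognition time corresponding to the matched voice command word.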

In the embodiment of the present application, it should be noted that, because the pronunciation time of each voice command word is different, a delay recognition time needs to be designed for each voice command word. Therefore, it is necessary to construct a voice command word library and a voice command word-delay recognition time mapping library in advance, and the voice command word-delay recognition time mapping library may also serve as the voice command word library. It should be noted that the voice command word library includes a plurality of voice command words with identical prefixes.

In one embodiment, the delay recognition time corresponding to each voice command word in the voice command word-delay recognition time mapping library is determined according to the pronunciation time of the voice command word itself and the pronunciation time of the voice command word with the same prefix and the longest pronunciation as the voice command word.

In the embodiment of the present application, the voice command words with the same prefix include volume medium, volume down and volume up. The delay recognition time (which may also be understood as the delay duration) corresponding to volume up is determined according to the difference between the longest and the shortest pronunciation times; for example, volume medium takes 0.5 s longer to pronounce than volume up, so this value is set to 0.5 s. The delay recognition time corresponding to volume down is 0.3 s (medium takes 0.3 s longer to pronounce than down, so it is set to 0.3 s).
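One way to build the voice command word-delay recognition time mapping library is sketched below. The pronunciation durations are hypothetical values chosen only so that the differences reproduce the 0.5 s and 0.3 s figures in the text; milliseconds are used to avoid floating-point rounding, and all names are illustrative.

```python
from typing import Dict

# Hypothetical pronunciation times in milliseconds for command words that
# share the prefix "volume"; only the differences come from the description.
PRONUNCIATION_MS: Dict[str, int] = {
    "volume up": 700,
    "volume down": 900,
    "volume medium": 1200,
}


def build_delay_map(durations_ms: Dict[str, int]) -> Dict[str, int]:
    """Delay for each word = longest same-prefix pronunciation - its own."""
    longest = max(durations_ms.values())
    return {word: longest - own for word, own in durations_ms.items()}


DELAY_MS = build_delay_map(PRONUNCIATION_MS)


def lookup_delay_ms(matched_word: str) -> int:
    """Look up the delay recognition time for a matched command word."""
    return DELAY_MS[matched_word]
```

Under these assumed durations the map gives 500 ms for volume up, 300 ms for volume down and 0 ms for volume medium, which is consistent with the preset word needing no delay.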

In one embodiment, the plurality of voice command word recognition results to be determined include a plurality of voice command word recognition results having the same prefix.

In one embodiment, the voice command word recognition result to be determined includes any two or three of volume up, volume down, and volume medium.

Embodiment two:

Referring to FIG. 4, an embodiment of the present application provides a voice command word recognition device, including:

the monitoring module 1 is used for monitoring voice command words sent by a user;

the recognition module 2 is used for recognizing the voice command words sent by the user to obtain a plurality of voice command word recognition results to be determined and scores of the voice command word recognition results to be determined;

the judging module 3 is used for judging whether the recognition result of the voice command word to be determined with the first score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word;

the first determining module 4 is configured to determine, if yes, the preset voice command word as a voice command word sent by the user;

and the second determining module 5 is used for determining the voice command word sent by the user by combining the method of delaying the voice command word recognition time if not.

In one embodiment, the second determining module 5 specifically includes:

the first judging unit is used for judging whether the score of the voice command word recognition result to be determined with the first score is larger than a corresponding confidence threshold value;

the first determining unit is used for determining, if yes, the recognition result of the voice command word to be determined with the first score as the voice command word sent by the user;

the second judging unit is used for judging, if not, whether the score of the recognition result of the voice command word to be determined with the second score is larger than the corresponding judgment threshold value;

the re-identification unit is used for re-identifying if not;

the updating judging unit is used for, if yes, updating the recognition result of the voice command word to be determined with the first score by using the recognition result of the voice command word to be determined with the second score, and judging whether the updated recognition result of the voice command word to be determined with the first score is the preset voice command word;

the second determining unit is used for determining, if yes, the preset voice command word as the voice command word sent by the user;

and the third determining unit is used for determining the voice command word sent by the user by delaying the voice command word recognition time if not.

In one embodiment, the third determining unit is specifically configured to:

if not, re-identifying.

In one embodiment, before the step of determining whether the recognized voice command word recognition result is the preset voice command word and whether the score of the recognized voice command word recognition result is greater than the corresponding determination threshold value within the voice command word delay recognition time, the third determining unit is further configured to

Embodiment III:

Referring to FIG. 5, the embodiment of the present application further provides a computer device, and an internal structure of the computer device may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the voice command word recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. Further, the above-mentioned computer device may be further provided with an input device, a display screen, and the like.
The computer program is executed by a processor to realize a voice command word recognition method, comprising the following steps: monitoring a voice command word sent by a user; recognizing the voice command words sent by the user to obtain a plurality of voice command word recognition results to be determined and scores of the voice command word recognition results to be determined; judging whether the recognition result of the voice command word to be determined with the first score is a preset voice command word, wherein the preset voice command word is the voice command word with the largest effective phoneme ratio, and effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word; if yes, determining the preset voice command word as the voice command word sent by the user; if not, determining the voice command word sent by the user by combining the method for delaying the voice command word recognition time. Those skilled in the art will appreciate that the architecture shown in FIG. 5 is merely a block diagram of a portion of the architecture in connection with the present application and is not intended to limit the computer device to which the present application is applied.

An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, which, when executed by a processor, implements a voice command word recognition method comprising the following steps: monitoring a voice command word issued by a user; recognizing the voice command word issued by the user to obtain a plurality of to-be-determined voice command word recognition results and a score for each of them; judging whether the to-be-determined voice command word recognition result ranked first by score is a preset voice command word, the preset voice command word being the voice command word with the largest effective phoneme ratio, where effective phoneme ratio = (all phonemes of the voice command word - prefix phonemes) / all phonemes of the voice command word; if yes, determining the preset voice command word as the voice command word issued by the user; if not, determining the voice command word issued by the user in combination with a method of delaying the voice command word recognition time.
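The top-level branch of the steps above can be sketched as follows. This is only an illustrative Python outline under stated assumptions: the recognizer output is assumed to be a mapping from each to-be-determined command word to its score, and the delayed recognition procedure is passed in as a caller-supplied function because its detailed steps are not restated in this passage. The names `recognize_command` and `delayed_recognition` are hypothetical.

```python
from typing import Callable, Dict, Optional


def recognize_command(
    scores: Dict[str, float],
    preset_word: str,
    delayed_recognition: Callable[[], Optional[str]],
) -> Optional[str]:
    """Accept the preset word if it ranks first by score; otherwise defer.

    scores: hypothetical recognizer output, command word -> score.
    delayed_recognition: caller-supplied stand-in for the method of delaying
        the voice command word recognition time; may return None when no
        command word is determined and recognition has to be repeated.
    """
    if not scores:
        raise ValueError("at least one recognition result is required")

    # The to-be-determined recognition result ranked first by score.
    top_word = max(scores, key=lambda word: scores[word])

    if top_word == preset_word:
        return preset_word

    # Not the preset word: fall back to the delayed-recognition path.
    return delayed_recognition()
```

For example, calling `recognize_command({"volume medium": 0.9, "volume up": 0.5}, "volume medium", lambda: None)` returns the preset word directly, whereas a first-ranked "volume up" would hand the decision to `delayed_recognition`.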

Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, apparatus, article, or method that comprises the element.

The foregoing describes only preferred embodiments of the present application and is not intended to limit the scope of the claims. All equivalent structures or equivalent process transformations made using the description and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of the claims of the present application.

Claims

1. A method for recognizing a voice command word, comprising:

monitoring a voice command word sent by a user;

if not, determining the voice command word sent by the user by combining a method for delaying the voice command word recognition time;

wherein determining the voice command word sent by the user by combining the method for delaying the voice command word recognition time comprises the following steps:

if not, re-identifying;

if not, determining the voice command word sent by the user by delaying the voice command word recognition time;

the step of determining the voice command word sent by the user by delaying the voice command word recognition time comprises the following steps:

if not, re-identifying.

2. The voice command word recognition method according to claim 1, wherein before the step of judging whether the recognized voice command word recognition result is the preset voice command word and whether the score of the recognized voice command word recognition result is greater than the corresponding judgment threshold value within the voice command word delay recognition time, comprising:

3. The voice command word recognition method according to claim 2, wherein the delay recognition time corresponding to each voice command word in the voice command word-delay recognition time map library is determined based on the pronunciation time of the voice command word itself and the pronunciation time of the longest-sounding voice command word having the same prefix as the voice command word.

4. The method of claim 1, wherein the plurality of recognition results of the voice command word to be determined comprises a plurality of recognition results of the voice command word having the same prefix.

5. The voice command word recognition method according to claim 4, wherein the voice command word recognition result to be determined includes any two or three of volume up, volume down, and volume medium.

6. A voice command word recognition apparatus, comprising:

the first determining module is used for determining the preset voice command word as the voice command word sent by the user if the voice command word is the preset voice command word;

the second determining module is used for determining the voice command word sent by the user by combining a method for delaying the voice command word recognition time if not;

the second determining module specifically includes:

the re-identification unit is used for re-identifying if not;

a third determining unit, configured to determine, if not, a voice command word issued by the user by delaying the voice command word recognition time;

the third determining unit is specifically configured to:

if not, re-identifying.

7. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the voice command word recognition method according to any one of claims 1 to 5.

8. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the steps of the voice command word recognition method according to any one of claims 1 to 5.