WO2018219023A1 - 一种语音关键词识别方法、装置、终端及服务器 - Google Patents

一种语音关键词识别方法、装置、终端及服务器 Download PDF

Info

Publication number
WO2018219023A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
frame
target
voice
sequence
Prior art date
Application number
PCT/CN2018/079769
Other languages
English (en)
French (fr)
Inventor
王珺
黄志恒
于蒙
蒲松柏
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018219023A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present invention relates to the field of voice recognition technology, and in particular, to a voice keyword recognition method, device, terminal, and server.
  • voice wake-up technology is more and more widely used in electronic devices, which greatly facilitates users' operation of electronic devices, allowing users to interact with an electronic device without manual operation.
  • when a preset voice keyword is detected, it activates the corresponding processing module in the electronic device.
  • for example, Apple's mobile phone uses the keyword "siri" as the voice keyword for activating the voice dialogue assistant function in the Apple mobile phone.
  • when the Apple mobile phone detects that the user has input voice including the keyword "siri", it automatically activates the voice dialogue assistant feature in the Apple mobile phone.
  • a voice keyword recognition method, device, terminal and server are provided to realize the recognition of voice keywords in voice, which is crucial for the development of voice wake-up technology.
  • an embodiment of the present invention provides a voice keyword recognition method, apparatus, terminal, and server to implement voice keyword recognition in voice.
  • the embodiment of the present invention provides the following technical solutions:
  • a voice keyword recognition method includes:
  • selecting a first target frame from a first frame sequence constituting a first voice;
  • selecting a keyword from a keyword sequence as a target keyword, wherein the keyword sequence belongs to the voice keyword;
  • determining whether a hidden layer feature vector of the first target frame matches a keyword template corresponding to the target keyword, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
  • if, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched, determining that the first voice includes the voice keyword.
  • a voice keyword recognition device includes:
  • a first target frame determining unit configured to select a first target frame from a first frame sequence constituting the first voice
  • a target keyword determining unit configured to select a keyword from the keyword sequence as the target keyword, wherein the keyword sequence belongs to the voice keyword
  • a matching unit configured to determine whether a hidden layer feature vector of the first target frame matches the keyword template corresponding to the target keyword, wherein the keyword template indicates a hidden layer feature vector of the second target frame in the second voice including the target keyword;
  • an identifying unit configured to determine that the first voice includes the voice keyword if, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched.
  • a terminal includes a memory for storing a program and a processor for calling the program, the program being used to:
  • select a first target frame from a first frame sequence constituting a first voice;
  • select a keyword from a keyword sequence as a target keyword, wherein the keyword sequence belongs to the voice keyword;
  • determine whether a hidden layer feature vector of the first target frame matches a keyword template corresponding to the target keyword, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
  • if, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched, determine that the first voice includes the voice keyword.
  • a voice keyword recognition server includes a memory and a processor, the memory is used to store a program, and the processor calls the program, the program is used to:
  • select a first target frame from a first frame sequence constituting a first voice;
  • select a keyword from a keyword sequence as a target keyword, wherein the keyword sequence belongs to the voice keyword;
  • determine whether a hidden layer feature vector of the first target frame matches a keyword template corresponding to the target keyword, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice that includes the target keyword;
  • if, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched, determine that the first voice includes the voice keyword.
  • a computer readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of the first aspect.
  • a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
  • the embodiments of the invention disclose a voice keyword recognition method, device, terminal, and server, which determine a first target frame from a first frame sequence constituting a first voice, and determine a target keyword from the keyword sequence included in the voice keyword; when the hidden layer feature vector of the first target frame is determined to match the keyword template corresponding to the target keyword (the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword), and, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched, it is determined that the first voice includes the voice keyword. In this manner, recognition of the voice keyword in the first voice is effectively implemented. Further, this facilitates an electronic device using voice wake-up technology in automatically activating the processing module corresponding to the voice keyword when it recognizes that the voice keyword is included in the first voice.
  • FIG. 1 is a schematic structural diagram of a voice keyword recognition server according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method for identifying a voice keyword according to an embodiment of the present application
  • FIG. 3 is a flowchart of another method for identifying a voice keyword according to an embodiment of the present application
  • FIG. 4 is a flowchart of a method for selecting a frame from a first frame sequence constituting a first voice to be determined as a first target frame according to an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a method for selecting a keyword from a keyword sequence included in a voice keyword to be determined as a target keyword according to an embodiment of the present disclosure
  • FIG. 6 is a flowchart of a method for generating a keyword template corresponding to a target keyword according to an embodiment of the present disclosure
  • FIG. 7 is a flowchart of a method for selecting, from a second frame sequence, the frame with the highest degree of similarity to a target keyword as a second target frame, based on the final layer feature vector corresponding to each frame, according to an embodiment of the present application;
  • FIG. 8 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a keyword template generating unit according to an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of a second target frame determining unit according to an embodiment of the present disclosure.
  • the embodiment of the present application provides a voice keyword identification method, which is applied to a terminal or a server.
  • the terminal is an electronic device, for example, a mobile terminal, a desktop, or the like.
  • the above is only an optional form of the terminal provided by the embodiment of the present application.
  • the specific form of the terminal may be set arbitrarily according to actual requirements, which is not limited herein.
  • the function of the server to which the voice keyword recognition method provided by the embodiment of the present application is applied (referred to herein as a voice keyword recognition server) may be implemented by a single server or by a server cluster composed of multiple servers, which is not limited herein.
  • the voice keyword recognition server includes a processor 11, a memory 12, a communication interface 13, and a communication bus 14.
  • the processor 11, the memory 12, and the communication interface 13 communicate with each other via the communication bus 14.
  • the communication interface 13 may be an interface of the communication module, such as an interface of a Global System for Mobile Communication (GSM) module.
  • the processor 11 is configured to execute a program.
  • the processor 11 may be a central processing unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.
  • the memory 12 is used to store a program.
  • the program can include program code, the program code including computer operating instructions.
  • the program may include a program corresponding to the user interface editor described above.
  • the memory 12 may include a high-speed random access memory (RAM), and may also include a non-volatile memory (NVM), such as at least one disk memory.
  • the program can be specifically used to:
  • if the matching is successful, and, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been determined to be successfully matched, it is determined that the first voice includes the voice keyword.
  • the structure of a terminal provided by the embodiment of the present application includes at least the structure of the voice keyword recognition server as shown in FIG. 1 above.
  • for the structure of the terminal, refer to the description of the structure of the voice keyword recognition server; details are not repeated here.
  • the embodiment of the present application provides a flowchart of a voice keyword recognition method, which is shown in FIG. 2 .
  • the method includes:
  • S201: Select a first target frame from the first frame sequence constituting the first voice;
  • step S203: Determine whether the hidden layer feature vector of the first target frame matches the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; if the hidden layer feature vector of the first target frame is successfully matched with the keyword template corresponding to the target keyword, step S204 is performed.
  • a voice model is pre-set, and the second voice (composed of a second frame sequence) that includes the target keyword is input into the voice model to obtain the hidden layer feature vector of the second target frame in the second voice.
  • the keyword template corresponding to the target keyword indicates the obtained hidden layer feature vector.
  • the speech model is generated based on a Long Short-Term Memory (LSTM) and a Connectionist Temporal Classification (CTC).
  • the above is only an optional manner of generating the voice model provided by the embodiment of the present application.
  • the specific generation process of the voice model may be set arbitrarily according to actual needs, which is not limited herein.
  • the first voice, which includes the first frame sequence, is input into the voice model, and the hidden layer feature vector corresponding to the first target frame in the first voice is obtained.
  • the hidden layer feature vector of the first target frame is then matched against the keyword template corresponding to the target keyword to determine whether they match; if the matching is successful, step S204 is performed.
  • determining whether the hidden layer feature vector of the first target frame matches the keyword template corresponding to the target keyword includes: calculating the cosine distance between the hidden layer feature vector of the first target frame and the keyword template corresponding to the target keyword; if the calculated cosine distance meets the preset value, it is determined that the hidden layer feature vector of the first target frame matches the keyword template corresponding to the target keyword; if the calculated cosine distance does not meet the preset value, it is determined that the matching of the hidden layer feature vector of the first target frame with the keyword template corresponding to the target keyword fails.
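As a concrete illustration, the cosine-distance test above can be sketched as follows. The threshold value and the direction of the comparison are assumptions; the text only states that the distance must meet a preset value:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between a frame's hidden layer feature vector
    and a keyword template (both 1-D vectors)."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def matches(hidden_vec, template, max_distance=0.2):
    """Hypothetical match test: accept when the cosine distance does
    not exceed a preset value (0.2 here is an illustrative choice)."""
    return cosine_distance(hidden_vec, template) <= max_distance
```

Because cosine distance depends only on direction, vectors pointing the same way match regardless of magnitude, which makes the test robust to overall scaling of the hidden layer feature vectors.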
  • after step S203, it is determined whether, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched; if so, it is determined that the voice keyword is included in the first voice.
  • FIG. 3 is a flowchart of another voice keyword recognition method according to an embodiment of the present application.
  • the method includes:
  • step S303: Determine whether the hidden layer feature vector of the first target frame matches the keyword template corresponding to the target keyword, where the keyword template indicates the hidden layer feature vector of the second target frame in the second voice that includes the target keyword; if the matching is successful, step S304 is performed; if the matching is unsuccessful, the process returns to step S301;
  • step S304: Determine whether, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched; if yes, step S305 is performed; otherwise, the process returns to step S301;
  • the condition that, for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been successfully matched includes: for each keyword in the keyword sequence, a hidden layer feature vector of some frame in the first voice has been successfully matched with the corresponding keyword template; and sorting the successfully matched keywords in the order in which they were matched yields the keyword sequence.
  • a flowchart of a method for determining a frame from the first frame sequence constituting the first voice as the first target frame is provided, as shown in FIG. 4.
  • the method includes:
  • the determined frame is used as a first target frame determined from a first frame sequence constituting the first voice.
  • the first speech comprises a first sequence of frames
  • the first sequence of frames is composed of at least one frame arranged in sequence.
  • determining a frame from the first frame sequence constituting the first voice as the first target frame includes: selecting, from the first frame sequence and in the order of the frames in the first frame sequence, a frame that has never been determined as the first target frame, and using it as the first target frame.
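A minimal sketch of this frame-selection rule; the `used` bookkeeping set is an illustrative assumption about how "never determined as the first target frame" is tracked:

```python
def next_first_target_frame(frames, used):
    """Return the first frame, in sequence order, that has never been
    determined as the first target frame, recording its index in
    `used`; return None once every frame has been used."""
    for i, frame in enumerate(frames):
        if i not in used:
            used.add(i)
            return frame
    return None
```

Calling it repeatedly with the same `used` set walks through the first frame sequence one frame at a time, which is exactly the "return to step S301 with the next frame" behavior of FIG. 3.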
  • a flowchart for selecting a keyword from the keyword sequence included in the voice keyword as the target keyword is provided, as shown in FIG. 5.
  • the method includes:
  • S501 Determine, from a keyword sequence included in the voice keyword, a next keyword adjacent to the keyword corresponding to the keyword template that has been successfully matched last time;
  • the keyword sequence is composed of multiple keywords that are sequentially sorted.
  • for example, the keyword sequence included in the voice keyword is "Little Red Hello";
  • the keyword corresponding to the keyword template of the last successful match is "red";
  • in the keyword sequence included in the voice keyword, the next keyword adjacent to the keyword corresponding to the last successfully matched keyword template is the keyword "you".
  • step S502: Determine whether the number of times the next keyword has been continuously determined as the target keyword reaches a preset threshold; if the number of times does not reach the preset threshold, step S503 is performed; if the number of times reaches the threshold, step S504 is performed;
  • the preset threshold is 30 times.
  • the foregoing is only an optional value of the threshold provided by the embodiment of the present application.
  • the specific value of the threshold may be set according to actual needs, which is not limited herein.
  • the first keyword in the keyword sequence is determined as the target keyword; for example, the first keyword "Small" in the keyword sequence is determined as the target keyword.
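Steps S501–S504 can be sketched as below. The counter handling is an assumption: the text only states that once the threshold (30 in the example) is reached, the first keyword of the sequence is chosen again. The spelled-out keyword list is likewise an assumption based on the "Little Red Hello" example:

```python
def select_target_keyword(keywords, last_matched, consecutive, threshold=30):
    """Pick the keyword following the last successfully matched one
    (last_matched == -1 means none matched yet); once that keyword has
    been chosen `threshold` times in a row, fall back to the first
    keyword. Returns (index of target keyword, updated counter)."""
    if consecutive >= threshold:
        return 0, 0                       # S504: restart at the first keyword
    return last_matched + 1, consecutive + 1   # S503: keep the next keyword

# With an assumed per-character split of the example sequence:
keywords = ["Small", "Red", "You", "Good"]
idx, n = select_target_keyword(keywords, last_matched=1, consecutive=0)
# idx == 2: the keyword "You" following the last matched keyword "Red"
```

The fallback prevents the recognizer from waiting forever on a middle keyword when the speaker never completes the phrase.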
  • a flowchart of a method for generating the keyword template corresponding to the target keyword is provided, as shown in FIG. 6.
  • the method includes:
  • the process of generating the keyword template corresponding to the target keyword includes: determining a second voice that includes the target keyword, where the second voice is composed of a second frame sequence, and the second frame sequence is composed of at least one sequentially arranged frame.
  • the second voice is used as the input information of the preset voice model, and the final layer feature vector corresponding to each frame in the second frame sequence is determined respectively.
  • a voice model is pre-set; the input information of the voice model is a voice (e.g., the second voice) or a frame, and the output information may include a hidden layer feature vector and a final layer feature vector corresponding to each input frame.
  • the second voice is used as the input information of the voice model, and the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice is obtained.
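The model interface assumed throughout this section can be mocked as follows. This is not the patent's LSTM/CTC model, only an illustrative stand-in with random weights that returns, for each frame, a hidden layer feature vector and a final layer feature vector whose entries behave like per-character similarities (each row sums to 1):

```python
import numpy as np

def speech_model(frames, n_chars=4, hidden_dim=8, seed=0):
    """Stand-in for the preset voice model: maps each input frame
    (rows of acoustic features) to a hidden layer feature vector and a
    final layer feature vector of similarities to each character."""
    rng = np.random.default_rng(seed)
    w_hidden = rng.normal(size=(frames.shape[1], hidden_dim))
    w_final = rng.normal(size=(hidden_dim, n_chars))
    hidden = np.tanh(frames @ w_hidden)           # hidden layer feature vectors
    logits = hidden @ w_final
    final = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return hidden, final
```

In the described system the hidden layer feature vectors feed the cosine-distance template matching, while the final layer feature vectors drive the selection of the second target frame.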
  • one frame is selected from the second voice as the second target frame according to the final layer feature vector corresponding to each frame in the second frame sequence included in the second voice.
  • the second target frame is used as the input information of the voice model to obtain the hidden layer feature vector corresponding to the second target frame.
  • optionally, the process of obtaining the hidden layer feature vector corresponding to the second target frame may also be implemented in step S602: when the second voice is used as the input information of the preset voice model, both the final layer feature vector and the hidden layer feature vector corresponding to each frame in the second frame sequence are determined; then, when performing step S604, the hidden layer feature vector corresponding to the second target frame is directly acquired from the result "hidden layer feature vector corresponding to each frame in the second frame sequence" of step S602.
  • whether the hidden layer feature vector corresponding to the second target frame is obtained by using the second target frame as the input information of the voice model, or is obtained in step S602, is not limited herein.
  • optionally, the number of second voices is at least one.
  • generating the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame includes: determining the hidden layer feature vector corresponding to the second target frame in each second voice, averaging the determined hidden layer feature vectors, and using the obtained result as the keyword template corresponding to the target keyword.
  • a method for determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame is now introduced.
  • the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in the preset text set of the voice model, where the target keyword is one character in the text set.
  • the final layer feature vector corresponding to the frame includes: the similarity between the frame and each of the 5200 Chinese characters.
  • determining the second target frame from the second frame sequence based on the final layer feature vector corresponding to each frame includes: selecting, from the second frame sequence and according to the final layer feature vector corresponding to each frame, the frame with the highest degree of similarity to the target keyword as the second target frame; the degree of similarity between a frame and the target keyword is determined according to the similarities between the frame and each character in the text set.
  • the method includes:
  • S701: Determine at least one first candidate frame from the second frame sequence, where the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of such characters is less than a preset value;
  • S702. Determine at least one second candidate frame from the at least one first candidate frame, where the at least one second candidate frame is each of the first candidate frames having the greatest similarity with the target keyword in the at least one first candidate frame.
  • S703. Determine a second target frame from the at least one second candidate frame.
  • among the second candidate frames, the second target frame is the frame for which the similarity with the target keyword ranks highest when the similarities between that frame and each character are arranged in descending order; that is, its target-keyword similarity ranks higher within its own similarities than the target-keyword similarity of any other second candidate frame ranks within that frame's similarities.
  • the method of selecting, from the second frame sequence, the frame with the highest degree of similarity to the target keyword as the second target frame is now illustrated with an example:
  • suppose the preset text set in the voice model includes four characters, namely text 1, text 2, text 3, and text 4, where text 3 is the target keyword.
  • the final layer feature vector 1 includes the similarity 11 between frame 1 and text 1, the similarity 12 between frame 1 and text 2, the similarity 13 between frame 1 and text 3, and the similarity 14 between frame 1 and text 4, where the similarity 11 is 20%, the similarity 12 is 30%, the similarity 13 is 15%, and the similarity 14 is 50%;
  • the final layer feature vector 2 includes the similarity 21 between frame 2 and text 1, the similarity 22 between frame 2 and text 2, the similarity 23 between frame 2 and text 3, and the similarity 24 between frame 2 and text 4, where the similarity 21 is 15%, the similarity 22 is 5%, the similarity 23 is 65%, and the similarity 24 is 95%;
  • the final layer feature vector 3 includes the similarity 31 between frame 3 and text 1, the similarity 32 between frame 3 and text 2, the similarity 33 between frame 3 and text 3, and the similarity 34 between frame 3 and text 4, where the similarity 31 is 10%, the similarity 32 is 20%, the similarity 33 is 65%, and the similarity 34 is 30%;
  • the final layer feature vector 4 includes the similarity 41 between frame 4 and text 1, the similarity 42 between frame 4 and text 2, the similarity 43 between frame 4 and text 3, and the similarity 44 between frame 4 and text 4, where the similarity 41 is 10%, the similarity 42 is 20%, the similarity 43 is 55%, and the similarity 44 is 30%.
  • determining at least one first candidate frame from the second frame sequence, where the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set and the number of such characters is less than the preset value, works as follows: if the preset value is 3, the similarities between a frame and each character in the text set are arranged in descending order to obtain a sequence, and the frame is a first candidate frame if the similarity between that frame and the target keyword is within the first 3 positions of the sequence (the first, second, or third place).
  • at least one first candidate frame determined from the second frame sequence includes three, which are frame 2, frame 3, and frame 4.
  • the at least one second candidate frame includes two frames, namely frame 2 and frame 3, since their similarity with the target keyword (65%) is the greatest among the first candidate frames.
  • the similarity 33 corresponding to frame 3 ranks first among the similarities corresponding to frame 3, while the similarity 23 corresponding to frame 2 ranks second among the similarities corresponding to frame 2; therefore, frame 3, whose target-keyword similarity ranks first, is selected as the second target frame.
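The worked example above (frames 1–4, with text 3 as the target keyword) can be reproduced with a short sketch; indices are 0-based here, so frame 3 is index 2:

```python
import numpy as np

# Final layer feature vectors for frames 1-4 from the example above;
# columns are the similarities to text 1..4, and text 3 (index 2) is
# the target keyword.
final_vecs = np.array([
    [0.20, 0.30, 0.15, 0.50],   # frame 1
    [0.15, 0.05, 0.65, 0.95],   # frame 2
    [0.10, 0.20, 0.65, 0.30],   # frame 3
    [0.10, 0.20, 0.55, 0.30],   # frame 4
])
TARGET = 2       # index of text 3 in the text set
PRESET = 3       # target similarity must be within the first 3 places

def rank_of_target(row):
    """1-based rank of the target similarity when the row is sorted
    in descending order."""
    order = np.argsort(-row)
    return int(np.where(order == TARGET)[0][0]) + 1

# S701: frames whose target similarity is within the first PRESET places
first = [i for i, row in enumerate(final_vecs) if rank_of_target(row) <= PRESET]
# S702: among those, keep the frames with the greatest target similarity
best = max(final_vecs[i, TARGET] for i in first)
second = [i for i in first if final_vecs[i, TARGET] == best]
# S703: pick the frame whose target similarity ranks highest within its
# own similarities
second_target = min(second, key=lambda i: rank_of_target(final_vecs[i]))
# second_target == 2, i.e. frame 3, matching the example's conclusion
```

Frame 1 is excluded at S701 (its target similarity of 15% ranks fourth), frames 2 and 3 tie at 65% in S702, and S703 breaks the tie in favor of frame 3, whose target similarity ranks first among its own similarities.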
  • to present the voice keyword recognition method provided by the embodiment of the present application more clearly and completely, and to make it easier for those skilled in the art to understand, the following description is provided.
  • the method includes:
  • in this method, each frame in the first frame sequence included in the first voice is provided with a unique frame ID, where the sequence number of the frame in the first frame sequence is the frame ID of the frame.
  • the first frame sequence includes three frames that are sequentially sorted, frame 1, frame 3, and frame 2, respectively. Then, the sequence number of frame 1 is 1, the frame ID is 1, the sequence number of frame 3 is 2, the frame ID is 2, the sequence number of frame 2 is 3, and the frame ID is 3.
  • each keyword in the keyword sequence included in the voice keyword is set with a unique keyword ID, wherein the sequence number of the keyword in the keyword sequence is the keyword ID of the keyword.
  • the keyword sequence includes four keywords sorted in order, namely keyword 1, keyword 3, keyword 2, and keyword 4. Then, the sequence number of keyword 1 is 1 and its keyword ID is 1; the sequence number of keyword 3 is 2 and its keyword ID is 2; the sequence number of keyword 2 is 3 and its keyword ID is 3.
  • Keyword 4 has a serial number of 4 and a keyword ID of 4.
  • step S805: Set the counter s to the trigger initial value; n++; return to step S802;
  • the trigger initial value is the threshold involved in the foregoing step S502.
  • the initial value of the trigger is 30.
  • s-- indicates that the counter count is decremented by one.
  • FIG. 9 is a schematic structural diagram of a voice keyword recognition apparatus according to an embodiment of the present application.
  • the device includes:
  • a first target frame determining unit 91 configured to select a first target frame from a first frame sequence constituting the first voice
  • the target keyword determining unit 92 is configured to select a keyword from the keyword sequence as the target keyword, wherein the keyword sequence belongs to the voice keyword;
  • the matching unit 93 is configured to: if the hidden layer feature vector of the first target frame successfully matches the keyword template corresponding to the target keyword, determine, one by one for the keyword template corresponding to each keyword in the keyword sequence, whether a hidden layer feature vector of a frame in the first voice matches it, wherein the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword;
  • the identifying unit 94 is configured to determine that the voice keyword is included in the first voice if, one by one for the keyword template corresponding to each keyword in the keyword sequence, a hidden layer feature vector of a frame in the first voice has been determined to match it successfully. Further, the voice keyword recognition apparatus provided by the embodiment of the present application also includes a return execution unit, configured to return, when the matching fails, to the step of "selecting a frame from the first frame sequence constituting the first voice and determining it as the first target frame".
  • An embodiment of the present invention provides an optional structure of the first target frame determining unit 91.
  • the first target frame determining unit 91 includes:
  • a first determining unit configured to determine, from the first frame sequence constituting the first voice, the first frame that has never been determined as the first target frame;
  • a second determining unit configured to use the frame as the first target frame determined from the first frame sequence constituting the first voice.
  • An embodiment of the present invention provides an optional structure of the target keyword determining unit 92.
  • the target keyword determining unit 92 includes:
  • a third determining unit configured to determine, from the keyword sequence included in the voice keyword, a next keyword adjacent to a keyword corresponding to a keyword template that has been successfully matched last time;
  • a fourth determining unit configured to determine the next keyword as the target keyword if the number of times the next keyword is continuously determined as the target keyword does not reach a preset threshold;
  • a fifth determining unit configured to determine, as the target keyword, the first keyword in the keyword sequence if the number of times the next keyword is continuously determined as the target keyword reaches the threshold.
  • the voice keyword recognition apparatus provided by the embodiment of the present application further includes: a keyword template generating unit.
  • FIG. 10 An optional structure of the keyword template generating unit provided by the embodiment of the present invention is shown in FIG. 10 .
  • the keyword template generating unit includes:
  • a second voice determining unit 101 configured to determine a second voice that includes the target keyword, where the second voice is composed of a second sequence of frames;
  • the final layer feature vector determining unit 102 is configured to use the second voice as input information of a preset voice model, and determine a final layer feature vector corresponding to each frame in the second frame sequence;
  • a second target frame determining unit 103 configured to determine a second target frame from the second frame sequence according to the final layer feature vector corresponding to each frame;
  • a keyword template generating sub-unit 104 configured to generate the keyword template corresponding to the target keyword according to the hidden layer feature vector corresponding to the second target frame, which is obtained by using the second target frame as input information of the voice model.
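A sketch of template construction and matching, under two points taken from the description: the keyword template is the average of the hidden layer feature vectors of the second target frames across the one or more second voices, and matching compares a frame's hidden layer feature vector against the template by cosine similarity. Function names and the threshold value are illustrative assumptions:

```python
import numpy as np

def build_keyword_template(hidden_vectors):
    """Average the hidden layer feature vectors of the second target frame
    from each second voice containing the target keyword; the mean vector
    is the keyword template for that keyword."""
    return np.stack(hidden_vectors).mean(axis=0)

def template_matches(frame_vector, template, threshold=0.8):
    """Match a frame's hidden layer feature vector against a keyword
    template via cosine similarity; the match succeeds when the similarity
    meets the preset value (the 0.8 threshold is illustrative)."""
    sim = float(np.dot(frame_vector, template) /
                (np.linalg.norm(frame_vector) * np.linalg.norm(template)))
    return sim >= threshold
```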
  • the final layer feature vector corresponding to a frame includes the similarity between the frame and each character in a text set preset in the voice model, and the target keyword is one character in the text set;
  • the second target frame determining unit is specifically configured to: based on the final layer feature vector corresponding to each frame, select from the second frame sequence the frame with the highest degree of similarity to the target keyword as the second target frame, wherein the degree of similarity between a frame and the target keyword is determined according to the similarities between the frame and each character in the text set.
  • An embodiment of the present invention provides an optional structure of the second target frame determining unit, which is shown in FIG. 11.
  • the second target frame determining unit includes:
  • the first candidate frame determining unit 111 is configured to determine at least one first candidate frame from the second frame sequence, where the similarity between the first candidate frame and the target keyword is smaller than the similarity between the first candidate frame and at least one character in the text set, and the number of the at least one character is less than a preset value;
  • a second candidate frame determining unit 112 configured to determine at least one second candidate frame from the at least one first candidate frame, where the at least one second candidate frame comprises the first candidate frames, among the at least one first candidate frame, whose similarity to the target keyword is the highest;
  • a second target frame determining sub-unit 113 configured to determine the second target frame from the at least one second candidate frame, where, with similarities ordered from high to low, the rank of the similarity between the second target frame and the target keyword among the similarities between the second target frame and each character is higher than the rank of the similarity between each second candidate frame other than the second target frame and the target keyword among the similarities between that second candidate frame and each character.
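The two-stage candidate selection performed by units 111–113 can be sketched as follows; the test values mirror the numeric example given later in the description (four frames, a four-character text set, character 3 as the target keyword, preset value 3), where frame 3 is chosen. Tie handling is an assumption not fixed by the text:

```python
def pick_second_target_frame(sims, keyword_idx, preset=3):
    """Pick the second target frame.

    `sims[f][c]` is the similarity between frame f and character c of the
    text set. First candidates are frames whose similarity to the target
    keyword ranks within the top `preset` similarities of that frame;
    second candidates are the first candidates with the maximal keyword
    similarity; among those, the frame whose keyword similarity ranks
    highest within its own row is the second target frame.
    """
    def rank(row):  # 1 = the keyword similarity is the row's highest
        return sorted(row, reverse=True).index(row[keyword_idx]) + 1

    first = [f for f, row in enumerate(sims) if rank(row) <= preset]
    best = max(sims[f][keyword_idx] for f in first)
    second = [f for f in first if sims[f][keyword_idx] == best]
    return min(second, key=lambda f: rank(sims[f]))
```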
  • The embodiments of the invention disclose a voice keyword recognition method, apparatus, terminal, and server. A first target frame is determined from a first frame sequence constituting the first voice, and a target keyword is determined from a keyword sequence included in the voice keyword. When the hidden layer feature vector of the target frame successfully matches the keyword template corresponding to the target keyword (the keyword template indicates a hidden layer feature vector of a second target frame in a second voice including the target keyword), and it has been determined, one by one for the keyword template corresponding to each keyword in the keyword sequence, that a hidden layer feature vector of a frame in the first voice matches it, it is determined that the voice keyword is included in the first voice. This effectively implements recognition of the voice keyword in the first voice. Further, an electronic device using voice wake-up technology can automatically activate a processing module corresponding to the voice keyword upon recognizing that the voice keyword is included in the first voice.
  • The steps of a method or algorithm described in connection with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two.
  • The software module can reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A voice keyword recognition method, apparatus, terminal, and server. The method includes: selecting a frame from a first frame sequence constituting a first voice and determining it as a first target frame (S201); selecting a keyword from a keyword sequence included in a voice keyword and determining it as a target keyword (S202); determining whether a hidden layer feature vector of the first target frame successfully matches a keyword template corresponding to the target keyword (S203); and, if it has been determined, one by one for the keyword template corresponding to each keyword in the keyword sequence, that a hidden layer feature vector of a frame in the first voice successfully matches it, determining that the first voice includes the voice keyword (S204). The method effectively implements recognition of the voice keyword in the first voice; further, it enables an electronic device using voice wake-up technology to automatically activate a processing module corresponding to the voice keyword upon recognizing that the first voice includes the voice keyword.

Description

一种语音关键词识别方法、装置、终端及服务器
本申请要求于2017年5月27日提交中国专利局、申请号为201710391388.6、发明名称为“一种语音关键词识别方法、装置、终端及服务器”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及语音识别技术领域,具体涉及一种语音关键词识别方法、装置、终端及服务器。
背景技术
随着科技的发展,语音唤醒技术在电子设备中的应用越来越广泛,其极大程度的方便了用户对电子设备的操作,允许用户与电子设备之间无需手动交互,即可通过语音关键词激活电子设备中相应的处理模块。
例如,苹果手机采用关键词“siri”作为激活苹果手机中的语音对话智能助理功能的语音关键词,当苹果手机检测到用户输入包括关键词“siri”的语音时,自动激活苹果手机中的语音对话智能助理功能。
有鉴于此,提供一种语音关键词识别方法、装置、终端及服务器,以实现对语音中的语音关键词的识别,对于语音唤醒技术的发展是至关重要的。
发明内容
有鉴于此,本发明实施例提供一种语音关键词识别方法、装置、终端及服务器,以实现对语音中的语音关键词的识别。
为实现上述目的,本发明实施例提供如下技术方案:
一种语音关键词识别方法,包括:
从构成第一语音的第一帧序列中选取一个第一目标帧;
从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板 匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
一种语音关键词识别装置,包括:
第一目标帧确定单元,用于从构成第一语音的第一帧序列中选取一个第一目标帧;
目标关键字确定单元,用于从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
匹配单元,用于若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
识别单元,用于若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
一种终端,包括存储器和处理器,所述存储器用于存储程序,所述处理器调用所述程序,所述程序用于:
从构成第一语音的第一帧序列中选取一个第一目标帧;
从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位 于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
一种语音关键词识别服务器,包括存储器和处理器,所述存储器用于存储程序,所述处理器调用所述程序,所述程序用于:
从构成第一语音的第一帧序列中选取一个第一目标帧;
从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得计算机执行如第一方面所述的方法。
一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行如第一方面所述的方法。
本发明实施例公开了一种语音关键词识别方法、装置、终端及服务器,通过从构成第一语音的第一帧序列中确定第一目标帧;从语音关键词包括的关键字序列中确定目标关键字;在确定目标帧的隐层特征向量与目标关键字对应的关键字模板匹配成功时(关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量),若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,确定第一语音中包括语音关键词的方式,有效实现了对第一语音中的语音关键词的识别。进一步的,便于使用语音唤醒技术的电子设备在识别出第一语音中包括语音关键词时,自动激活与所述语音关键词相应的处理模块。
附图说明
为了更清楚地说明本发明实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。
图1为本申请实施例提供的一种语音关键词识别服务器的结构示意图;
图2为本申请实施例提供的一种语音关键词识别方法的流程图;
图3为本申请实施例提供的另一种语音关键词识别方法的流程图;
图4为本申请实施例提供的一种从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧的方法流程图;
图5为本申请实施例提供的一种从语音关键词包括的关键字序列中选取一个关键字确定为目标关键字的方法流程图;
图6为本申请实施例提供的一种与目标关键字对应的关键字模板的生成方法流程图;
图7为本申请实施例提供的一种基于分别与每个帧对应的终层特征向量,从第二帧序列中选取与目标关键字的相似程度最高的帧作为第二目标帧的方法流程图;
图8为本申请实施例提供的另一种语音关键词识别方法的流程图;
图9为本申请实施例提供的一种语音关键词识别装置的结构示意图;
图10为本申请实施例提供的一种关键字模板生成单元的详细结构示意图;
图11为本申请实施例提供的一种第二目标帧确定单元的详细结构示意图。
具体实施方式
下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
实施例:
本申请实施例提供一种语音关键词识别方法,应用于终端或服务器。
在本申请实施例中,可选的,终端为电子设备,例如,移动终端、台式机等。以上仅仅是本申请实施例提供的终端的可选方式,发明人可根据自己的需求任意设置终端的具体表现形式,在此不做限定。
可选的,应用本申请实施例提供的一种语音关键词识别方法的服务器(此处可称为语音关键词识别服务器)的功能可由单台服务器实现也可由多台服务器构成的服务器集群实现,在此不做限定。
以服务器为例,本申请实施例提供的一种语音关键词识别服务器的结构示意图,具体请参见图1。语音关键词识别服务器包括:处理器11和存储器12。
其中处理器11、存储器12、通信接口13通过通信总线14完成相互间的通信。
可选的,通信接口13可以为通信模块的接口,如全球移动通信***(Global System for Mobile Communication,GSM)模块的接口。处理器11,用于执行程序。
处理器11可能是一个中央处理器CPU,或者是特定集成电路(Application Specific Integrated Circuit,ASIC),或者是被配置成实施本发明实施例的一个或多个集成电路。
存储器12,用于存放程序。
程序可以包括程序代码,程序代码包括计算机操作指令。在本发明实施例中,程序可以包括上述用户界面编辑器对应的程序。
存储器12可能包含高速随机存取存储器(Random-Access Memory,RAM)存储器,也可能还包括非易失性存储器(non-volatile memory,NVM),例如至少一个磁盘存储器。
其中,程序可具体用于:
从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧;
从语音关键词包括的关键字序列中选取一个关键字确定为目标关键字;
确定目标帧的隐层特征向量是否与目标关键字对应的关键字模板匹配成 功,关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量;
在匹配成功的情况下,若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,确定第一语音中包括语音关键词。
相应的,本申请实施例提供的一种终端的结构中至少包括如上述图1所示的语音关键词识别服务器的结构,有关终端的结构请参见上述对语音关键词识别服务器的结构的描述,在此不做赘述。
相应的,本申请实施例提供一种语音关键词识别方法的流程图,请参见图2。
如图2所示,该方法包括:
S201、从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧；
S202、从语音关键词包括的关键字序列中选取一个关键字确定为目标关键字;
S203、确定第一目标帧的隐层特征向量是否与目标关键字对应的关键字模板匹配成功,关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量;若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则执行步骤S204。
可选的,预设有语音模型,将包括目标关键字的第二语音(第二语音包括第二帧序列)输入语音模型后,可得到第二语音中的第二目标帧的隐层特征向量,与目标关键字对应的关键字模板指示所得到的隐层特征向量。
可选的,语音模型基于时间递归神经网络(Long Short-Term Memory,LSTM)以及目标准则(Connectionist Temporal Classification,CTC)生成。
以上仅仅是本申请实施例提供的语音模型生成的可选方式,发明人可根据自己的需求任意设置语音模型的具体生成过程,在此不做限定。
可选的,将包括第一帧序列的第一语音输入语音模型,可得到与第一语音中的第一目标帧对应的隐层特征向量。
相应的,将第一目标帧的隐层特征向量与目标关键字对应的关键字模板进 行匹配,确定第一目标帧的隐层特征向量是否与目标关键字对应的关键字模板匹配成功,如果匹配成功执行步骤S204。
在本申请实施例中,可选的,确定第一目标帧的隐层特征向量是否与目标关键字对应的关键字模板匹配成功,包括:计算第一目标帧的隐层特征向量与目标关键字对应的关键字模板之间的余弦距离;若计算得到的余弦距离满足预设值,则确定第一目标帧的隐层特征向量与目标关键字对应的关键字模板匹配成功;若计算得到的余弦距离不满足预设值,则确定第一目标帧的隐层特征向量与目标关键字对应的关键字模板匹配不成功(失败)。
S204、若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,则确定第一语音中包括语音关键词。
可选的,在步骤S203确定匹配成功的情况下,判断当前是否已经逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功;如果是,确定第一语音中包括语音关键词。
图3为本申请实施例提供的另一种语音关键词识别方法的流程图。
如图3所示,该方法包括:
S301、从构成第一语音的第一帧序列中选取一个第一目标帧;
S302、从语音关键词包括的关键字序列中选取一个关键字确定为目标关键字;
S303、确定第一目标帧的隐层特征向量是否与目标关键字对应的关键字模板匹配成功,关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量;在第一目标帧的隐层特征向量与目标关键字对应的关键字模板匹配成功的情况下,执行步骤S304;在匹配不成功的情况下,返回执行步骤S301;
S304、判断是否已逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,如果是,执行步骤S305;如果否,返回执行步骤S301;
可选的,逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,包括:针对关键字序 列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功;并且,匹配关键字模板成功的各个关键字,按照匹配成功的先后顺序进行排序后得到的结果为关键字序列。
S305、确定第一语音中包括语音关键词。
为了便于对本申请实施例提供的一种语音关键词识别方法的理解,现提供一种从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧的方法流程图,请参见图4。
如图4所示,该方法包括:
S401、确定构成第一语音的第一帧序列中的、第一个从未被确定为第一目标帧的帧;
S402、将所确定的帧,作为从构成第一语音的第一帧序列中确定的第一目标帧。
可选的,第一语音包括第一帧序列,第一帧序列由依次排列的至少一个帧构成。从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧,包括:从第一帧序列中选取一个帧作为第一目标帧,第一目标帧为第一帧序列中的从未被作为第一目标帧的、且在第一帧序列中排序最靠前的帧。
为了便于对本申请实施例提供的一种语音关键词识别方法的理解,现提供一种从语音关键词包括的关键字序列中选取一个关键字确定为目标关键字的方法流程图,请参见图5。
如图5所示,该方法包括:
S501、从语音关键词包括的关键字序列中,确定与最近一次匹配成功的关键字模板对应的关键字相邻的下一关键字;
可选的,关键字序列由依次排序的多个关键字构成。
例如,若语音关键词包括的关键字序列为“小红你好”时,若最近一次匹配成功的关键模板对应的关键字为“红”,则语音关键词包括的关键字序列中的,与最近一次匹配成功的关键字模板对应的关键字相邻的下一关键字为关键字“你”。
S502、判断下一关键字被连续确定为目标关键字的次数是否达到预设的阈 值;若下一关键字被连续确定为目标关键字的次数未达到预设的阈值,则执行步骤S503;若下一关键字被连续确定为目标关键字的次数达到阈值,则执行步骤S504;
可选的,预设的阈值为30次,以上仅仅是本申请实施例提供的阈值的可选方式,发明人可根据自己的需求任意设置阈值的具体内容,在此不做限定。
S503、将下一关键字确定为目标关键字;
S504、将关键字序列中的第一个关键字确定为目标关键字。
例如,若语音关键词包括的关键字序列为“小红你好”时,将关键字序列中的第一个关键字确定为目标关键字,包括:将关键字序列中的第一个关键字“小”,确定为目标关键字。
为了便于对本申请实施例提供的一种语音关键词识别方法的理解,现提供一种与目标关键字对应的关键字模板的生成方法流程图,请参见图6。
如图6所示,该方法包括:
S601、确定包括目标关键字的第二语音,第二语音由第二帧序列构成;
可选的,生成与目标关键字对应的关键字模板的过程包括:确定包括目标关键字的第二语音,第二语音由第二帧序列构成,第二帧序列由依次排列的至少一个帧构成。
S602、将第二语音作为预设的语音模型的输入信息,确定分别与第二帧序列中的每个帧对应的终层特征向量;
可选的,预设有语音模型,语音模型的输入信息为语音(如第二语音)/帧,输出信息可包括分别与输入的每个帧对应的隐层特征向量和终层特征向量。
在本申请实施例中,可选的,将第二语音作为语音模型的输入信息,得到第二语音包括的第二帧序列中的每个帧对应的终层特征向量。
S603、基于分别与每个帧对应的终层特征向量,从第二帧序列中确定第二目标帧;
可选的,根据第二语音包括的第二帧序列中的每个帧对应的终层特征向量,从第二语音中选取一个帧作为第二目标帧。
S604、根据将第二目标帧作为语音模型的输入信息所得到的与第二目标帧对应的隐层特征向量,生成与目标关键字对应的关键字模板。
可选的,第二目标帧作为语音模型的输入信息,得到的与第二目标帧对应的隐层特征向量的过程,可以在步骤S602中实现,将第二语音作为预设的语音模型的输入信息,确定分别与第二帧序列中的每个帧对应的终层特征向量,以及分别与第二帧序列中的每个帧对应的隐层特征向量;进而,在步骤S604执行过程中,直接从步骤S602的“分别与第二帧序列中的每个帧对应的隐层特征向量”结果中,直接获取与第二目标帧对应的隐层特征向量。
以上仅仅是本申请实施例的可选方式,发明人可根据自己的需求任意设置“将第二目标帧作为语音模型的输入信息所得到的与第二目标帧对应的隐层特征向量”的实现方式,如将“将第二目标帧作为语音模型的输入信息所得到的与第二目标帧对应的隐层特征向量”过程独立于步骤S602实现,在此不做限定。
可选的,第二语音的个数为至少一个,根据与第二目标帧对应的隐层特征向量,生成与目标关键字对应的关键字模板,包括:确定分别与每个第二语音的第二目标帧对应的隐层特征向量,对所确定的各个隐层特征向量求平均,并将所得到的结果作为与目标关键字对应的关键字模板。
为了便于对本申请实施例提供的一种语音关键词识别方法的理解,现提供一种基于分别与每个帧对应的终层特征向量,从第二帧序列中确定第二目标帧的方法进行详细介绍。
在本申请实施例中,可选的,帧对应的终层特征向量,包括:帧分别与语音模型中预设的文字集中的每个文字之间的相似度,目标关键字为文件集中的一个文字。
例如,若文字集为5200个汉字,则帧对应的终层特征向量包括:帧分别与5200个汉字中的每个汉字的相似度。
基于分别与每个帧对应的终层特征向量,从第二帧序列中确定第二目标帧,包括:根据分别与每个帧对应的终层特征向量,从第二帧序列中选取与目标关键字的相似程度最高的帧作为第二目标帧;其中,帧与目标关键字的相似程度根据帧分别与文字集中的每个文字之间的相似度确定。
为了便于理解,现提供一种基于分别与每个帧对应的终层特征向量,从第二帧序列中选取与目标关键字的相似程度最高的帧作为第二目标帧的方法流程图,请参见图7。
如图7,该方法包括:
S701、从第二帧序列中确定至少一个第一候选帧,第一候选帧与目标关键字的相似度小于第一候选帧与文字集中的至少一个文字的相似度,至少一个文字的个数小于预设数值;
S702、从至少一个第一候选帧中确定至少一个第二候选帧,至少一个第二候选帧为至少一个第一候选帧中与目标关键字的相似度最大的各第一候选帧;
S703、从至少一个第二候选帧中确定第二目标帧,按照相似度从高到低的顺序,第二目标帧与目标关键字的相似度位于第二目标帧与各文字的相似度中的排名,高于除第二目标帧外的每个第二候选帧与目标关键字的相似度位于第二候选帧与各文字的相似度中的排名。
进一步的,为了便于对本申请实施例提供的如图7所示的一种基于分别与每个帧对应的终层特征向量,从第二帧序列中选取与目标关键字的相似程度最高的帧作为第二目标帧的方法的理解,现举例说明:
若第二语音包括的第二帧序列包括四个帧,分别为帧1、帧2、帧3和帧4,语音模型中预设的文字集包括4个文字,分别为文字1、文字2、文字3和文字4,其中文字3为目标关键字。
将第二语音作为语音模型的输入信息输入至语音模型,得到与帧1对应的终层特征向量1、与帧2对应的终层特征向量2、与帧3对应的终层特征向量3,以及与帧4对应的终层特征向量4。
其中,终层特征向量1包括帧1与文字1的相似度11、帧1与文字2的相似度12、帧1与文字3的相似度13和帧1与文字4的相似度14,其中,相似度11为20%、相似度12为30%、相似度13为15%、相似度14为50%;
终层特征向量2包括帧2与文字1的相似度21、帧2与文字2的相似度22、帧2与文字3的相似度23和帧2与文字4的相似度24,其中,相似度21为15%、相似度22为5%、相似度23为65%、相似度24为95%;
终层特征向量3包括帧3与文字1的相似度31、帧3与文字2的相似度32、帧3与文字3的相似度33和帧3与文字4的相似度34,其中,相似度31为10%、相似度32为20%、相似度33为65%、相似度34为30%;
终层特征向量4包括帧4与文字1的相似度41、帧4与文字2的相似度42、帧4与文字3的相似度43和帧4与文字4的相似度44,其中,相似度41为10%、相似度42为20%、相似度43为55%、相似度44为30%。
首先,从第二帧序列中确定至少一个第一候选帧,第一候选帧与目标关键字的相似度小于第一候选帧与文字集中的至少一个文字的相似度,至少一个文字的个数小于预设数值,若预设数值为3时,则说明:从第二帧序列中确定至少一个第一候选帧,具体的,第一候选帧与文字集中的每个文字的相似度按照从大到小的顺序进行排列得到一个序列,第一候选帧与目标关键字的相似度位于此序列的前3位以内(第一候选帧与目标关键字的相似度位于此序列的第1位、第2位或第3位)。此时,从第二帧序列中确定的至少一个第一候选帧包括3个,分别为帧2、帧3和帧4。
从至少一个第一候选帧中确定至少一个第二候选帧:因此时相似度23和相似度33相等,均为65%;相似度43为55%;故从至少一个第一候选帧中确定出的至少一个第二候选帧包括2个,分别为帧2和帧3。
从至少一个第二候选帧中确定第二目标帧:因与帧3对应的相似度33在帧3对应的各个相似度中的排名为第1位;帧2对应的相似度23在帧2对应的各个相似度中的排名为第2位,故选择与第1位对应的帧3作为第二目标帧。
通过上述对本申请实施例提供的一种语音关键词识别方法的详细介绍,使得本申请实施例提供的一种语音关键词识别方法更加清晰、完整,便于本领域技术人员理解。
进一步的,为了便于理解上述实施例提供的一种语音关键词识别方法,下面对此方法进行更具体的详细说明,请参见图8。
如图8所示,该方法包括:
需要注意的是:该方法中对应的第一语音包括的第一帧序列中的每个帧设置有唯一的帧ID,其中,帧在第一帧序列中的序位号即为帧的帧ID。例如, 第一帧序列包括依次排序的三个帧,分别为帧1、帧3和帧2。则,帧1的序位号为1,帧ID为1;帧3的序位号为2,帧ID为2;帧2的序位号为3,帧ID为3。
可选的，语音关键词包括的关键字序列中的每个关键字设置有唯一的关键字ID，其中，关键字在关键字序列中的序位号为关键字的关键字ID。例如，关键词序列包括依次排序的4个关键字，分别为关键字1、关键字3、关键字2和关键字4。则，关键字1的序位号为1，关键字ID为1；关键字3的序位号为2，关键字ID为2；关键字2的序位号为3，关键字ID为3；关键字4的序位号为4，关键字ID为4。
S801、初始化帧ID:n=0;关键字ID:m=1;计算器置零;
S802、i=n++;判断第一语音包括的第一帧序列中的第i个帧的隐层特征向量与语音关键词中的第m个关键字对应关键字模板是否匹配成功;如果匹配成功,执行步骤S803;如果匹配失败,执行步骤S806;
S803、判断当前关键字是否为语音关键词包括的关键词序列中的最后一个关键字;如果是,执行步骤S804;如果否,执行步骤S805;
S804、确定第一语音中包括语音关键词;
S805、设置计数器的计数s为触发初始值;n++;返回执行步骤S802;
可选的,触发初始值即为上述步骤S502中所涉及到的阈值。可选的,触发初始值为30。
以上仅仅是本申请实施例提供的触发初始值的可选方式,发明人可根据自己的需求任意设置触发初始值的具体数值,在此不做限定。
S806、s--;
可选的,s--表示计数器的计数减一。
S807、判断计数器的计数s是否大于0；若是，返回执行步骤S802；若否，返回执行步骤S801。
以上仅仅是本申请实施例提供的一种语音关键词识别方法的可选方式,具体的,发明人可根据自己的需求任意设置本申请实施例提供一种语音关键词识别方法的具体实现方式,在此不做限定。
通过上述对本申请实施例提供的一种语音关键词识别方法的详细介绍,使得本申请实施例提供的一种语音关键词识别方法更加清晰、完整,便于本领域技术人员理解。
上述本发明公开的实施例中详细描述了方法,对于本发明的方法可采用多种形式的装置实现,因此本发明还公开了一种装置,下面给出具体的实施例进行详细说明。
图9为本申请实施例提供的一种语音关键词识别装置的结构示意图。
如图9所示,该装置包括:
第一目标帧确定单元91,用于从构成第一语音的第一帧序列中选取一个第一目标帧;
目标关键字确定单元92,用于从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
匹配单元93,用于若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
识别单元94,用于若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。进一步的,本申请实施例提供的一种语音关键词识别装置还包括:返回执行单元,用于:在匹配失败的情况下,返回执行“从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧”步骤。
本发明实施例提供第一目标帧确定单元91的一种可选结构。
可选的,第一目标帧确定单元91包括:
第一确定单元,用于从所述构成第一语音的第一帧序列中确定第一个从未被确定为第一目标帧的帧;
第二确定单元,用于将所述帧作为从所述构成第一语音的第一帧序列中确 定的第一目标帧。
本发明实施例提供目标关键字确定单元92的一种可选结构。
可选的,目标关键字确定单元92包括:
第三确定单元,用于从所述语音关键词包括的所述关键字序列中,确定与最近一次匹配成功的关键字模板对应的关键字相邻的下一关键字;
第四确定单元,用于若所述下一关键字被连续确定为目标关键字的次数未达到预设的阈值,将所述下一关键字确定为目标关键字;
第五确定单元,用于若所述下一关键字被连续确定为目标关键字的次数达到所述阈值,将所述关键字序列中的第一个关键字确定为目标关键字。
进一步的,本申请实施例提供的一种语音关键词识别装置还包括:关键字模板生成单元。
本发明实施例提供的关键字模板生成单元的一种可选结构,请参见图10。
如图10所示,所述关键字模板生成单元,包括:
第二语音确定单元101,用于确定包括所述目标关键字的第二语音,所述第二语音由第二帧序列构成;
终层特征向量确定单元102,用于将所述第二语音作为预设的语音模型的输入信息,确定分别与所述第二帧序列中的每个帧对应的终层特征向量;
第二目标帧确定单元103,用于根据分别与每个帧对应的终层特征向量,从所述第二帧序列中确定第二目标帧;
关键字模板生成子单元104,用于根据将所述第二目标帧作为所述语音模型的输入信息所得到的与所述第二目标帧对应的隐层特征向量,生成与所述目标关键字对应的关键字模板。
在本申请实施例中,可选地,所述帧对应的终层特征向量,包括:所述帧分别与所述语音模型中预设的文字集中的每个文字之间的相似度,所述目标关键字为所述文件集中的一个文字;所述第二目标帧确定单元,具体用于:基于分别与每个帧对应的终层特征向量,从所述第二帧序列中选取与所述目标关键字的相似程度最高的帧作为第二目标帧;其中,帧与所述目标关键字的相似程度根据所述帧分别与所述文字集中的每个文字之间的相似度确定。
本发明实施例提供第二目标帧确定单元的一种可选结构,请参见图11。
如图11所示,所述第二目标帧确定单元,包括:
第一候选帧确定单元111,用于从所述第二帧序列中确定至少一个第一候选帧,所述第一候选帧与所述目标关键字的相似度小于所述第一候选帧与所述文字集中的至少一个文字的相似度,所述至少一个文字的个数小于预设数值;
第二候选帧确定单元112,用于从所述至少一个第一候选帧中确定至少一个第二候选帧,所述至少一个第二候选帧为所述至少一个第一候选帧中与所述目标关键字的相似度最大的各第一候选帧;
第二目标帧确定子单元113,用于从所述至少一个第二候选帧中确定第二目标帧,按照相似度从高到低的顺序,所述第二目标帧与所述目标关键字的相似度位于所述第二目标帧与各文字的相似度中的排名,高于除所述第二目标帧外的每个所述第二候选帧与所述目标关键字的相似度位于所述第二候选帧与各文字的相似度中的排名。
综上:
本发明实施例公开了一种语音关键词识别方法、装置、终端及服务器,通过从构成第一语音的第一帧序列中确定第一目标帧;从语音关键词包括的关键字序列中确定目标关键字;在确定目标帧的隐层特征向量与目标关键字对应的关键字模板匹配成功时(关键字模板指示包括目标关键字的第二语音中的第二目标帧的隐层特征向量),若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于第一语音中的帧的隐层特征向量与其匹配成功,确定第一语音中包括语音关键词的方式,有效实现了对第一语音中的语音关键词的识别。进一步的,便于使用语音唤醒技术的电子设备在识别出第一语音中包括语音关键词时,自动激活与所述语音关键词相应的处理模块。
本说明书中各个实施例采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似部分互相参见即可。对于实施例公开的装置而言,由于其与实施例公开的方法相对应,所以描述的比较简单,相关之处参见方法部分说明即可。
专业人员还可以进一步意识到,结合本文中所公开的实施例描述的各示例 的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本发明的范围。
结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块,或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。
对所公开的实施例的上述说明,使本领域专业技术人员能够实现或使用本发明。对这些实施例的多种修改对本领域的专业技术人员来说将是显而易见的,本文中所定义的一般原理可以在不脱离本发明的精神或范围的情况下,在其它实施例中实现。因此,本发明将不会被限制于本文所示的这些实施例,而是要符合与本文所公开的原理和新颖特点相一致的最宽的范围。

Claims (18)

  1. 一种语音关键词识别方法,其特征在于,包括:
    从构成第一语音的第一帧序列中选取一个第一目标帧;
    从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
    若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
    若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
  2. 根据权利要求1所述的方法,其特征在于,在匹配失败的情况下,所述方法还包括:
    返回执行所述从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧的步骤。
  3. 根据权利要求2所述的方法,其特征在于,所述从构成第一语音的第一帧序列中选取一个第一目标帧,包括:
    从所述构成第一语音的第一帧序列中确定第一个从未被确定为第一目标帧的帧;
    将所述帧作为从所述构成第一语音的第一帧序列中确定的第一目标帧。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述从关键字序列中选取一个关键字确定为目标关键字,包括:
    从所述语音关键词包括的所述关键字序列中,确定与最近一次匹配成功的关键字模板对应的关键字相邻的下一关键字;
    若所述下一关键字被连续确定为目标关键字的次数未达到预设的阈值,则将所述下一关键字确定为目标关键字;
    若所述下一关键字被连续确定为目标关键字的次数达到所述阈值,则将所述关键字序列中的第一个关键字确定为目标关键字。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述关键字模板的生成过程包括:
    确定包括所述目标关键字的第二语音,所述第二语音由第二帧序列构成;
    将所述第二语音作为预设的语音模型的输入信息,确定分别与所述第二帧序列中的每个帧对应的终层特征向量;
    根据分别与每个帧对应的终层特征向量,从所述第二帧序列中确定第二目标帧;
    根据将所述第二目标帧作为所述语音模型的输入信息所得到的与所述第二目标帧对应的隐层特征向量,生成与所述目标关键字对应的关键字模板。
  6. 根据权利要求5所述的方法,其特征在于,所述帧对应的终层特征向量,包括:所述帧分别与所述语音模型中预设的文字集中的每个文字之间的相似度,所述目标关键字为所述文件集中的一个文字;
    所述根据分别与每个帧对应的终层特征向量,从所述第二帧序列中确定第二目标帧,包括:
    根据分别与每个帧对应的终层特征向量,从所述第二帧序列中选取与所述目标关键字的相似程度最高的帧作为第二目标帧;其中,帧与所述目标关键字的相似程度根据所述帧分别与所述文字集中的每个文字之间的相似度确定。
  7. 根据权利要求6所述的方法,其特征在于,所述根据分别与每个帧对应的终层特征向量,从所述第二帧序列中选取与所述目标关键字的相似程度最高的帧作为第二目标帧,包括:
    从所述第二帧序列中确定至少一个第一候选帧,所述第一候选帧与所述目标关键字的相似度小于所述第一候选帧与所述文字集中的至少一个文字的相似度,所述至少一个文字的个数小于预设数值;
    从所述至少一个第一候选帧中确定至少一个第二候选帧,所述至少一个第二候选帧为所述至少一个第一候选帧中与所述目标关键字的相似度最大的各第一候选帧;
    从所述至少一个第二候选帧中确定第二目标帧,按照相似度从高到低的顺序,所述第二目标帧与所述目标关键字的相似度位于所述第二目标帧与各文字的相似度中的排名,高于除所述第二目标帧外的每个所述第二候选帧与所述目 标关键字的相似度位于所述第二候选帧与各文字的相似度中的排名。
  8. 一种语音关键词识别装置,其特征在于,包括:
    第一目标帧确定单元,用于从构成第一语音的第一帧序列中选取一个第一目标帧;
    目标关键字确定单元,用于从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
    匹配单元,用于若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
    识别单元,用于若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
  9. 根据权利要求8所述的装置,其特征在于,还包括:返回执行单元,用于:在匹配失败的情况下,返回执行所述从构成第一语音的第一帧序列中选取一个帧确定为第一目标帧的步骤。
  10. 根据权利要求9所述的装置,其特征在于,所述第一目标帧确定单元,包括:
    第一确定单元,用于从所述构成第一语音的第一帧序列中确定第一个从未被确定为第一目标帧的帧;
    第二确定单元,用于将所述帧作为从所述构成第一语音的第一帧序列中确定的第一目标帧。
  11. 根据权利要求8至10中任一项所述的装置,其特征在于,所述目标关键字确定单元,包括:
    第三确定单元,用于从所述语音关键词包括的所述关键字序列中,确定与最近一次匹配成功的关键字模板对应的关键字相邻的下一关键字;
    第四确定单元,用于若所述下一关键字被连续确定为目标关键字的次数未达到预设的阈值,将所述下一关键字确定为目标关键字;
    第五确定单元,用于若所述下一关键字被连续确定为目标关键字的次数达到所述阈值,将所述关键字序列中的第一个关键字确定为目标关键字。
  12. 根据权利要求8至11中任一项所述的装置,其特征在于,还包括关键字模板生成单元,所述关键字模板生成单元,包括:
    第二语音确定单元,用于确定包括所述目标关键字的第二语音,所述第二语音由第二帧序列构成;
    终层特征向量确定单元,用于将所述第二语音作为预设的语音模型的输入信息,确定分别与所述第二帧序列中的每个帧对应的终层特征向量;
    第二目标帧确定单元,用于根据分别与每个帧对应的终层特征向量,从所述第二帧序列中确定第二目标帧;
    关键字模板生成子单元,用于根据将所述第二目标帧作为所述语音模型的输入信息所得到的与所述第二目标帧对应的隐层特征向量,生成与所述目标关键字对应的关键字模板。
  13. 根据权利要求12所述的装置,其特征在于,所述帧对应的终层特征向量,包括:所述帧分别与所述语音模型中预设的文字集中的每个文字之间的相似度,所述目标关键字为所述文件集中的一个文字;
    所述第二目标帧确定单元,具体用于:根据分别与每个帧对应的终层特征向量,从所述第二帧序列中选取与所述目标关键字的相似程度最高的帧作为第二目标帧;其中,帧与所述目标关键字的相似程度根据所述帧分别与所述文字集中的每个文字之间的相似度确定。
  14. 根据权利要求13所述的装置,其特征在于,所述第二目标帧确定单元,包括:
    第一候选帧确定单元,用于从所述第二帧序列中确定至少一个第一候选帧,所述第一候选帧与所述目标关键字的相似度小于所述第一候选帧与所述文字集中的至少一个文字的相似度,所述至少一个文字的个数小于预设数值;
    第二候选帧确定单元,用于从所述至少一个第一候选帧中确定至少一个第二候选帧,所述至少一个第二候选帧为所述至少一个第一候选帧中与所述目标关键字的相似度最大的各第一候选帧;
    第二目标帧确定子单元,用于从所述至少一个第二候选帧中确定第二目标 帧,按照相似度从高到低的顺序,所述第二目标帧与所述目标关键字的相似度位于所述第二目标帧与各文字的相似度中的排名,高于除所述第二目标帧外的每个所述第二候选帧与所述目标关键字的相似度位于所述第二候选帧与各文字的相似度中的排名。
  15. 一种终端,其特征在于,包括存储器和处理器,所述存储器用于存储程序,所述处理器调用所述程序,所述程序用于:
    从构成第一语音的第一帧序列中选取一个第一目标帧;
    从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
    若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
    若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
  16. 一种语音关键词识别服务器,其特征在于,包括存储器和处理器,所述存储器用于存储程序,所述处理器调用所述程序,所述程序用于:
    从构成第一语音的第一帧序列中选取一个第一目标帧;
    从关键字序列中选取一个关键字确定为目标关键字,其中,所述关键字序列属于所述语音关键词;
    若所述第一目标帧的隐层特征向量与所述目标关键字对应的关键字模板匹配成功,则逐一针对关键字序列中的每个关键字对应的关键字模板,确定位于所述第一语音中的帧的隐层特征向量是否匹配,其中,所述关键字模板指示包括所述目标关键字的第二语音中的第二目标帧的隐层特征向量;
    若逐一针对关键字序列中的每个关键字对应的关键字模板,均已确定出位于所述第一语音中的帧的隐层特征向量与其匹配成功,则确定所述第一语音中包括所述语音关键词。
  17. 一种计算机可读存储介质,包括指令,当其在计算机上运行时,使得 计算机执行如权利要求1至7中任一项所述的方法。
  18. 一种包含指令的计算机程序产品,当其在计算机上运行时,使得计算机执行如权利要求1至7任一项所述的方法。
PCT/CN2018/079769 2017-05-27 2018-03-21 一种语音关键词识别方法、装置、终端及服务器 WO2018219023A1 (zh)
