WO2014117645A1 - Information recognition method and apparatus - Google Patents

Information recognition method and apparatus Download PDF

Info

Publication number
WO2014117645A1
Authority
WO
WIPO (PCT)
Prior art keywords
command word
information
voice
voice information
command
Prior art date
Application number
PCT/CN2014/070489
Other languages
English (en)
French (fr)
Inventor
蒋洪睿
王细勇
梁俊斌
郑伟军
周均扬
Original Assignee
Huawei Device Co., Ltd. (华为终端有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Device Co., Ltd.
Priority to EP14745447.4A priority Critical patent/EP2869298A4/en
Publication of WO2014117645A1 publication Critical patent/WO2014117645A1/zh
Priority to US14/585,959 priority patent/US9390711B2/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command

Definitions

  • This application claims priority to Chinese Patent Application No. 201310034262.5, filed with the Chinese Patent Office on January 29, 2013 and entitled "Information recognition method and apparatus", which is incorporated herein by reference in its entirety.
  • The present invention relates to the field of information technology, and in particular to a method and an apparatus for recognizing information.
  • Speech recognition technology converts human speech input into computer instructions, enabling natural human-computer interaction. At present, with the development of speech recognition technology, many terminals can implement functions such as voice dialing, voice navigation, voice control, voice retrieval, and simple dictation.
  • In the prior art, after receiving input voice information, the terminal may send the voice information to a cloud server through the network, and the cloud server completes the recognition of the voice information. However, when a cloud server is used to recognize voice information, the user needs to upload some personal information to the cloud server, which reduces the security of the user's information. In addition, the interaction between the terminal and the cloud server requires a network connection, which limits voice recognition, consumes the user's network traffic, and increases the latency of voice recognition when the network is congested, degrading the user experience.
  • Embodiments of the present invention provide a method and an apparatus for recognizing information, by which a terminal can split voice information into command words according to a two-command-word-slot or multi-command-word-slot recognition grammar and recognize the operation instruction corresponding to the voice information according to at least one of the split command words. With the same number of command words, more voice input content can be recognized, improving the user experience.
  • In a first aspect, an embodiment of the present invention provides a method for recognizing information, the method including: receiving voice information and extracting voice features from the voice information; matching the voice features against the phoneme string corresponding to each of a plurality of candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and recognizing, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
  • In a first possible implementation, matching the voice features against the phoneme string corresponding to each of the plurality of candidate texts specifically includes: performing a phoneme-distance calculation between the voice features and the phoneme string corresponding to each of the plurality of candidate texts to obtain distance values; and selecting, as the recognition result, the candidate text whose phoneme string has the smallest distance value to the voice features.
  • With reference to the first aspect, in a second possible implementation, recognizing the operation instruction corresponding to the voice information according to the label corresponding to the at least one command word specifically includes: recognizing the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word.
  • With reference to the first aspect, in a third possible implementation, before receiving the voice information and extracting the voice features from it, the method further includes: selecting command words in multiple command word slots according to the recognition grammar network to generate the plurality of candidate texts.
  • With reference to the second possible implementation of the first aspect, in a fourth possible implementation, recognizing the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to that combination of labels.
  • In a second aspect, an embodiment of the present invention provides an information recognition apparatus, the apparatus including: a receiving unit, configured to receive voice information, extract voice features from the voice information, and transmit the voice features to a matching unit; and a matching unit, configured to receive the voice features transmitted by the receiving unit, match the voice features against the phoneme string corresponding to each of a plurality of candidate texts, and obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word, and transmit the label to a recognition unit;
  • and a recognition unit, configured to receive the label transmitted by the matching unit and recognize, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
  • In a first possible implementation, the matching unit is specifically configured to: perform a phoneme-distance calculation between the voice features and the phoneme string corresponding to each of the plurality of candidate texts to obtain distance values; and select, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
  • With reference to the second aspect, in a second possible implementation, each command word in the at least one command word is identified by a label, and the recognition unit is specifically configured to: recognize the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word.
  • With reference to the second aspect, in a third possible implementation, the apparatus further includes: a generating unit, configured to select command words in the multiple command word slots according to the recognition grammar network to generate the plurality of candidate texts.
  • With reference to the second aspect or the second possible implementation of the second aspect, in a fourth possible implementation, the recognition unit is specifically configured to: combine the labels corresponding to each command word in the at least one command word in the recognition result, and query a local database or a network server for the operation instruction corresponding to that combination of labels.
  • In the embodiments of the present invention, the terminal receives voice information and extracts voice features from it; matches the voice features against the phoneme string corresponding to each of a plurality of candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and recognizes, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information. The terminal thus splits the voice information into command words according to a two-command-word-slot or multi-command-word-slot recognition grammar and recognizes the corresponding operation instruction according to at least one of the split command words. With the same number of command words, more voice input content can be recognized, improving the user experience.
  • FIG. 1 is a flowchart of a method for identifying information according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of an apparatus for identifying information according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of a terminal according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for identifying information according to an embodiment of the present invention.
  • The execution subject of this embodiment is a terminal. The following describes in detail how the terminal recognizes an operation instruction after receiving the user's voice input. As shown in FIG. 1, this embodiment includes the following steps:
  • Step 101: Receive voice information, and extract voice features from the voice information.
  • Before the user performs voice input, the terminal is first set to the voice-information input state according to a received user operation instruction, and the voice recognition engine is run; the recognition grammar can then generate the candidate texts.
  • After receiving the voice information, the terminal converts it into digital information and extracts the corresponding voice features from the digital information.
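As a minimal, hypothetical sketch of this step (the patent does not specify the feature type), the digitized samples can be framed and reduced to per-frame log-energy features; real recognition engines typically extract richer acoustic features such as MFCCs:

```python
import math

def extract_features(samples, frame_len=160):
    """Split digitized audio into frames and compute per-frame log energy.

    A toy stand-in for the patent's unspecified "voice features";
    frame_len=160 corresponds to 20 ms at an assumed 8 kHz sample rate.
    """
    feats = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # avoid log(0) on silence
    return feats

print(len(extract_features([0.0] * 800)))  # 800 samples / 160 per frame -> 5
```

Each returned value stands in for one voice feature that the subsequent matching step compares against phoneme models.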
  • Step 102: Match the voice features against the phoneme string corresponding to each of the plurality of candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word.
  • In this embodiment, the terminal provides a multi-command-word-slot recognition grammar. Compared with the existing "action + object" grammar, the multi-command-word-slot recognition grammar splits the action part into several parts and supports more voice input content through different combinations of those parts, so that for the same amount of recognizable voice input, the number of command words required in the grammar can be reduced. For example, among the user's commonly used voice inputs, such as "make a call", "help me make a call", "call", and "help me call", some parts (such as "make a call" and "call") are shared by several voice inputs.
  • Specifically, the action part in the existing recognition grammar can be split into two or more levels, for example into three levels, where the first-level command word can be a modifier command word, the second-level command word can be a willingness command word, and the third-level command word can be an action command word. Therefore, before the voice recognition engine receives voice input, this embodiment of the present invention further includes: selecting command words in the multiple command word slots according to the recognition grammar network to generate the plurality of candidate texts.
  • Specifically, the multi-command-word-slot recognition grammar selects one command word from each command word slot of the multiple command word slots (a given command word slot may also contribute no command word), and then combines the selected command words to obtain the candidate texts.
  • For example, in a recognition grammar with three levels of command word slots, the modifier command word slot contains the two command words "now" and "please", the willingness command word slot contains the two command words "help me" and "I want", and the action command word slot contains the two command words "make a call" and "call". This recognition grammar can construct 26 candidate texts, one for each non-empty combination of the three slots, such as "now", "please", "help me", "I want", "now help me", "please I want", "help me make a call", "I want to call", "now help me make a call", and "please I want to call". Of course, to accomplish a complete operation, a candidate text generally must contain an action command word.
  • If an action command word is required, the recognition grammar constructs 18 candidate texts, such as "make a call", "call", "help me make a call", "I want to call", "now make a call", "please call", "now help me make a call", "now I want to call", "please help me make a call", and "please I want to call". Therefore, the terminal can construct multiple candidate texts according to the multi-command-word-slot recognition grammar. By using a recognition grammar with multi-level command word slots, more candidate texts can be constructed from the same number of command words, and accordingly more voice input content can be recognized.
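The slot-combination scheme described above can be sketched as follows. The slot contents are English glosses of the example command words, the empty string models a slot that contributes no word, and the counts reproduce the 26 and 18 candidate texts of the example:

```python
from itertools import product

# Three-level command word slots from the example above, using English
# glosses of the command words. An empty string models a slot that
# contributes no command word.
modifier_slot = ["now", "please", ""]         # modifier command words
willingness_slot = ["help me", "I want", ""]  # willingness command words
action_slot = ["make a call", "call", ""]     # action command words

def candidate_texts(require_action=False):
    """Combine one choice per slot into candidate texts.

    require_action=True models the note that a complete operation
    generally must contain an action command word.
    """
    texts = []
    for m, w, a in product(modifier_slot, willingness_slot, action_slot):
        if require_action and not a:
            continue
        words = [part for part in (m, w, a) if part]
        if words:  # skip the combination where every slot is empty
            texts.append(" ".join(words))
    return texts

print(len(candidate_texts()))                     # 26 candidate texts
print(len(candidate_texts(require_action=True)))  # 18 candidate texts
```

With three entries per slot (two words plus "skip"), all non-empty combinations give 3 x 3 x 3 - 1 = 26 texts, and requiring an action word gives 3 x 3 x 2 = 18.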
  • Specifically, matching the voice features against the phoneme string corresponding to each of the plurality of candidate texts includes: performing a phoneme-distance calculation between the voice features and the phoneme string corresponding to each of the plurality of candidate texts to obtain distance values; and selecting, as the recognition result, the candidate text whose phoneme string has the smallest distance value to the voice features.
  • Specifically, the voice features are matched against the phoneme model corresponding to each phoneme in the phoneme string of each candidate text, yielding a distance value between each voice feature and each phoneme. The distance values between the voice features and the phoneme models of one phoneme string are accumulated to obtain that string's accumulated distance value, and the candidate text corresponding to the phoneme string with the smallest accumulated distance value is the recognition result.
  • The phoneme string corresponding to each candidate text consists of a series of phonemes, each phoneme corresponding to one phoneme model. Matching one voice feature against a phoneme model yields one value; matching the entire phoneme string against the voice features yields an accumulated value, and the candidate text whose phoneme string has the smallest accumulated distance to the voice features is selected as the recognition result. For example, for the candidate text "call Zhang San" (打电话给张三), the corresponding phoneme string is "d", "a", "d", "ian", "h", "ua", "g", "ei", "zh", "ang", "s", "an" (the Pinyin initials and finals of the Chinese text).
  • Because a phoneme model captures group statistical features while a user's voice features are individual features, there is an error between them. This error is the distance between the voice feature and the phoneme model; accumulating the errors between all the phonemes and the voice features yields the recognition distance between the voice features and the candidate text.
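The accumulated-distance selection described above can be illustrated with a toy sketch. Every numeric value here is hypothetical: each phoneme model is reduced to one scalar and the distance to an absolute difference, whereas a real engine compares acoustic feature vectors against statistical phoneme models and aligns frames to phonemes dynamically:

```python
# Hypothetical scalar "phoneme models"; real systems use statistical
# acoustic models (e.g. HMM/GMM) over feature vectors.
PHONEME_MODEL = {"d": 0.1, "ian": 0.9, "h": 0.3, "ua": 0.7,
                 "zh": 0.8, "ang": 0.5, "s": 0.2, "an": 0.4}

def accumulated_distance(features, phoneme_string):
    """Accumulate per-phoneme distances between voice features and models."""
    return sum(abs(f - PHONEME_MODEL[p]) for f, p in zip(features, phoneme_string))

def recognize(features, candidates):
    """Return the candidate text whose phoneme string is closest to the features."""
    return min(candidates, key=lambda text: accumulated_distance(features, candidates[text]))

# Hypothetical candidate texts with (shortened) phoneme strings.
candidates = {
    "call Zhang San": ["d", "ian", "h", "ua"],
    "call Li Si": ["zh", "ang", "s", "an"],
}
features = [0.1, 0.85, 0.3, 0.75]  # toy voice-feature values, one per phoneme
print(recognize(features, candidates))  # -> call Zhang San
```

The candidate with the smallest accumulated distance wins, mirroring the selection rule in the text.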
  • In this embodiment, the recognition result includes at least one command word, and each command word is identified by a label.
  • For example, the recognition result "I want to call Zhang San" includes two command words, "I want" and "call", as well as a contact object, "Zhang San". The label of the command word "I want" is "0001", from which it can be known that the word is a willingness command word; the label of the command word "call" is "0011", from which it can be known that the operation corresponding to the voice input is making a call; and the label corresponding to "Zhang San" is "1000", from which it can be determined that this item is contact information.
  • Optionally, the terminal may choose not to store the multi-command-word-slot recognition grammar locally, and instead obtain the recognition grammar from a network server when it is needed.
  • Step 103: Recognize, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
  • Specifically, recognizing the operation instruction corresponding to the voice information according to the label corresponding to the at least one command word includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to that combination of labels.
  • Specifically, the recognition grammar determines the content and the label of each part of a candidate text when the candidate text is generated, so each component of the recognition result also has a determined label.
  • Specifically, the terminal may determine the operation instruction corresponding to the labels according to a mapping, stored in the local database or on a network server, between label combinations and operation instructions.
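The label-based lookup of step 103 can be sketched as a table keyed by the combination of labels. The label values "0001" and "0011" follow the example above; the table contents, including the SEND_SMS entry, are hypothetical stand-ins for the local database or network server:

```python
# Hypothetical mapping from label combinations to operation instructions.
OPERATION_TABLE = {
    ("0001", "0011"): "CALL",      # "I want" + "call" -> place a call
    ("0001", "0100"): "SEND_SMS",  # hypothetical texting action label
}

def identify_operation(command_words):
    """command_words: (word, label) pairs taken from the recognition result."""
    label_combo = tuple(label for _word, label in command_words)
    return OPERATION_TABLE.get(label_combo)  # None if no instruction matches

result = [("I want", "0001"), ("call", "0011")]
print(identify_operation(result))  # -> CALL
```

Keying the lookup on the whole label combination, rather than on a single command word, is what lets the split command words of the multi-slot grammar share one table entry per operation.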
  • In this embodiment of the present invention, the terminal receives voice information, extracts voice features from it, matches the voice features against the phoneme string corresponding to each of a plurality of candidate texts, and obtains a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; the terminal then recognizes, according to the label, the operation instruction corresponding to the voice information. The terminal thus splits the voice information into command words according to a two-command-word-slot or multi-command-word-slot recognition grammar and recognizes the corresponding operation instruction according to at least one of the split command words. With the same number of command words, more voice input content can be recognized, improving the user experience.
  • FIG. 2 is a schematic diagram of an apparatus for identifying information according to an embodiment of the present invention.
  • the embodiment of the present invention includes the following units:
  • the receiving unit 201 is configured to receive voice information, extract voice features from the voice information, and transmit the voice features to the matching unit.
  • Specifically, after receiving the voice information, the terminal converts it into digital information and extracts the corresponding voice features from the digital information.
  • The matching unit 202 is configured to receive the voice features transmitted by the receiving unit, match the voice features against the phoneme string corresponding to each of a plurality of candidate texts, and obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; the label is transmitted to the recognition unit.
  • In this embodiment, the terminal provides a multi-command-word-slot recognition grammar. Compared with the existing "action + object" grammar, the multi-command-word-slot recognition grammar splits the action part into several parts and supports recognizing more voice input content through different combinations of those parts, so that for the same amount of recognizable voice input the number of command words required in the grammar can be reduced; the multi-command-word-slot recognition grammar is also easier to maintain and extend.
  • Specifically, the action part in the existing recognition grammar can be split into two or more levels, for example into three levels, where the first-level command word can be a modifier command word, the second-level command word can be a willingness command word, and the third-level command word can be an action command word.
  • Optionally, the matching unit 202 is specifically configured to: perform a phoneme-distance calculation between the voice features and the phoneme string corresponding to each of the plurality of candidate texts to obtain distance values; and select, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features. Specifically, the voice features are matched against the phoneme string of each candidate text, and the candidate text corresponding to the phoneme string with the smallest distance value is the recognition result.
  • In this embodiment, the recognition result includes at least one command word, and each command word is identified by a label.
  • For example, the recognition result "I want to call Zhang San" includes two command words, "I want" and "call", as well as a contact object, "Zhang San". The label of the command word "call" indicates that the operation corresponding to the voice input is making a call, and the label "1000" corresponding to "Zhang San" indicates that this item is contact information.
  • Optionally, the terminal may choose not to store the multi-command-word-slot recognition grammar locally, and instead obtain the recognition grammar from a network server when it is needed.
  • The recognition unit 203 is configured to receive the label transmitted by the matching unit, and recognize, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
  • Optionally, the recognition unit 203 is specifically configured to: recognize the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word. Further, the recognition unit 203 is specifically configured to: combine the labels corresponding to each command word in the at least one command word in the recognition result, and query a local database or a network server for the operation instruction corresponding to that combination of labels.
  • Specifically, the recognition grammar determines the content and the label of each part of a candidate text when the candidate text is generated, so each component of the recognition result also has a determined label. The terminal may determine the operation instruction corresponding to the labels according to a mapping, stored in the local database or on a network server, between label combinations and operation instructions.
  • Optionally, this embodiment of the present invention further includes: a generating unit 204, configured to select command words in the multiple command word slots according to the recognition grammar network to generate the plurality of candidate texts. The terminal can thus construct multiple candidate texts according to the multi-command-word-slot recognition grammar; by using a recognition grammar with multi-level command word slots, more candidate texts can be constructed from the same number of command words, and accordingly more voice input content can be recognized.
  • Thus, the terminal splits the voice information into command words according to a two-command-word-slot or multi-command-word-slot recognition grammar and recognizes the operation instruction corresponding to the voice information according to at least one of the split command words. With the same number of command words, more voice input content can be recognized, improving the user experience.
  • FIG. 3 is a schematic diagram of a terminal according to an embodiment of the present invention.
  • The terminal in this embodiment includes a network interface 301, a processor 302, and a memory 303.
  • System bus 304 is used to connect network interface 301, processor 302, and memory 303.
  • the network interface 301 is used to communicate with other terminals or network servers.
  • The memory 303 may be persistent storage, such as a hard disk drive or flash memory, which stores the recognition grammar, software modules, and device drivers. The software modules are the functional modules that carry out the method of the present invention described above; the device drivers may be network and interface drivers; and the recognition grammar is used to generate candidate texts and to determine the recognition result corresponding to the voice input content. At startup, the recognition grammar and the software components are loaded into the memory 303 and are then accessed and executed by the processor 302 to perform the following instructions:
  • receive voice information and extract voice features from the voice information;
  • A multi-command-word-slot recognition grammar can be saved in the memory 303 of the terminal. Compared with the existing "action + object" recognition grammar, the multi-command-word-slot recognition grammar splits the action part into several parts and supports more voice input content through different combinations of those parts, so that for the same amount of recognizable voice input the number of command words required in the grammar can be reduced. For example, among the user's commonly used voice inputs, such as "make a call", "help me make a call", "call", and "help me call", some parts (such as "make a call" and "call") are shared by several voice inputs. Specifically, the action part in the existing recognition grammar can be split into two or more levels, for example into three levels, where the first-level command word can be a modifier command word, the second-level command word can be a willingness command word, and the third-level command word can be an action command word.
  • Further, the instruction for performing the matching calculation between the voice features and the plurality of candidate texts is: perform a phoneme-distance calculation between the voice features and the phoneme string corresponding to each of the plurality of candidate texts to obtain distance values, and select, as the recognition result, the candidate text whose phoneme string has the smallest distance value to the voice features.
  • Each command word in the at least one command word is identified by a label. Further, after accessing the software components in the memory, the processor 302 performs the process of recognizing, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information, which specifically includes: recognizing the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word.
  • Further, the process performed by the processor 302 of recognizing the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all the command words in the at least one command word includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to that combination of labels.
  • Optionally, the processor 302 may also access the software components and execute the following instruction: select command words in multiple command word slots according to the recognition grammar network to generate the plurality of candidate texts.
  • Thus, the terminal splits the voice information into command words according to a two-command-word-slot or multi-command-word-slot recognition grammar and recognizes the operation instruction corresponding to the voice information according to at least one of the split command words. With the same number of command words, more voice input content can be recognized, improving the user experience.
  • The storage medium may be a RAM (random access memory), a ROM (read-only memory), an EPROM (electrically programmable ROM), an EEPROM (electrically erasable programmable ROM), a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method and an apparatus for recognizing information. The method includes: a terminal receives voice information and extracts voice features from the voice information (101); the voice features are matched against the phoneme string corresponding to each of a plurality of candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word (102); and the operation instruction corresponding to the voice information is recognized according to the label corresponding to the at least one command word (103). The terminal thereby recognizes, as an operation instruction, the text information corresponding to the voice information input by the user.

Description

Information recognition method and apparatus
This application claims priority to Chinese Patent Application No. 201310034262.5, filed with the Chinese Patent Office on January 29, 2013 and entitled "Information recognition method and apparatus", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of information technology, and in particular to an information recognition method and apparatus.
Background
Speech recognition is a technology that converts human voice input into computer instructions and enables natural human-machine interaction. At present, with the development of speech recognition technology, many terminals can implement functions such as voice dialing, voice navigation, voice control, voice retrieval, and simple dictation input.
In the prior art, after receiving input voice information, a terminal can send the voice information over a network to a cloud server, and the cloud server completes the recognition of the voice information. However, recognizing voice information with a cloud server requires the user to upload some personal information to it, which reduces the security of user information. In addition, the interaction between the terminal and the cloud server requires a network, which limits the applicability of speech recognition and consumes the user's network traffic; when the network is congested it also increases the latency of speech recognition and degrades the user experience.
Summary
Embodiments of the present invention provide an information recognition method and apparatus, with which a terminal splits voice information into command words according to a two-slot or multi-slot command word recognition grammar and identifies the operation instruction corresponding to the voice information according to the at least one command word obtained by the splitting. This provides an information recognition method in which the same number of command words can recognize more voice input content, improving the user experience.
According to a first aspect, an embodiment of the present invention provides an information recognition method, the method including: receiving voice information, and extracting voice features from the voice information;
matching the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and
identifying, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
In a first possible implementation, the matching of the voice features against the phoneme string corresponding to each of the multiple candidate texts specifically includes: performing phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and selecting, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
With reference to the first aspect, in a second possible implementation, the identifying, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information specifically includes: identifying the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word.
With reference to the first aspect, in a third possible implementation, before the receiving voice information and extracting voice features from the voice information, the method further includes: selecting command words from multiple command word slots according to a recognition grammar network to generate the multiple candidate texts.
With reference to the second possible implementation of the first aspect, in a fourth possible implementation, the identifying the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to the combination of labels.
According to a second aspect, an embodiment of the present invention provides an information recognition apparatus, the apparatus including: a receiving unit, configured to receive voice information, extract voice features from the voice information, and transmit the voice features to a matching unit; the matching unit, configured to receive the voice features transmitted by the receiving unit, match the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word, and transmit the label to a recognition unit; and
the recognition unit, configured to receive the label transmitted by the matching unit and identify, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
In a first possible implementation, the matching unit is specifically configured to: perform phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and select, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
With reference to the second aspect, in a second possible implementation, each command word in the at least one command word is identified by a label; the recognition unit is specifically configured to: identify the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word.
With reference to the second aspect, in a third possible implementation, the apparatus further includes: a generation unit, configured to select command words from multiple command word slots according to a recognition grammar network to generate the multiple candidate texts.
With reference to the second aspect or the second possible implementation of the second aspect, in a fourth possible implementation, the recognition unit is specifically configured to: combine the labels corresponding to each command word in the at least one command word in the recognition result, and query a local database or a network server for the operation instruction corresponding to the combination of labels.
In the embodiments of the present invention, a terminal receives voice information and extracts voice features from the voice information; matches the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and identifies, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information. The terminal thus splits the voice information into command words according to a two-slot or multi-slot command word recognition grammar and identifies the operation instruction corresponding to the voice information according to the at least one command word obtained by the splitting. This provides an information recognition method in which the same number of command words can recognize more voice input content, improving the user experience.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an information recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an information recognition apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a terminal according to an embodiment of the present invention.
Detailed Description
The following further describes the technical solutions of the present invention in detail with reference to the accompanying drawings and embodiments. FIG. 1 is a flowchart of an information recognition method according to an embodiment of the present invention. The method in this embodiment is executed by a terminal, and the embodiment describes in detail how the terminal identifies an operation instruction after receiving the user's voice input. As shown in FIG. 1, this embodiment includes the following steps:
Step 101: Receive voice information, and extract voice features from the voice information.
Naturally, before voice input on the terminal, the terminal is first set to the voice information input state according to a received user operation instruction, and the speech recognition engine is run. When the speech recognition engine runs, the recognition grammar can generate candidate texts.
After receiving the voice information, the terminal converts the voice information into digital information and extracts the corresponding voice features from the digital information.
Step 102: Match the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word.
To implement the technical solution of the present invention, in practical applications the terminal provides a multi-command-word-slot recognition grammar. Compared with the existing "action + object" recognition grammar, the multi-slot grammar splits the action part into different parts and supports recognizing more voice input content through different combinations of those parts, so that for the same amount of voice input content the recognition grammar needs to provide fewer command words. For example, among voice inputs commonly used by users, such as "拨打电话给" ("call"), "帮我拨打电话给" ("call ... for me"), "呼叫" ("dial"), and "帮我呼叫" ("dial ... for me"), some wording is shared by several voice inputs, such as "拨打电话给" and "呼叫". With the existing grammar structure, these four voice inputs require 4 command elements, whereas with the multi-slot recognition grammar only one first-level command word "帮我" ("for me") and two second-level command words "拨打电话给" and "呼叫" are needed, i.e. 3 command words in total. This saves command words, and the multi-slot recognition grammar is also easier to maintain and extend.
In the recognition grammar provided by this embodiment of the present invention, the action part of the existing recognition grammar can be split into two or more levels, for example three levels, where the first-level command words may be modifier command words, the second-level command words may be intention command words, and the third-level command words may be action command words. Therefore, when the speech recognition engine runs, before voice input is received, this embodiment further includes: selecting command words from multiple command word slots according to the recognition grammar network to generate the multiple candidate texts. Specifically, the multi-slot recognition grammar can select one command word from each of the multiple command word slots (for a given slot it may also select none) and then combine the selected command words to obtain a candidate text. For example, in a three-level slot grammar, the modifier slot contains the two command words "现在" ("now") and "请" ("please"), the intention slot contains the two command words "帮我" ("help me") and "我要" ("I want to"), and the action slot contains the two command words "拨打电话给" and "呼叫". Without any constraints, the recognition grammar can construct 26 candidate texts: "现在", "请", "帮我", "我要", "现在帮我", "现在我要", "请帮我", "请我要", "帮我拨打电话给", "我要拨打电话给", "帮我呼叫", "我要呼叫", "现在拨打电话给", "现在呼叫", "请拨打电话给", "请呼叫", "现在帮我拨打电话给", "现在帮我呼叫", "现在我要拨打电话给", "现在我要呼叫", "请帮我拨打电话给", "请帮我呼叫", "请我要拨打电话给", "请我要呼叫", "拨打电话给", "呼叫". Of course, to perform a complete operation a candidate text generally must contain an action command word; with this constraint the recognition grammar can construct 18 candidate texts: "帮我拨打电话给", "我要拨打电话给", "帮我呼叫", "我要呼叫", "现在拨打电话给", "现在呼叫", "请拨打电话给", "请呼叫", "现在帮我拨打电话给", "现在帮我呼叫", "现在我要拨打电话给", "现在我要呼叫", "请帮我拨打电话给", "请帮我呼叫", "请我要拨打电话给", "请我要呼叫", "拨打电话给", "呼叫". The terminal can thus construct multiple candidate texts according to the multi-slot recognition grammar; by using a multi-level slot grammar, more candidate texts can be constructed with the same number of command words, and accordingly more voice input content can be recognized.
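The slot-combination construction described above can be sketched as follows. This is a minimal illustration using the three example slots from this paragraph; the slot contents and the action-word constraint come from the text, while the function and variable names are invented for illustration:

```python
from itertools import product

# Example slots from the text; each slot may also contribute nothing (None).
MODIFIER = ["现在", "请"]        # first-level: modifier command words
INTENTION = ["帮我", "我要"]     # second-level: intention command words
ACTION = ["拨打电话给", "呼叫"]  # third-level: action command words

def candidate_texts(require_action=False):
    """Combine one (or no) command word from each slot into a candidate text."""
    texts = []
    for m, i, a in product(MODIFIER + [None], INTENTION + [None], ACTION + [None]):
        parts = [p for p in (m, i, a) if p is not None]
        if not parts:
            continue  # an empty combination is not a candidate text
        if require_action and a is None:
            continue  # a complete operation needs an action command word
        texts.append("".join(parts))
    return texts

print(len(candidate_texts()))                     # 26 unconstrained candidates
print(len(candidate_texts(require_action=True)))  # 18 with an action word required
```

With 6 command words the grammar yields 26 (or 18) candidate texts, which is the saving the paragraph describes compared with storing each full phrase as its own command element.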
The matching of the voice features against the phoneme string corresponding to each of the multiple candidate texts specifically includes: performing phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and selecting, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features. After the voice features are extracted from the voice information, each voice feature is matched against the phoneme model corresponding to each phoneme in the phoneme string of each candidate text, yielding a distance value between each voice feature and each phoneme; accumulating the distance values between the multiple pronunciations and the multiple phoneme models gives the accumulated distance value for one phoneme string, and the candidate text corresponding to the phoneme string with the smallest accumulated distance value is the recognition result.
Specifically, the phoneme string corresponding to each candidate text comprises a series of phonemes, each phoneme corresponds to a phoneme model, and computing each voice feature against the phoneme models yields a value; the whole phoneme string and the voice features then yield an accumulated value. The candidate text corresponding to the phoneme string with the smallest accumulated distance to the voice features is selected as the recognition result. For example, for the candidate text "打电话给张三" ("call Zhang San"), the corresponding phoneme string is "d", "a", "d", "ian", "h", "ua", "g", "ei", "zh", "ang", "s", "an". Computing each voice feature of the user's voice input against the phoneme model of each of these phonemes yields a distance value, a number greater than or equal to 0. Because a phoneme model reflects statistics over a population while the user's voice features are individual characteristics, there is an error between them; this error is the distance between the voice feature and the phoneme model. Accumulating the errors between all the phonemes and the voice features gives the recognition distance between the voice features and the phoneme string corresponding to the candidate text: the smaller the distance value, the smaller the error, meaning the phoneme string matches the voice input better, and the corresponding candidate text is the recognition result.
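The accumulate-and-pick-minimum selection can be sketched numerically as follows. This is a toy illustration, not the patent's acoustic model: phoneme models are reduced to single mean vectors, one feature is assumed per phoneme, and Euclidean distance stands in for the model score; all model values and names are invented:

```python
import math

# Hypothetical phoneme models: phoneme -> mean feature vector (invented values).
PHONEME_MODELS = {
    "d": (0.0, 1.0), "a": (1.0, 0.0), "h": (0.2, 0.9), "ian": (0.5, 0.5),
}

def distance(feature, model):
    """Error between one voice feature and one phoneme model (a value >= 0)."""
    return math.dist(feature, model)

def accumulated_distance(features, phoneme_string):
    """Sum the per-phoneme errors over the whole phoneme string."""
    return sum(distance(f, PHONEME_MODELS[p])
               for f, p in zip(features, phoneme_string))

def recognize(features, candidates):
    """candidates maps text -> phoneme string; pick the smallest accumulated distance."""
    return min(candidates, key=lambda text: accumulated_distance(features, candidates[text]))
```

A smaller accumulated value means the user's features deviate less from that phoneme string's models, so the corresponding candidate text wins, exactly as described above.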
The recognition result includes at least one command word, and each command word is identified by a label. For example, "我要打电话给张三" ("I want to call Zhang San") splits into "我要", "打电话给", and "张三": it includes two command words, "我要" and "打电话给", plus a contact object "张三" ("Zhang San"). The label corresponding to the command word "我要" is "0001", which indicates that the word is an intention command word; the label corresponding to the command word "打电话给" is "0011", which indicates that the operation corresponding to the voice input is making a call; the label corresponding to "张三" is "1000", which indicates that this information is contact information.
The terminal does not have to store the foregoing multi-slot recognition grammar locally; when the grammar is needed, it can be obtained from a network server.
Step 103: Identify, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
Identifying the operation instruction corresponding to the voice information according to the label corresponding to the at least one command word includes: querying a local database or a network server for the operation instruction corresponding to the combination of labels. Specifically, it includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to the combination of labels. When generating the candidate texts, the recognition grammar has already determined the content and label of each part of each candidate text, so each component of the recognition result also has a determined label. The terminal can determine the operation instruction corresponding to the labels according to the mapping between labels and operation instructions stored in the local database or on the network server.
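The label-combination lookup can be sketched as follows, using a local mapping. The label values "0001", "0011", and "1000" come from the example above, while the mapping table, the operation name, and the function names are assumptions for illustration only:

```python
# Labels attached to recognition-result parts, as in the "我要打电话给张三" example.
LABELS = {"我要": "0001", "打电话给": "0011", "张三": "1000"}

# Hypothetical local database: label combination -> operation instruction.
OPERATIONS = {("0001", "0011"): "MAKE_CALL"}

def operation_for(recognized_words):
    """Combine the command-word labels and look up the operation instruction."""
    command_labels = tuple(
        LABELS[w] for w in recognized_words if not LABELS[w].startswith("1")
    )  # labels starting with "1" mark object information (a contact), per the example
    return OPERATIONS.get(command_labels)

print(operation_for(["我要", "打电话给", "张三"]))  # MAKE_CALL
```

The object part ("张三") is excluded from the combination and instead supplies the argument of the operation, mirroring how the grammar separates command words from the object.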
In this embodiment of the present invention, a terminal receives voice information and extracts voice features from the voice information; matches the voice features against multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and identifies, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information. The terminal thus splits the voice information into command words according to a two-slot or multi-slot command word recognition grammar and identifies the operation instruction corresponding to the voice information according to the at least one command word obtained by the splitting. This provides an information recognition method in which the same number of command words can recognize more voice input content, improving the user experience.
Correspondingly, an embodiment of the present invention further provides an information recognition apparatus. FIG. 2 is a schematic diagram of an information recognition apparatus according to an embodiment of the present invention. As shown in FIG. 2, this embodiment includes the following units:
Receiving unit 201, configured to receive voice information, extract voice features from the voice information, and transmit the voice features to the matching unit.
After receiving the voice information, the terminal converts the voice information into digital information and extracts the corresponding voice features from the digital information.
Matching unit 202, configured to receive the voice features transmitted by the receiving unit, match the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word, and transmit the label to the recognition unit.
To implement the technical solution of the present invention, in practical applications the terminal provides a multi-command-word-slot recognition grammar. Compared with the existing "action + object" recognition grammar, the multi-slot grammar splits the action part into different parts and supports recognizing more voice input content through combinations of those parts, so that for the same amount of voice input content the recognition grammar needs to provide fewer command words; the multi-slot recognition grammar is also easier to maintain and extend.
In the recognition grammar provided by this embodiment of the present invention, the action part of the existing recognition grammar can be split into two or more levels, for example three levels, where the first-level command words may be modifier command words, the second-level command words may be intention command words, and the third-level command words may be action command words.
The matching unit 202 is specifically configured to: perform phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and select, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features. After the voice features are extracted from the voice information, they are matched against the phoneme string of each candidate text, and the candidate text corresponding to the phoneme string with the smallest distance value is the recognition result.
The recognition result includes at least one command word, and each command word is identified by a label. For example, "我要打电话给张三" ("I want to call Zhang San") splits into "我要", "打电话给", and "张三": it includes two command words, "我要" and "打电话给", plus a contact object "张三". The label corresponding to the command word "我要" is "0001", which indicates that the word is an intention command word; the label corresponding to "打电话给" is "0011", which indicates that the operation corresponding to the voice input is making a call; the label corresponding to "张三" is "1000", which indicates that this information is contact information.
The terminal does not have to store the foregoing multi-slot recognition grammar locally; when the grammar is needed, it can be obtained from a network server.
Recognition unit 203, configured to receive the label transmitted by the matching unit and identify, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
The recognition unit 203 is specifically configured to: identify the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word; that is, combine the labels corresponding to each command word in the at least one command word in the recognition result, and query a local database or a network server for the operation instruction corresponding to the combination of labels. Specifically, when generating the candidate texts, the recognition grammar has already determined the content and label of each part of each candidate text, so each component of the recognition result also has a determined label. The terminal can determine the operation instruction corresponding to the labels according to the mapping between labels and operation instructions stored in the local database or on the network server.
Preferably, this embodiment further includes: generation unit 204, configured to select command words from multiple command word slots according to the recognition grammar network to generate the multiple candidate texts. The terminal can thus construct multiple candidate texts according to the multi-slot recognition grammar; by using a multi-level slot grammar, more candidate texts can be constructed with the same number of command words, and accordingly more voice input content can be recognized.
The terminal thus splits the voice information into command words according to a two-slot or multi-slot command word recognition grammar and identifies the operation instruction corresponding to the voice information according to the at least one command word obtained by the splitting. This provides an information recognition method in which the same number of command words can recognize more voice input content, improving the user experience.
Correspondingly, an embodiment of the present invention further provides a terminal. FIG. 3 is a schematic diagram of a terminal according to an embodiment of the present invention. As shown in FIG. 3, this embodiment includes a network interface 301, a processor 302, and a memory 303. A system bus 304 connects the network interface 301, the processor 302, and the memory 303.
The network interface 301 is configured to communicate with other terminals or with a network server.
The memory 303 may be a persistent memory, such as a hard disk drive or flash memory, and holds the recognition grammar, software modules, and device drivers. The software modules are functional modules capable of executing the foregoing method of the present invention; the device drivers may be network and interface drivers; the recognition grammar is used to generate the candidate texts and to recognize the recognition result corresponding to the voice input content.
At startup, the recognition grammar and the software components are loaded into the memory 303 and then accessed by the processor 302, which executes the following instructions:
receiving voice information, and extracting voice features from the voice information;
matching the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, where the recognition result includes at least one command word and a label corresponding to the at least one command word; and
identifying, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information.
Specifically, to implement the technical solution of the present invention, the memory 303 of the terminal may store a multi-command-word-slot recognition grammar. Compared with the existing "action + object" recognition grammar, the multi-slot grammar splits the action part into different parts and supports recognizing more voice input content through different combinations of those parts, so that for the same amount of voice input content the recognition grammar needs to provide fewer command words. For example, among voice inputs commonly used by users, such as "拨打电话给" ("call"), "帮我拨打电话给" ("call ... for me"), "呼叫" ("dial"), and "帮我呼叫" ("dial ... for me"), some wording is shared by several voice inputs, such as "拨打电话给" and "呼叫". With the existing grammar structure these four voice inputs require 4 command elements, whereas with the multi-slot recognition grammar only one first-level command word "帮我" and two second-level command words "拨打电话给" and "呼叫" are needed, i.e. 3 command words in total. This saves command words, and the multi-slot recognition grammar is also easier to maintain and extend.
In the recognition grammar provided by this embodiment of the present invention, the action part of the existing recognition grammar can be split into two or more levels, for example three levels, where the first-level command words may be modifier command words, the second-level command words may be intention command words, and the third-level command words may be action command words.
Further, after accessing the software components in the memory, the processor 302 performs the matching of the voice features against the multiple candidate texts with the following instructions: performing phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and selecting, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
Each command word in the at least one command word is identified by a label. Further, after accessing the software components in the memory, the processor 302 identifies, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information specifically by: identifying the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word.
Further, the process in which the processor 302 identifies the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word includes: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to the combination of labels.
Further, before performing the matching operation, the processor 302 may also access the software components and execute the following instruction: selecting command words from multiple command word slots according to the recognition grammar network to generate the multiple candidate texts.
The terminal thus splits the voice information into command words according to a two-slot or multi-slot command word recognition grammar and identifies the operation instruction corresponding to the voice information according to the at least one command word obtained by the splitting. This provides an information recognition method in which the same number of command words can recognize more voice input content, improving the user experience.
A person skilled in the art may further be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination thereof. To clearly illustrate the interchangeability of hardware and software, the foregoing has generally described the compositions and steps of each example by function. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present invention.
The steps of the methods or algorithms described with reference to the embodiments disclosed herein may be implemented by hardware, a software module executed by a processor, or a combination thereof. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing specific embodiments further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the foregoing is merely specific embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

Claims
1. An information recognition method, wherein the method comprises:
receiving voice information, and extracting voice features from the voice information;
matching the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, wherein the recognition result comprises at least one command word and a label corresponding to the at least one command word; and
identifying, according to the label corresponding to the at least one command word, an operation instruction corresponding to the voice information.
2. The information recognition method according to claim 1, wherein the matching of the voice features against the phoneme string corresponding to each of the multiple candidate texts specifically comprises: performing phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and
selecting, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
3. The information recognition method according to claim 1, wherein each command word in the at least one command word is identified by a label; and
the identifying, according to the label corresponding to the at least one command word, the operation instruction corresponding to the voice information specifically comprises: identifying the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word.
4. The information recognition method according to claim 1, wherein before the receiving voice information and extracting voice features from the voice information, the method further comprises: selecting command words from multiple command word slots according to a recognition grammar network to generate the multiple candidate texts.
5. The information recognition method according to claim 3, wherein the identifying the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word comprises: combining the labels corresponding to each command word in the at least one command word in the recognition result, and querying a local database or a network server for the operation instruction corresponding to the combination of labels.
6. An information recognition apparatus, wherein the apparatus comprises:
a receiving unit, configured to receive voice information, extract voice features from the voice information, and transmit the voice features to a matching unit;
the matching unit, configured to receive the voice features transmitted by the receiving unit, match the voice features against the phoneme string corresponding to each of multiple candidate texts to obtain a recognition result, wherein the recognition result comprises at least one command word and a label corresponding to the at least one command word, and transmit the label to a recognition unit; and
the recognition unit, configured to receive the label transmitted by the matching unit and identify, according to the label corresponding to the at least one command word, an operation instruction corresponding to the voice information.
7. The information recognition apparatus according to claim 6, wherein the matching unit is specifically configured to:
perform phoneme distance calculation between the voice features and the phoneme string corresponding to each of the multiple candidate texts to obtain distance values; and
select, as the recognition result, the candidate text corresponding to the phoneme string with the smallest distance value to the voice features.
8. The information recognition apparatus according to claim 6, wherein each command word in the at least one command word is identified by a label; and
the recognition unit is specifically configured to: identify the operation instruction corresponding to the voice information according to the combination of the labels corresponding to all command words in the at least one command word.
9. The information recognition apparatus according to claim 6, wherein the apparatus further comprises: a generation unit, configured to select command words from multiple command word slots according to a recognition grammar network to generate the multiple candidate texts.
10. The information recognition apparatus according to claim 8, wherein the recognition unit is specifically configured to: combine the labels corresponding to each command word in the at least one command word in the recognition result, and query a local database or a network server for the operation instruction corresponding to the combination of labels.
PCT/CN2014/070489 2013-01-29 2014-01-10 信息的识别方法和装置 WO2014117645A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP14745447.4A EP2869298A4 (en) 2013-01-29 2014-01-10 INFORMATION IDENTIFICATION PROCESS AND DEVICE
US14/585,959 US9390711B2 (en) 2013-01-29 2014-12-30 Information recognition method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310034262.5 2013-01-29
CN201310034262.5A CN103077714B (zh) 2013-01-29 2013-01-29 信息的识别方法和装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/585,959 Continuation US9390711B2 (en) 2013-01-29 2014-12-30 Information recognition method and apparatus

Publications (1)

Publication Number Publication Date
WO2014117645A1 true WO2014117645A1 (zh) 2014-08-07

Family

ID=48154223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/070489 WO2014117645A1 (zh) 2013-01-29 2014-01-10 信息的识别方法和装置

Country Status (4)

Country Link
US (1) US9390711B2 (zh)
EP (1) EP2869298A4 (zh)
CN (1) CN103077714B (zh)
WO (1) WO2014117645A1 (zh)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077714B (zh) 2013-01-29 2015-07-08 华为终端有限公司 信息的识别方法和装置
CN104301500A (zh) * 2013-07-16 2015-01-21 中兴通讯股份有限公司 一种终端控制方法、装置和终端
CN103699293A (zh) * 2013-12-02 2014-04-02 联想(北京)有限公司 一种操作方法和电子设备
EP2911149B1 (en) * 2014-02-19 2019-04-17 Nokia Technologies OY Determination of an operational directive based at least in part on a spatial audio property
US9263042B1 (en) * 2014-07-25 2016-02-16 Google Inc. Providing pre-computed hotword models
CN104408060B (zh) * 2014-10-29 2018-08-07 小米科技有限责任公司 信息处理的方法及装置
CN104681023A (zh) * 2015-02-15 2015-06-03 联想(北京)有限公司 一种信息处理方法及电子设备
CN107092606B (zh) * 2016-02-18 2022-04-12 腾讯科技(深圳)有限公司 一种搜索方法、装置及服务器
CN105931639B (zh) * 2016-05-31 2019-09-10 杨若冲 一种支持多级命令词的语音交互方法
WO2017206133A1 (zh) * 2016-06-02 2017-12-07 深圳市智物联网络有限公司 语音识别方法及装置
CN106335436B (zh) * 2016-08-31 2022-03-25 北京兴科迪科技有限公司 一种集成麦克风的内后视镜
CN106791010B (zh) * 2016-11-28 2020-07-10 北京安云世纪科技有限公司 一种信息处理的方法、装置和移动终端
CN106910498A (zh) * 2017-03-01 2017-06-30 成都启英泰伦科技有限公司 提高语音控制命令词识别率的方法
CN108573706B (zh) * 2017-03-10 2021-06-08 北京搜狗科技发展有限公司 一种语音识别方法、装置及设备
CN109754784B (zh) * 2017-11-02 2021-01-29 华为技术有限公司 训练滤波模型的方法和语音识别的方法
CN108597509A (zh) * 2018-03-30 2018-09-28 百度在线网络技术(北京)有限公司 智能语音交互实现方法、装置、计算机设备及存储介质
CN108922531B (zh) * 2018-07-26 2020-10-27 腾讯科技(北京)有限公司 槽位识别方法、装置、电子设备及存储介质
CN109087645B (zh) * 2018-10-24 2021-04-30 科大讯飞股份有限公司 一种解码网络生成方法、装置、设备及可读存储介质
CN109979449A (zh) * 2019-02-15 2019-07-05 江门市汉的电气科技有限公司 一种智能灯具的语音控制方法、装置、设备和存储介质
CN109830240A (zh) * 2019-03-25 2019-05-31 出门问问信息科技有限公司 基于语音操作指令识别用户特定身份的方法、装置及***
CN111860549B (zh) * 2019-04-08 2024-02-20 北京嘀嘀无限科技发展有限公司 信息识别装置、方法、计算机设备及存储介质
CN110580908A (zh) * 2019-09-29 2019-12-17 出门问问信息科技有限公司 一种支持不同语种的命令词检测方法及设备
CN111128174A (zh) * 2019-12-31 2020-05-08 北京猎户星空科技有限公司 一种语音信息的处理方法、装置、设备及介质
CN111489737B (zh) * 2020-04-13 2020-11-10 深圳市友杰智新科技有限公司 语音命令识别方法、装置、存储介质及计算机设备
CN113539252A (zh) * 2020-04-22 2021-10-22 庄连豪 无障碍智能语音***及其控制方法
CN111681647B (zh) * 2020-06-10 2023-09-05 北京百度网讯科技有限公司 用于识别词槽的方法、装置、设备以及存储介质
CN112017647B (zh) * 2020-09-04 2024-05-03 深圳海冰科技有限公司 一种结合语义的语音识别方法、装置和***
CN112735394B (zh) * 2020-12-16 2022-12-30 青岛海尔科技有限公司 一种语音的语义解析方法及装置
CN113160810A (zh) * 2021-01-13 2021-07-23 安徽师范大学 一种基于ld3320的语音识别交互方法及***
CN113823269A (zh) * 2021-09-07 2021-12-21 广西电网有限责任公司贺州供电局 一种基于语音识别电网调度命令自动保存的方法

Citations (9)

Publication number Priority date Publication date Assignee Title
CN1323436A (zh) * 1998-09-09 2001-11-21 旭化成株式会社 声音识别装置
US6836758B2 (en) * 2001-01-09 2004-12-28 Qualcomm Incorporated System and method for hybrid voice recognition
CN1615508A (zh) * 2001-12-17 2005-05-11 旭化成株式会社 语音识别方法、遥控器、信息终端、电话通信终端以及语音识别器
JP2005266009A (ja) * 2004-03-16 2005-09-29 Matsushita Electric Ind Co Ltd データ変換プログラムおよびデータ変換装置
JP2006031010A (ja) * 2004-07-15 2006-02-02 Robert Bosch Gmbh 固有名称又は部分的な固有名称の認識を提供する方法及び装置
CN1945563A (zh) * 2005-10-04 2007-04-11 罗伯特·博世有限公司 不流利语句的自然语言处理
TW201117114A (en) * 2009-11-10 2011-05-16 Inst Information Industry System, apparatus and method for message simulation
CN102201235A (zh) * 2010-03-26 2011-09-28 三菱电机株式会社 发音词典的构建方法和***
CN103077714A (zh) * 2013-01-29 2013-05-01 华为终端有限公司 信息的识别方法和装置

Family Cites Families (16)

Publication number Priority date Publication date Assignee Title
US5732187A (en) * 1993-09-27 1998-03-24 Texas Instruments Incorporated Speaker-dependent speech recognition using speaker independent models
US6208971B1 (en) * 1998-10-30 2001-03-27 Apple Computer, Inc. Method and apparatus for command recognition using data-driven semantic inference
US6895558B1 (en) * 2000-02-11 2005-05-17 Microsoft Corporation Multi-access mode electronic personal assistant
US8068881B2 (en) * 2002-08-09 2011-11-29 Avon Associates, Inc. Voice controlled multimedia and communications system
US7742911B2 (en) * 2004-10-12 2010-06-22 At&T Intellectual Property Ii, L.P. Apparatus and method for spoken language understanding by using semantic role labeling
GB0426347D0 (en) * 2004-12-01 2005-01-05 Ibm Methods, apparatus and computer programs for automatic speech recognition
US7627096B2 (en) * 2005-01-14 2009-12-01 At&T Intellectual Property I, L.P. System and method for independently recognizing and selecting actions and objects in a speech recognition system
KR100717385B1 (ko) * 2006-02-09 2007-05-11 삼성전자주식회사 인식 후보의 사전적 거리를 이용한 인식 신뢰도 측정 방법및 인식 신뢰도 측정 시스템
ATE509345T1 (de) * 2007-09-21 2011-05-15 Boeing Co Gesprochene fahrzeugsteuerung
EP2210372B1 (en) * 2007-10-04 2011-04-13 U-MAN Universal Media Access Networks GmbH Digital multimedia network with hierarchical parameter control protocol
CN101345051B (zh) * 2008-08-19 2010-11-10 南京师范大学 带定量参数的地理信息***语音控制方法
US20100121707A1 (en) * 2008-11-13 2010-05-13 Buzzient, Inc. Displaying analytic measurement of online social media content in a graphical user interface
US20110099507A1 (en) * 2009-10-28 2011-04-28 Google Inc. Displaying a collection of interactive elements that trigger actions directed to an item
US8909771B2 (en) * 2011-09-15 2014-12-09 Stephan HEATH System and method for using global location information, 2D and 3D mapping, social media, and user behavior and information for a consumer feedback social media analytics platform for providing analytic measurements data of online consumer feedback for global brand products or services of past, present or future customers, users, and/or target markets
CN102510426A (zh) * 2011-11-29 2012-06-20 安徽科大讯飞信息科技股份有限公司 个人助理应用访问方法及***
US9467409B2 (en) * 2013-06-04 2016-10-11 Yahoo! Inc. System and method for contextual mail recommendations


Non-Patent Citations (1)

Title
See also references of EP2869298A4 *

Also Published As

Publication number Publication date
CN103077714A (zh) 2013-05-01
CN103077714B (zh) 2015-07-08
EP2869298A1 (en) 2015-05-06
EP2869298A4 (en) 2015-09-16
US20150120301A1 (en) 2015-04-30
US9390711B2 (en) 2016-07-12

Similar Documents

Publication Publication Date Title
WO2014117645A1 (zh) 信息的识别方法和装置
US11915699B2 (en) Account association with device
JP6771805B2 (ja) 音声認識方法、電子機器、及びコンピュータ記憶媒体
US8972260B2 (en) Speech recognition using multiple language models
CN108694940B (zh) 一种语音识别方法、装置及电子设备
US8959014B2 (en) Training acoustic models using distributed computing techniques
US9940927B2 (en) Multiple pass automatic speech recognition methods and apparatus
JP2021018797A (ja) 対話の交互方法、装置、コンピュータ可読記憶媒体、及びプログラム
US10366690B1 (en) Speech recognition entity resolution
WO2019001194A1 (zh) 语音识别方法、装置、设备及存储介质
JP2018536905A (ja) 発話認識方法及び装置
US10170122B2 (en) Speech recognition method, electronic device and speech recognition system
US10152298B1 (en) Confidence estimation based on frequency
JP2016500843A (ja) 検索クエリ情報を使用する音声認識処理のための方法およびシステム
US20150348543A1 (en) Speech Recognition of Partial Proper Names by Natural Language Processing
US10909983B1 (en) Target-device resolution
KR20210016767A (ko) 음성 인식 방법 및 음성 인식 장치
US10629207B2 (en) Caching scheme for voice recognition engines
CN109741735A (zh) 一种建模方法、声学模型的获取方法和装置
WO2017184387A1 (en) Hierarchical speech recognition decoder
WO2021098318A1 (zh) 应答方法、终端及存储介质
US11552803B1 (en) Systems for provisioning devices
US11551666B1 (en) Natural language processing
JP2015102805A (ja) 音声認識システム、電子装置、サーバー、音声認識方法および音声認識プログラム
US11627185B1 (en) Wireless data protocol

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14745447

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2014745447

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE