WO2023159881A1 - Speech intent recognition method, apparatus, and electronic device - Google Patents

Speech intent recognition method, apparatus, and electronic device

Info

Publication number
WO2023159881A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sample
terminal device
voice data
speech
Prior art date
Application number
PCT/CN2022/110081
Other languages
English (en)
French (fr)
Inventor
Liu Jianguo (刘建国)
Shi Xinmei (施新梅)
Original Assignee
Qingdao Haier Technology Co., Ltd. (青岛海尔科技有限公司)
Haier Smart Home Co., Ltd. (海尔智家股份有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Haier Technology Co., Ltd. and Haier Smart Home Co., Ltd.
Publication of WO2023159881A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/1822 Parsing for meaning understanding
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present disclosure relates to the field of big data, and in particular to a speech intent recognition method, apparatus, and electronic device.
  • Embodiments of the present disclosure provide a speech intent recognition method, device, and electronic device, so as to at least solve the technical problem of inaccurate speech intent recognition results in related technologies.
  • a voice intent recognition method including: acquiring voice data from a terminal device and status data of the terminal device; inputting the voice data and status data into a multi-classification model to obtain the intent recognition result corresponding to the voice data, wherein the multi-classification model is obtained based on training with multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data; and returning the intent recognition result to the terminal device.
  • a voice intent recognition method including: collecting voice data; reporting the voice data and the status data of the terminal device to a server, wherein the server is configured to process the voice data and status data using a multi-classification model to obtain the intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data; and receiving the intent recognition result returned by the server.
  • a voice intent recognition device including: an acquisition module configured to acquire voice data from a terminal device and status data of the terminal device; a processing module configured to input the voice data and state data into the multi-classification model to obtain the intent recognition result corresponding to the voice data, wherein the multi-classification model is obtained based on training with multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data; and a return module configured to return the intent recognition result to the terminal device.
  • a voice intent recognition device including: a collection module configured to collect voice data; a reporting module configured to report the voice data and the status data of the terminal device to a server, wherein the server is configured to process the voice data and state data using a multi-classification model to obtain the intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data; and a receiving module configured to receive the intent recognition result returned by the server.
  • an electronic device including: a processor; and a memory configured to store instructions executable by the processor; wherein the processor is configured to execute the instructions to implement any of the above speech intent recognition methods.
  • a computer-readable storage medium: when the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is enabled to perform any of the above-mentioned speech intent recognition methods.
  • through the above method, intent recognition is performed on the voice, and the intent recognition result is returned to the terminal device.
  • since the intent recognition considers not only the voice data but also the state data at the time the voice data is received, the accuracy of voice intent recognition can be effectively improved based on this richer information.
  • in addition, since the multi-classification model has been fully trained on sample speech data, sample state data, and the intents corresponding to the sample speech data, where the sample speech and state data contain a variety of information in the current field, the multi-classification model can efficiently and accurately derive the intent corresponding to the input voice from the input voice data and state data, thereby achieving the technical effect of efficiently and accurately identifying the intent corresponding to the voice, and solving the technical problem of inaccurate voice intent recognition results in the related art.
  • FIG. 1 is a flow chart of a speech intent recognition method 1 according to an embodiment of the present disclosure
  • FIG. 2 is a flow chart of a second voice intent recognition method according to an embodiment of the present disclosure
  • FIG. 3 is a schematic flow diagram of an optional embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a machine learning pre-training model in an optional embodiment of the present disclosure
  • FIG. 5 is a structural block diagram of a speech intention recognition device 1 provided according to an embodiment of the present disclosure;
  • FIG. 6 is a structural block diagram of a speech intention recognition device 2 provided according to an embodiment of the present disclosure;
  • FIG. 7 is a schematic diagram of an electronic device for voice intent recognition provided according to an embodiment of the present disclosure.
  • ASR: Automatic Speech Recognition.
  • Natural Language Processing (NLP): a discipline that studies language problems in human-computer interaction.
  • Natural Language Understanding (NLU): commonly known as human-computer dialogue; a sub-discipline of artificial intelligence that studies the use of electronic computers to simulate the process of human language communication, so that computers can understand and use the natural languages of human society, such as Chinese and English, to realize natural language communication between humans and computers.
  • Natural Language Generation (NLG): the computer expressing its intent in natural language text.
  • XGBoost: an optimized distributed gradient-boosting library designed to be efficient, flexible, and portable.
  • Internet of Things (IoT): the real-time collection, through information sensors, radio-frequency identification, global positioning systems, infrared sensors, laser scanners, and other devices and technologies, of any object or process that needs to be connected and interacted with, gathering information such as sound, light, heat, electricity, mechanics, chemistry, biology, and location, so as to realize ubiquitous connection between things and between things and people through all possible network access, and to achieve intelligent perception, identification, and management of objects and processes.
  • HBase: a distributed, column-oriented open-source database.
  • Kafka: an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of consumers on a website.
  • Elasticsearch: a Lucene-based search server that provides a distributed, multi-user full-text search engine with a RESTful web interface. Developed in Java and released as open source under the Apache license, Elasticsearch is a popular enterprise-grade search engine.
  • Flink: a distributed computing framework that can quickly process data of any size.
  • Five-tuple: a communications term, usually referring to the source IP address, source port, destination IP address, destination port, and transport-layer protocol.
  • an embodiment of a speech intent recognition method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, such as a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from that described herein.
  • FIG. 1 is a flow chart of a speech intent recognition method 1 according to an embodiment of the present disclosure. As shown in FIG. 1 , the method includes the following steps:
  • Step S102: acquire voice data from the terminal device and status data of the terminal device;
  • Step S104: input the voice data and state data into the multi-classification model to obtain the intent recognition result corresponding to the voice data, wherein the multi-classification model is obtained based on training with multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data;
  • Step S106: return the intent recognition result to the terminal device.
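Steps S102 to S106 can be sketched as follows. This is an illustrative sketch only: all function and field names are assumptions, and the trained multi-classification model is stubbed out with a simple callable, since the patent does not specify an API.

```python
def classify(features):
    # Stand-in for the trained multi-classification model (e.g. XGBoost):
    # maps a feature dict to an (operating device, operation instruction) intent.
    if "dark" in features["keywords"] and features["light_state"] == "off":
        return {"device": "light", "instruction": "turn_on"}
    return {"device": None, "instruction": None}

def recognize_intent(voice_data, status_data):
    # S102: voice data and terminal status data have been acquired.
    features = {
        "keywords": voice_data["keywords"],
        "light_state": status_data.get("light_state", "unknown"),
        "time_of_day": status_data.get("time_of_day"),
    }
    # S104: feed combined voice + state features to the multi-classification model.
    intent = classify(features)
    # S106: the intent recognition result is returned to the terminal device.
    return intent

result = recognize_intent(
    {"keywords": ["dark"]},
    {"light_state": "off", "time_of_day": "night"},
)
print(result)  # {'device': 'light', 'instruction': 'turn_on'}
```

The key point of the method is visible in `recognize_intent`: the state data is fed to the model alongside the voice data, rather than classifying on the utterance alone.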
  • through the above steps, intent recognition is performed on the voice, and the intent recognition result is returned to the terminal device. Since the intent recognition considers not only the voice data but also the state data at the time the voice data is received, the accuracy of voice intent recognition can be effectively improved based on this richer information.
  • in addition, since the multi-classification model has been fully trained on sample speech data, sample state data, and the intents corresponding to the sample speech data, where the sample speech and state data contain a variety of information in the current field, the multi-classification model can efficiently and accurately derive the intent corresponding to the input voice from the input voice data and state data, thereby achieving the technical effect of efficiently and accurately identifying the intent corresponding to the voice, and solving the technical problem of inaccurate voice intent recognition results in the related art.
  • the multi-classification model has been fully trained on the sample voice data, the sample state data, and the intents corresponding to the sample voice data, so that when the multi-classification model is used for voice intent prediction, it is both efficient and accurate.
  • the sample voice data and sample status data can contain various information in the current field; for example, the voice data can include device nouns or commonly used expression keywords in the current vertical field, and the status data can include the switch status of the current terminal device, indoor environment information, location information, and so on.
  • the multi-classification model can accurately obtain the intent corresponding to the input voice from the input voice data and state data. For example, if the voice data contains keywords such as "dark" or "dim", the time of receiving the voice in the status data is night, and the switch state of the light is "off", it can be judged that the intent of the voice is "turn on the light". For another example, if the voice data contains the keyword "hot", the time when the voice is received is summer, and the switch state of the air conditioner is "off", it can be judged that the intent of the voice is "turn on the air conditioner for cooling". In this way, the technical effect of efficiently and accurately identifying the intent corresponding to the voice is realized, the technical problem of inaccurate voice intent recognition results in the related art is solved, and a better user experience is provided.
  • the speech intent recognition result is not necessarily a single intent; multiple intents can also be recognized in parallel from the speech.
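The "too dark"/"too hot" examples above, including parallel recognition of several intents from one utterance, can be illustrated with a small sketch. The rule table here is an invented stand-in for the trained model's decisions, not the patent's actual algorithm:

```python
# Each rule: (keyword, required (device, state), resulting intent).
RULES = [
    ("dark", ("light", "off"), {"device": "light", "instruction": "turn_on"}),
    ("hot",  ("ac", "off"),    {"device": "ac", "instruction": "cool"}),
]

def recognize_intents(keywords, device_states):
    # Combine voice keywords with device switch states; collect every
    # matching intent, so one utterance may yield several intents.
    intents = []
    for kw, (dev, state), intent in RULES:
        if kw in keywords and device_states.get(dev) == state:
            intents.append(intent)
    return intents  # may contain zero, one, or several intents

intents = recognize_intents(["dark", "hot"], {"light": "off", "ac": "off"})
print(intents)
# [{'device': 'light', 'instruction': 'turn_on'}, {'device': 'ac', 'instruction': 'cool'}]
```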
  • the following method can be used to obtain the status data of the terminal device: obtain the device identifier of the terminal device and the account information corresponding to the terminal device; based on the device identifier and account information, match the status data of the terminal device.
  • in this way, the operation process of determining the status of the terminal device can be greatly simplified.
  • the terminal device only needs to report its device identifier and account information in order to directly and accurately obtain the real-time status data of the terminal device.
  • the status data may include various types, for example, may include region, environment, room information, terminal device switch status, and the like.
  • the account information may also include multiple types, for example, it may include a list of terminal devices bound to the account or operation preferences of the account for each terminal device, and the like. Since the above state data and account information cover various types of information that may be involved in the current vertical field, the state data and account information can provide a more comprehensive judgment basis for intent recognition. It should be noted that the above-mentioned account information may be identification information bound to the terminal device, and may correspond to one user, multiple users, or one or more organizations.
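The matching step above amounts to a server-side lookup keyed by device identifier and account. A minimal sketch, with invented table contents and field names:

```python
STATUS_TABLE = {
    # (device_id, account_id) -> real-time status data held by the server
    ("dev-001", "acct-42"): {"region": "north", "room": "bedroom", "switch": "off"},
}

def match_status(device_id, account_id):
    # The terminal reports only its ID and account; the server matches
    # region, environment, room and switch status from its own records.
    return STATUS_TABLE.get((device_id, account_id), {})

status = match_status("dev-001", "acct-42")
print(status["switch"])  # off
```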
  • before inputting the voice data and state data into the multi-classification model to obtain the intent recognition result corresponding to the voice data, it is also possible to: acquire multiple sets of sample data, wherein the sample voice data included in the multiple sets of sample data includes speech keyword classifications; the sample state data included in the multiple sets of sample data includes at least one of the following: time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, master control device information of the sample terminal device, bound device information of the sample terminal device, and five-tuple information of the sample account corresponding to the sample terminal device; and the intent corresponding to the sample voice data includes the operating device and operation instruction corresponding to the sample voice data. Machine training is then performed with the multiple sets of sample data to obtain the multi-classification model.
  • the multi-classification model can thus be trained very fully and comprehensively on the above multiple sets of sample data.
  • the five-tuple information can coordinate global behaviors and sort them by time to distinguish each action, which ensures the validity of the input data for the multi-classification model.
  • through the status data, the actual operation performed by the user after uttering the voice can be analyzed, which can also be used as the training target label of the multi-classification model or set as the test result.
  • the multi-classification model can accurately judge the intention of the speech through the speech data and status data.
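One training sample as described above pairs voice/state features with the user's actual follow-up operation as the supervised label. A sketch with assumed field names (the patent does not fix a schema):

```python
def build_sample(keyword_class, state, actual_action):
    # Input side: speech keyword classification plus state features.
    features = {
        "keyword_class": keyword_class,          # from the sample voice data
        "time": state["time"],                   # time/space/environment info
        "room": state["room"],
        "master_device": state["master_device"],
    }
    # Label side: the operation the user really performed after speaking,
    # i.e. the (operating device, operation instruction) intent.
    label = (actual_action["device"], actual_action["instruction"])
    return features, label

features, label = build_sample(
    "too_dark",
    {"time": "22:10", "room": "bedroom", "master_device": "speaker-1"},
    {"device": "light", "instruction": "turn_on"},
)
print(label)  # ('light', 'turn_on')
```

Many such samples would then be fed to a multi-class learner such as XGBoost, per the disclosure.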
  • the terminal device may include multiple types, for example, a smart speaker.
  • smart speakers can not only obtain voice data and status data and receive intent recognition results, but also realize further human-computer interaction based on the recognized intent, providing a better user experience.
  • FIG. 2 is a flow chart of a voice intent recognition method 2 according to an embodiment of the present disclosure. As shown in FIG. 2 , the method includes the following steps:
  • Step S202: collect voice data;
  • Step S204: report the voice data and the state data of the terminal device to the server, wherein the server is configured to process the voice data and the state data using a multi-classification model to obtain the intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data;
  • Step S206: receive the intent recognition result returned by the server.
  • the multiple sets of sample data also cover various types of information that may be involved in the current vertical field, various factors that may generate an intent to operate the terminal device, and the real intent corresponding to the sample voice data, so the intent recognition result obtained by the multi-classification model is accurate and reliable. Therefore, through the above steps, an accurate speech intent recognition result can be obtained quickly, thereby solving the technical problem of inaccurate speech intent recognition results in the related art and providing a better user experience.
  • the target operation instruction is sent to the target operation device, so that the target operation device executes the target operation instruction.
  • the corresponding target operation instruction can be executed on the target operation device according to the received intention recognition result, so as to provide the user with a more efficient and high-quality use experience.
  • speech intelligence is mainly divided into two fields, one is the perception field (ASR), and the other is the cognitive field (NLP).
  • the field of perception refers to the task of using computers to automatically convert speech to text.
  • speech recognition is usually combined with natural language understanding, natural language generation, and speech synthesis technologies to provide a natural and fluent speech-based human-computer interaction method.
  • cognitive domains include Natural Language Understanding (NLU) and Natural Language Generation (NLG).
  • the industry usually uses third-party speech recognition engines in the field of language perception, but mature solutions are lacking in the field of cognition, and vertical-industry knowledge is complex, rich, and difficult to unify; therefore, experimental research in vertical fields has become a practical approach.
  • AI faces the problems of difficult semantic understanding and a lack of annotated data resources, especially in the speech intent recognition (NLU) link, where there is a technical bottleneck. Specifically, the user's fuzzy semantics cannot be accurately identified; discrimination temporarily relies on fixed rule algorithms and lacks logical association, resulting in misjudged intents and user complaints. For example, when the user utters insubstantial words such as "too cold" or "too dark", NLP is currently unable to accurately identify the user's intent.
  • the optional implementation of the present disclosure infers and judges from information such as the context, environment, time, and user habits and preferences of the voice inquiry, which improves the accuracy of user intent recognition and improves the user's smart-scene experience.
  • the optional implementation of this disclosure mainly studies semantic understanding: it uses the daily language of the home-appliance vertical field as the basic corpus, combines the user's real-time home-appliance status with key features such as language context, environment, and room location, trains a multi-classification model, and outputs the appliance type and operating parameters that the user is likely to operate.
  • the semantic understanding model in the related art is based on rules and knowledge graphs, lacks information input such as context and environment, lacks labeled data for verification, and its recognition accuracy is not high.
  • the optional implementation of this disclosure adopts large-scale corpus data and the status data reported by networked appliances as training data, trains the model with a machine learning multi-classification model (XGBoost), and combines manually labeled data for supervised feedback, which greatly improves the recall and precision of the model and makes up for the AI team's gap in the field of semantic understanding and the shortage of raw data.
  • the optional implementation mode of the present disclosure belongs to the bridge link between AI and IOT, and can be used as a hub to connect the input end and the output end.
  • FIG. 3 is a schematic flow chart of an optional implementation of the present disclosure. As shown in FIG. 3, the optional implementation of the present disclosure includes:
  • upstream input: networked devices report status data, voice word vectors, and the time, space, and environment information of inquiries;
  • midstream: big-data storage media and processing equipment.
  • the storage medium uses HBase or Elasticsearch as the cache medium for real-time data streams;
  • the processing equipment uses the Flink engine as the real-time computing engine for rule and logic processing; at the same time, it also calls an offline pre-trained model for algorithmic supplementation;
  • the downstream output is the cloud networked-device operation command, including intent and slot information.
  • the terminal output is parsed and expressed by the IoT platform as device operation instructions.
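The upstream/midstream/downstream flow above can be sketched end to end. A plain dict stands in for the HBase/Elasticsearch real-time cache and an ordinary function for the Flink rule/logic stage; all names are illustrative assumptions:

```python
cache = {}  # stand-in for the real-time stream cache (HBase/Elasticsearch)

def upstream_report(device_id, status, word_vector, context):
    # Upstream: networked devices report status data, voice word vectors,
    # and time/space/environment information of the inquiry.
    cache[device_id] = {"status": status, "words": word_vector, "ctx": context}

def midstream_process(device_id):
    # Midstream: rule/logic processing (the Flink role); an offline
    # pre-trained model would be consulted here for supplementation.
    rec = cache[device_id]
    if "dark" in rec["words"] and rec["status"].get("light") == "off":
        return {"intent": "turn_on", "slot": {"device": "light"}}
    return {"intent": "none", "slot": {}}

upstream_report("dev-1", {"light": "off"}, ["too", "dark"], {"time": "night"})
# Downstream: a cloud device operation command with intent and slot info.
command = midstream_process("dev-1")
print(command)  # {'intent': 'turn_on', 'slot': {'device': 'light'}}
```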
  • FIG. 4 is a schematic diagram of a machine learning pre-training model in an optional embodiment of the present disclosure; as shown in FIG. 4, it includes:
  • the AI terminal is used as the data input source to input voice keyword classification, device identification, user identification and voice query time;
  • the prediction result returned to AI is the selected equipment category, action and equipment identification.
  • the five-tuple behavior data coordinates the user's global behavior, sorts it by time, and distinguishes the previous action from the next, providing an effective data source for the classification model;
  • the identifier of the user's actual action is used as the behavior label for the classification model.
  • the XGBoost multi-classification model is a supervised multi-class (cross-entropy based) model, and the training effect is remarkable.
  • the accuracy rate of classification selection reaches more than 80%.
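The time-sorting of five-tuple behavior records described above, which separates a voice query from the user's next real action so that action can serve as the behavior label, can be sketched as follows (record fields are invented for illustration):

```python
records = [
    {"ts": 103, "action": "light_on"},    # user's real follow-up action
    {"ts": 100, "action": "voice_query"}, # "too dark" was uttered first
]

# Sort the behavior records by timestamp to distinguish the previous
# action from the next one.
ordered = sorted(records, key=lambda r: r["ts"])
query = ordered[0]
label = ordered[1]["action"]  # the next action after the voice query
print(query["action"], "->", label)  # voice_query -> light_on
```

The `label` derived this way is what would be attached to the query's features as the supervised training target.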
  • adopting the above optional implementation effectively solves the problem in the related art that intelligent control or interaction capabilities exist only for certain links or categories.
  • in some IoT ecosystems, only some home devices are included and cross-category interconnection has not been opened up; others have advantages in full-category home furnishing and interconnection but lack advanced capabilities.
  • the disclosed optional implementation can be applied to whole-house smart homes.
  • the home appliance has a certain degree of memory, learning, and prediction capabilities, and can perceive and identify the user's needs and intentions in real time, providing users with intimate service and precise device control.
  • the optional implementation mode of the present disclosure uses the AI multi-category selection model, combines the context of user behavior and device status, and the whole house model to accurately identify user intentions and provide high-level intelligent scene services.
  • FIG. 5 is a structural block diagram of a speech intent recognition device 1 provided according to an embodiment of the present disclosure. As shown in FIG. 5, the device includes: an acquisition module 51, a processing module 52, and a return module 53. The device is described below.
  • the acquisition module 51 is configured to acquire the voice data from the terminal device and the status data of the terminal device;
  • the processing module 52, connected to the acquisition module 51, is configured to input the voice data and status data into the multi-classification model to obtain the intent recognition result corresponding to the voice data, wherein the multi-classification model is obtained based on training with multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data;
  • the return module 53, connected to the above-mentioned processing module 52, is configured to return the intent recognition result to the terminal device.
  • the obtaining module 51 includes: a first obtaining unit configured to obtain the device identifier of the terminal device and the account information corresponding to the terminal device; and a matching unit configured to match the status data of the terminal device based on the device identifier and the account information.
  • the above-mentioned device further includes: a second acquisition unit configured to acquire multiple sets of sample data, wherein the sample speech data included in the multiple sets of sample data includes speech keyword classifications; the sample state data included in the multiple sets of sample data includes at least one of the following: time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, master control device information of the sample terminal device, bound device information of the sample terminal device, and five-tuple information of the sample account corresponding to the sample terminal device; and the intent corresponding to the sample voice data includes the operating device and operation instruction corresponding to the sample voice data; and a training unit configured to perform machine training with the multiple sets of sample data to obtain the multi-classification model.
  • the terminal device includes a smart speaker.
  • FIG. 6 is a structural block diagram of a speech intent recognition device 2 provided according to an embodiment of the present disclosure. As shown in FIG. 6, the device includes: a collection module 61, a reporting module 62, and a receiving module 63. The device is described below.
  • the collection module 61 is configured to collect voice data;
  • the reporting module 62, connected to the above-mentioned collection module 61, is configured to report the voice data and the status data of the terminal device to the server, wherein the server is configured to process the voice data and status data using a multi-classification model to obtain the intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample state data, as well as the intent corresponding to the sample voice data;
  • the receiving module 63, connected to the above-mentioned reporting module 62, is configured to receive the intent recognition result returned by the server.
  • the above apparatus further includes: an execution unit configured to, when the intent recognition result includes a target operation device corresponding to the voice data and a target operation instruction, send the target operation instruction to the target operation device, so that the target operation device executes the target operation instruction.
  • FIG. 7 is a schematic diagram of a speech intent recognition electronic device provided according to an embodiment of the present disclosure.
  • the electronic device includes: a processor 702; and a memory 704 configured to store instructions executable by the processor.
  • the foregoing electronic device may be a terminal device, or may be a server.
  • the above-mentioned electronic device can execute the program code of the following steps of the speech intent recognition method of the application: obtain the speech data from the terminal device and the status data of the terminal device; input the speech data and the status data into the multi-classification model to obtain the intent recognition result corresponding to the speech data, wherein the multi-classification model is obtained based on training with multiple sets of sample data, and the multiple sets of sample data include sample voice data and sample status data, as well as the intents corresponding to the sample voice data; and return the intent recognition result to the terminal device.
  • the memory can be configured to store software programs and modules, such as program instructions/modules corresponding to the speech intent recognition method and device in the embodiments of the present disclosure, and the processor runs the software programs and modules stored in the memory to execute various functional applications and data processing, that is, to realize the above-mentioned speech intent recognition method.
  • the memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory may further include a memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the processor can call the information and application programs stored in the memory through the transmission device to perform the following steps: obtaining voice data from a terminal device and status data of the terminal device; inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and returning the intent recognition result to the terminal device.
  • the above-mentioned processor may also execute the program code of the following steps: obtaining the device identification of the terminal device and the account information corresponding to the terminal device; based on the device identification and account information, matching the status data of the terminal device.
  • the above-mentioned processor can also execute the program code of the following steps: acquiring multiple sets of sample data, where the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, and the sample status data included in the multiple sets of sample data includes at least one of the following: the time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, the master control device information of the sample terminal device, the bound device information of the sample terminal device, and the five-tuple information of the sample account corresponding to the sample terminal device, with the intent corresponding to the sample voice data including the operation device and operation instruction corresponding to the sample voice data; and using the multiple sets of sample data for machine training to obtain the multi-classification model.
  • the above-mentioned processor may also execute the program code of the following steps: the terminal device includes a smart speaker.
  • the above-mentioned processor can also execute the program code of the following steps: collecting voice data; reporting the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and receiving the intent recognition result returned by the server.
  • the above-mentioned processor may also execute the program code of the following steps: when the intent recognition result includes a target operation device corresponding to the voice data and a target operation instruction, sending the target operation instruction to the target operation device so that the target operation device executes the target operation instruction.
  • a computer-readable storage medium is also provided; when the instructions in the computer-readable storage medium are executed by the processor of an electronic device, the electronic device can execute the speech intent recognition method of any one of the above-mentioned embodiments.
  • the disclosed technical content can be realized in other ways.
  • the device embodiments described above are only illustrative.
  • the division of units may be a division by logical function; in actual implementation there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be in electrical or other forms.
  • a unit described as a separate component may or may not be physically separated, and a component shown as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed over multiple units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present disclosure in essence, the part that contributes to the prior art, or all or part of the technical solution may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the various embodiments of the present disclosure.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.


Abstract

A speech intent recognition method and apparatus, and an electronic device. The method includes: obtaining voice data from a terminal device and status data of the terminal device (S102); inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data (S104); and returning the intent recognition result to the terminal device (S106).

Description

Speech intent recognition method and apparatus, and electronic device
The present disclosure claims priority to the Chinese patent application No. 202210171068.0, entitled "Speech intent recognition method, apparatus and electronic device", filed with the China Patent Office on February 23, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of big data, and in particular to a speech intent recognition method and apparatus, and an electronic device.
Background
In the related art, semantic recognition is usually performed with techniques from either the perception domain or the cognition domain. The former uses a computer to convert speech into text and combines it with natural language understanding, natural language generation, and speech synthesis to provide a speech-based human-computer interaction method; the latter performs recognition through semantic understanding and language generation. Because intent recognition for speech in the home appliance vertical domain must take many kinds of information into account, both of these common approaches suffer from inaccurate speech intent recognition results.
Therefore, the related art has the technical problem that speech intent recognition results are inaccurate.
No effective solution to the above problem has yet been proposed.
Summary
Embodiments of the present disclosure provide a speech intent recognition method and apparatus, and an electronic device, to at least solve the technical problem in the related art that speech intent recognition results are inaccurate.
According to one aspect of the embodiments of the present disclosure, a speech intent recognition method is provided, including: obtaining voice data from a terminal device and status data of the terminal device; inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and returning the intent recognition result to the terminal device.
According to another aspect of the embodiments of the present disclosure, another speech intent recognition method is provided, including: collecting voice data; reporting the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and receiving the intent recognition result returned by the server.
According to another aspect of the embodiments of the present disclosure, a speech intent recognition apparatus is provided, including: an obtaining module configured to obtain voice data from a terminal device and status data of the terminal device; a processing module configured to input the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and a returning module configured to return the intent recognition result to the terminal device.
According to another aspect of the embodiments of the present disclosure, another speech intent recognition apparatus is provided, including: a collecting module configured to collect voice data; a reporting module configured to report the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and a receiving module configured to receive the intent recognition result returned by the server.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory configured to store instructions executable by the processor, where the processor is configured to execute the instructions to implement any one of the above speech intent recognition methods.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute any one of the above speech intent recognition methods.
In the embodiments of the present disclosure, the voice data and status data obtained by the terminal device are input into a multi-classification model for speech intent recognition, and the intent recognition result is returned to the terminal device. Because intent recognition considers not only the voice data itself but also the status data at the time the voice data is received, the rich information can effectively improve the accuracy of speech intent recognition. In addition, because the multi-classification model has been fully trained on sample voice data, sample status data, and the intents corresponding to the sample voice data, where the sample voice data and sample status data contain various kinds of information of the current domain, the multi-classification model can efficiently and accurately derive the intent corresponding to the input voice from the input voice data and status data, thereby achieving the technical effect of efficiently and accurately recognizing the intent corresponding to speech and solving the technical problem in the related art that speech intent recognition results are inaccurate.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present disclosure and constitute a part of the present disclosure; the illustrative embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation of it. In the drawings:
FIG. 1 is a flowchart of a first speech intent recognition method according to an embodiment of the present disclosure;
FIG. 2 is a flowchart of a second speech intent recognition method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow diagram of an optional implementation of the present disclosure;
FIG. 4 is a schematic diagram of the machine learning pre-training model of an optional implementation of the present disclosure;
FIG. 5 is a structural block diagram of a first speech intent recognition apparatus according to an embodiment of the present disclosure;
FIG. 6 is a structural block diagram of a second speech intent recognition apparatus according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of an electronic device for speech intent recognition according to an embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only a part, rather than all, of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present disclosure.
It should be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present disclosure described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units not clearly listed or inherent to such process, method, product, or device.
First, some of the terms that appear in the description of the embodiments of the present disclosure are explained as follows:
Automatic Speech Recognition (ASR): a technology that converts human speech into text.
Natural Language Processing (NLP): a discipline that studies the linguistic problems of human-computer interaction.
Natural Language Understanding (NLU), commonly known as human-machine dialogue: a branch of artificial intelligence that studies the use of electronic computers to simulate the human process of language communication, so that computers can understand and use the natural languages of human society, such as Chinese and English, achieving natural-language communication between humans and machines.
Natural Language Generation (NLG): the expression by a computer, in natural-language text, of the intention it wants to achieve.
Networked appliance ("wangqi"): a product that is a three-in-one of a physical component, an intelligent component, and a connectivity component, which can introduce the Internet of Things and human-machine dialogue.
XGBoost: an optimized distributed gradient boosting library designed to be efficient, flexible, and portable.
Internet of Things (IoT): collecting, in real time, through various devices and technologies such as information sensors, radio frequency identification, global positioning systems, infrared sensors, and laser scanners, any object or process that needs to be connected or interacted with, gathering whatever information is needed about its sound, light, heat, electricity, mechanics, chemistry, biology, location, and so on, and, through all kinds of possible network access, achieving ubiquitous connection between things and between things and people, so as to realize intelligent sensing, identification, and management of objects and processes.
HBase: a distributed, column-oriented open-source database.
Kafka: an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of consumers on a website.
Elasticsearch: a Lucene-based search server that provides a distributed, multi-user full-text search engine with a RESTful web interface. Elasticsearch is developed in Java, released as open source under the terms of the Apache license, and is a popular enterprise search engine.
Flink: a distributed computing framework that can rapidly process data of any scale.
Five-tuple: a communications term, usually referring to the source IP address, source port, destination IP address, destination port, and transport-layer protocol.
According to an embodiment of the present disclosure, an embodiment of a speech intent recognition method is provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system such as a set of computer-executable instructions, and, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the one here.
FIG. 1 is a flowchart of a first speech intent recognition method according to an embodiment of the present disclosure. As shown in FIG. 1, the method includes the following steps:
Step S102: obtaining voice data from a terminal device and status data of the terminal device;
Step S104: inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data;
Step S106: returning the intent recognition result to the terminal device.
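The server-side flow of steps S102 to S106 can be sketched as follows. This is only an illustration: `MultiClassModel`, `get_state_data`, and the rule inside `predict` are invented stand-ins, not the actual disclosed implementation.

```python
# Illustrative sketch of steps S102-S106 (server side). The model and the
# state lookup below are hypothetical stand-ins for the real components.

def get_state_data(device_id: str) -> dict:
    # Hypothetical lookup of the terminal device's real-time state.
    return {"light": "off", "time_of_day": "evening"}

class MultiClassModel:
    # Stand-in for the trained multi-classification model (e.g. XGBoost).
    def predict(self, voice_data: str, state: dict) -> dict:
        if "dark" in voice_data and state.get("light") == "off":
            return {"device": "light", "action": "turn_on"}
        return {"device": None, "action": None}

def recognize_intent(voice_data: str, device_id: str) -> dict:
    state = get_state_data(device_id)                      # S102: obtain voice + state data
    result = MultiClassModel().predict(voice_data, state)  # S104: run the model
    return result                                          # S106: return result to terminal

print(recognize_intent("it is too dark", "speaker-01"))
# prints: {'device': 'light', 'action': 'turn_on'}
```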
Through the above steps, the voice data and status data obtained by the terminal device are input into a multi-classification model for speech intent recognition, and the intent recognition result is returned to the terminal device. Because intent recognition considers not only the voice data itself but also the status data at the time the voice data is received, the rich information can effectively improve the accuracy of speech intent recognition. In addition, because the multi-classification model has been fully trained on sample voice data, sample status data, and the intents corresponding to the sample voice data, where the sample voice data and sample status data contain various kinds of information of the current domain, the multi-classification model can efficiently and accurately derive the intent corresponding to the input voice from the input voice data and status data, thereby achieving the technical effect of efficiently and accurately recognizing the intent corresponding to speech and solving the technical problem in the related art that speech intent recognition results are inaccurate.
As an optional embodiment, the voice data and status data obtained by the terminal device are input into the multi-classification model for speech intent recognition, and the intent recognition result is returned to the terminal device. The multi-classification model has been fully trained on sample voice data, sample status data, and the intents corresponding to the sample voice data, so that using it to predict speech intent is both efficient and accurate. Moreover, the sample voice data and sample status data can contain various kinds of information of the current domain; for example, the voice data can include device nouns or common expression keywords of the current vertical domain, and the status data can include the on/off state of the current terminal device, indoor environment information, location information, and so on. For example, the multi-classification model can accurately derive the intent corresponding to the input voice from the input voice data and status data: if the voice data contains keywords such as "black" or "dark", the time of receiving the voice in the status data is at night, and the switch state of the light is "off", the intent of the voice can be judged to be "turn on the light". As another example, if the voice data contains the keyword "hot", the time of receiving the voice in the status data is in summer, and the switch state of the air conditioner is "off", the intent of the voice can be judged to be "turn on the air conditioner for cooling". This achieves the technical effect of efficiently and accurately recognizing the intent corresponding to speech, solves the technical problem in the related art that speech intent recognition results are inaccurate, and provides users with a better experience.
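The two examples above (dark at night with the light off, hot in summer with the air conditioner off) can be written out as a toy mapping from a voice keyword class plus state data to an intent. In a real system this mapping would be learned by the multi-classification model rather than hand-coded; the keyword classes, state keys, and intent labels below are all invented for illustration.

```python
# Toy illustration of how a voice keyword class combines with state data to
# yield an intent, mirroring the two examples in the text. The real system
# learns this mapping with a multi-class model instead of fixed rules.

def infer_intent(keyword: str, state: dict) -> str:
    if keyword in ("dark", "black") and state["time"] == "night" and state["light"] == "off":
        return "turn_on_light"
    if keyword == "hot" and state["season"] == "summer" and state["ac"] == "off":
        return "turn_on_ac_cooling"
    return "unknown"

print(infer_intent("dark", {"time": "night", "light": "off",
                            "season": "summer", "ac": "off"}))
# prints: turn_on_light
```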
It should be noted that, in practical applications, the recognition result of a speech intent is not necessarily a single intent; multiple intents can also be recognized in parallel from the speech.
As an optional embodiment, the status data of the terminal device can be obtained in the following way: obtaining the device identifier of the terminal device and the account information corresponding to the terminal device; and matching out the status data of the terminal device based on the device identifier and the account information. Directly matching the terminal device's status data from the device identifier and account information greatly simplifies the process of determining the device's state; for example, the terminal device only needs to report its device identifier and account information, and its real-time status data can be obtained directly and accurately. The status data can include many kinds of information, for example, region, environment, room information, and the on/off state of the terminal device. The account information can also include many kinds, for example, the list of terminal devices bound to the account or the account's operating preferences for each terminal device. Because the above status data and account information cover the various categories of information that may be involved in the current vertical domain, they can provide a more comprehensive basis for judging intent. It should be noted that the above account information may be identification information bound to the terminal device and may correspond to one user, multiple users, or one or more organizations.
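The lookup described above can be sketched as a keyed match from (device identifier, account) to real-time state. The table contents and field names below are invented for illustration only.

```python
# Sketch of matching a terminal device's real-time status data from its
# device identifier and bound account information; the tables here are
# hypothetical stand-ins for the real-time state store.

STATE_TABLE = {
    ("dev-001", "account-A"): {"region": "north", "room": "living room", "switch": "off"},
}
ACCOUNT_TABLE = {
    "account-A": {"bound_devices": ["dev-001", "dev-002"]},
}

def match_state(device_id: str, account: str) -> dict:
    # Merge the device's reported state with the account's bound-device list.
    state = dict(STATE_TABLE.get((device_id, account), {}))
    state["bound_devices"] = ACCOUNT_TABLE.get(account, {}).get("bound_devices", [])
    return state

print(match_state("dev-001", "account-A")["room"])  # prints: living room
```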
As an optional embodiment, before the voice data and the status data are input into the multi-classification model to obtain the intent recognition result corresponding to the voice data, the method may further include: obtaining multiple sets of sample data, where the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, the sample status data included in the multiple sets of sample data includes at least one of the following: the time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, the master control device information of the sample terminal device, the bound device information of the sample terminal device, and the five-tuple information of the sample account corresponding to the sample terminal device, and the intent corresponding to the sample voice data includes the operation device and operation instruction corresponding to the sample voice data; and using the multiple sets of sample data for machine training to obtain the multi-classification model. Because the above multiple sets of sample data cover the various categories of information that may be involved in the current vertical domain, the various factors that may produce an intent to operate a terminal device, and the contextual actions in the five-tuple information, the multi-classification model can be trained very fully and comprehensively on them. In particular, the five-tuple information can integrate behavior across the whole domain and sort it by time to distinguish individual actions, which guarantees the validity of the model's input data. At the same time, the operation the user actually performed after issuing the voice can be parsed from the status data and used as the training target label of the multi-classification model, or set as the test result. In summary, after training on the multiple sets of sample data, the multi-classification model can accurately judge the intent of speech from the voice data and status data.
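As a minimal sketch of the training-data preparation described above, one sample can be encoded into a numeric feature vector and label. The field names, keyword classes, and intent label IDs are assumptions made for illustration; the resulting (X, y) pairs would then be fed to a gradient-boosted multi-class learner such as XGBoost in the real pipeline.

```python
# Sketch of encoding one training sample (keyword class + state features ->
# intent label) for the multi-classification model. All field names and ID
# mappings are illustrative assumptions, not the disclosed schema.

KEYWORD_IDS = {"dark": 0, "hot": 1}
INTENT_IDS = {("light", "turn_on"): 0, ("ac", "cool"): 1}

def encode_sample(sample: dict) -> tuple:
    x = [
        KEYWORD_IDS[sample["keyword"]],  # keyword classification of the speech
        sample["hour"],                  # time info when the sample voice was received
        sample["room_id"],               # spatial info
        sample["temperature"],           # environment info
    ]
    # Label: the operation device and instruction the user actually intended.
    y = INTENT_IDS[(sample["device"], sample["action"])]
    return x, y

x, y = encode_sample({"keyword": "hot", "hour": 14, "room_id": 2,
                      "temperature": 31, "device": "ac", "action": "cool"})
print(x, y)  # prints: [1, 14, 2, 31] 1
```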
As an optional embodiment, the terminal device can be of many kinds, for example, a smart speaker. As an optional terminal device, a smart speaker can not only obtain voice data and status data or receive intent recognition results, but can also realize further human-machine interaction according to the recognized intent, providing a better user experience.
FIG. 2 is a flowchart of a second speech intent recognition method according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes the following steps:
Step S202: collecting voice data;
Step S204: reporting the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data;
Step S206: receiving the intent recognition result returned by the server.
Through the above steps, the collected voice data and terminal status data only need to be reported to the server in order to receive the intent recognition result returned by the server. Because the multi-classification model used for intent recognition is trained on multiple sets of sample data, and the multiple sets of sample data cover the various categories of information that may be involved in the current vertical domain, the various factors that may produce an intent to operate a terminal device, and the real intents corresponding to the sample voice data, the intent recognition results produced by the multi-classification model are accurate and reliable. Accurate speech intent recognition results can therefore be obtained quickly through the above steps, solving the technical problem in the related art that speech intent recognition results are inaccurate and providing users with a better experience.
As an optional embodiment, when the intent recognition result includes a target operation device corresponding to the voice data and a target operation instruction, the target operation instruction is sent to the target operation device so that the target operation device executes the target operation instruction. Through the above operation, the corresponding target operation instruction can be executed on the target operation device according to the received intent recognition result, providing users with a more efficient, higher-quality experience.
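The client-side flow (steps S202 to S206 plus dispatching the returned instruction) can be sketched as below. `server_recognize` is a local stand-in for the real network call to the server, and the device registry is invented for illustration.

```python
# Client-side sketch: report voice + state, receive the intent recognition
# result, then dispatch the target operation instruction to the target
# operation device. `server_recognize` stands in for the real server call.

def server_recognize(voice: str, state: dict) -> dict:
    # Hypothetical server response containing target device and instruction.
    return {"target_device": "light", "instruction": "turn_on"}

def handle_voice(voice: str, state: dict, devices: dict) -> str:
    result = server_recognize(voice, state)  # S204: report; S206: receive result
    target = result.get("target_device")
    if target in devices:                    # dispatch: make the device execute it
        devices[target] = result["instruction"]
    return devices.get(target, "")

devices = {"light": "off"}
handle_voice("too dark", {"light": "off"}, devices)
print(devices)  # prints: {'light': 'turn_on'}
```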
Based on the above embodiments and optional embodiments, an optional implementation is provided and described in detail below.
In the related art, speech intelligence is mainly divided into two domains: the perception domain (ASR) and the cognition domain (NLP). The perception domain refers to the task of using a computer to automatically convert speech into text; in practical applications, speech recognition is usually combined with natural language understanding, natural language generation, and speech synthesis to provide a natural, fluent, speech-based human-computer interaction method. The cognition domain includes natural language understanding (NLU) and natural language generation (NLG). The industry usually adopts third-party speech recognition engines in the perception domain but lacks mature solutions in the cognition domain; the segmented knowledge of each industry is vast, varied, and hard to unify, so experimental research in vertical domains has become one approach.
In the related art, AI faces the problems of difficult semantic understanding and scarce labeled data resources, with a technical bottleneck particularly in the speech intent recognition (NLU) stage. Specifically, users' vague semantics cannot be recognized precisely; discrimination temporarily relies on fixed rule-based algorithms and lacks logical association, so mistaken intent judgments lead to user complaints. For example, for contexts without entity words, such as a user asking "it's too cold" or "it's too dark", NLP currently cannot accurately recognize the user's intent.
The optional implementation of the present disclosure performs inference and judgment using information such as the context of the voice query, the environment, the time, and the user's habits and preferences, improving the accuracy of user intent recognition while also improving the user's smart-scenario experience.
The optional implementation of the present disclosure mainly studies semantic understanding, taking everyday expressions in the home appliance vertical domain as the basic corpus and combining key features such as the user's real-time appliance state, linguistic context, environment, and room location to train a multi-classification model that outputs the appliance type the user may operate and the operation parameters.
Semantic understanding models in the related art are based on rules and knowledge graphs, lack inputs such as context and environment information, lack labeled data for verification, and have low recognition accuracy.
The optional implementation of the present disclosure uses large-scale corpus data and status data reported by networked appliances as training data, performs model training with a machine learning multi-classification model (XGBoost), and combines manually labeled data for supervised feedback, greatly improving the recall and precision of the model and filling the AI team's gaps in semantic understanding and in raw data materials. At the same time, the optional implementation of the present disclosure serves as the bridge between AI and IoT and can act as a hub connecting the input side and the output side.
FIG. 3 is a schematic flow diagram of an optional implementation of the present disclosure. As shown in FIG. 3, the optional implementation of the present disclosure includes:
(1) the architecture (divided into four functional units: networked appliance, voice, big data, and IoT), with Kafka as the transmission medium of the real-time data stream;
(2) the upstream inputs are the status data reported by networked appliances, voice word vectors, and the time, space, and environment information of the query;
(3) the midstream consists of big-data storage media and processing equipment: HBase or Elasticsearch is selected as the cache medium of the real-time data stream, and the Flink engine is used as the real-time computing engine for rule and logic processing, while an offline pre-trained model is also called for algorithmic supplementation;
(4) the downstream output is the cloud command for operating the networked appliance, including intent and slot information. The terminal output is the parsing and expression of the device operation instruction by IoT.
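The four pipeline stages above can be illustrated with a small pure-Python simulation. Kafka, HBase/Elasticsearch, and Flink are only named here; the deque and dict below are deliberately simple stand-ins for those components, and the rule inside the midstream step is invented for illustration.

```python
# Pure-Python simulation of the Fig. 3 pipeline: upstream report -> cached
# real-time stream -> rule/logic processing -> downstream cloud command with
# intent and slot. Real deployments would use Kafka, HBase/Elasticsearch,
# and Flink; the structures below are stand-ins, not the actual system.

from collections import deque

stream = deque()   # stands in for the Kafka real-time data stream
cache = {}         # stands in for the HBase/Elasticsearch cache

def upstream_report(event: dict):
    # (2) upstream: appliance state, voice, and query context enter the stream.
    stream.append(event)

def midstream_process():
    # (3) midstream: cache state, then apply a rule (Flink-style logic step).
    event = stream.popleft()
    cache[event["device_id"]] = event["state"]
    if event.get("voice") == "too dark" and event["state"]["light"] == "off":
        # (4) downstream: cloud command carrying intent and slot information.
        return {"intent": "turn_on", "slot": "light"}
    return None

upstream_report({"device_id": "d1", "state": {"light": "off"}, "voice": "too dark"})
command = midstream_process()
print(command)  # prints: {'intent': 'turn_on', 'slot': 'light'}
```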
FIG. 4 is a schematic diagram of the machine learning pre-training model of an optional implementation of the present disclosure. As shown in FIG. 4, it includes:
(1) the AI side serves as the data input source, inputting the voice keyword classification, device identifier, user identifier, and voice query time;
(2) big data matches the user's real-time region, environment, and room information, as well as the master control device information of the device and the list of devices bound to the user, according to the device identifier and user identifier;
(3) at the same time, big data queries the five-tuple behavior data (the contextual behavior identifiers) in real time according to the user identifier;
(4) the offline and real-time data inputs (X1...XN) of (1) to (3) are combined and the algorithm group's offline training model (XGBoost) is called, split into a training group and a test group, with the behavior label (Y) being the real behavior identifier parsed from the status reported by the device;
(5) the prediction result returned to AI is the selected device category, action, and device identifier.
In the optional implementation of the present disclosure, the five-tuple behavior data integrates the user's behavior across the whole domain and sorts it by time, distinguishing the previous action from the next action, providing a valid data source as input to the classification model; the identifier of the action the user actually performed is deduced from the status reported by the device and used as the behavior label for the classification model. Experiments show that the XGBoost multi-classification model, as a supervised model under a high-entropy state, trains remarkably well: the accuracy of classification selection reaches over 80%.
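Two of the ideas above, ordering behavior records by time to obtain previous/next action context and deriving the real behavior label from reported state changes, can be sketched as below. The record fields and label format are illustrative assumptions.

```python
# Sketch of (a) sorting a user's cross-domain behavior records by time so
# each record knows its previous and next action, and (b) deducing the
# action the user actually performed from a device state change, which
# serves as the behavior label (Y). Field names are illustrative only.

def order_actions(records: list) -> list:
    ordered = sorted(records, key=lambda r: r["ts"])
    for prev, nxt in zip(ordered, ordered[1:]):
        prev["next_action"] = nxt["action"]
        nxt["prev_action"] = prev["action"]
    return ordered

def derive_label(state_before: dict, state_after: dict) -> str:
    # The真实 behavior label: whichever state field changed after the voice query.
    for key in state_after:
        if state_before.get(key) != state_after[key]:
            return f"{key}:{state_after[key]}"
    return "none"

acts = order_actions([{"ts": 2, "action": "ac_on"}, {"ts": 1, "action": "query"}])
label = derive_label({"light": "off"}, {"light": "on"})
print(acts[0]["next_action"], label)  # prints: ac_on light:on
```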
The above optional implementation effectively solves the problem in the related art that only certain stages or certain product categories possess smart control or interaction capabilities. In the related art, some IoT ecosystems include only certain home furnishing devices and have not opened up cross-category interconnection, while others have advantages in whole-category home furnishing and interconnection but lack advanced skills. Addressing this problem, the optional implementation of the present disclosure can be applied to whole-house smart homes and, by combining AI to empower the smart home, gives home appliances a certain degree of memory, learning, and prediction capability, enabling them to sense and recognize users' needs and intents in real time and to provide users with considerate service and precise device control.
In summary, the optional implementation of the present disclosure uses an AI multi-classification selection model, combined with the context of user behavior, the device state, and the whole-house home model, to accurately recognize user intent and provide advanced smart-scenario services.
According to an embodiment of the present disclosure, an apparatus for implementing the above first speech intent recognition method is also provided. FIG. 5 is a structural block diagram of a first speech intent recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 5, the apparatus includes an obtaining module 51, a processing module 52, and a returning module 53, which are described below.
The obtaining module 51 is configured to obtain voice data from a terminal device and status data of the terminal device; the processing module 52, connected to the obtaining module 51, is configured to input the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and the returning module 53, connected to the processing module 52, is configured to return the intent recognition result to the terminal device.
As an optional embodiment, the obtaining module 51 includes: a first obtaining unit configured to obtain the device identifier of the terminal device and the account information corresponding to the terminal device; and a matching unit configured to match out the status data of the terminal device based on the device identifier and the account information.
As an optional embodiment, the above apparatus further includes: a second obtaining unit configured to obtain multiple sets of sample data, where the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, the sample status data included in the multiple sets of sample data includes at least one of the following: the time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, the master control device information of the sample terminal device, the bound device information of the sample terminal device, and the five-tuple information of the sample account corresponding to the sample terminal device, and the intent corresponding to the sample voice data includes the operation device and operation instruction corresponding to the sample voice data; and a training unit configured to perform machine training with the multiple sets of sample data to obtain the multi-classification model.
As an optional embodiment, the terminal device includes a smart speaker.
According to an embodiment of the present disclosure, an apparatus configured to implement the above second speech intent recognition method is also provided. FIG. 6 is a structural block diagram of a second speech intent recognition apparatus according to an embodiment of the present disclosure. As shown in FIG. 6, the apparatus includes a collecting module 61, a reporting module 62, and a receiving module 63, which are described below.
The collecting module 61 is configured to collect voice data; the reporting module 62, connected to the collecting module 61, is configured to report the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and the receiving module 63, connected to the reporting module 62, is configured to receive the intent recognition result returned by the server.
As an optional embodiment, the above apparatus further includes: an execution unit configured to, when the intent recognition result includes a target operation device corresponding to the voice data and a target operation instruction, send the target operation instruction to the target operation device so that the target operation device executes the target operation instruction.
According to an embodiment of the present disclosure, an electronic device is also provided. FIG. 7 is a schematic diagram of an electronic device for speech intent recognition according to an embodiment of the present disclosure. As shown in FIG. 7, the electronic device includes: a processor 702; and a memory 704 configured to store instructions executable by the processor; etc.
It should be noted that, in the embodiments of the present disclosure, the above electronic device may be a terminal device or a server.
The above electronic device can execute the program code of the following steps in the speech intent recognition method of the application: obtaining voice data from a terminal device and status data of the terminal device; inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and returning the intent recognition result to the terminal device.
The memory can be configured to store software programs and modules, such as the program instructions/modules corresponding to the speech intent recognition method and apparatus in the embodiments of the present disclosure; the processor runs the software programs and modules stored in the memory to perform various functional applications and data processing, that is, to implement the above speech intent recognition method. The memory may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, and these remote memories may be connected to the computer terminal through a network; examples of the above network include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor can call the information and application programs stored in the memory through the transmission device to perform the following steps: obtaining voice data from a terminal device and status data of the terminal device; inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, where the multi-classification model is trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and returning the intent recognition result to the terminal device.
Optionally, the above processor can also execute the program code of the following steps: obtaining the device identifier of the terminal device and the account information corresponding to the terminal device; and matching out the status data of the terminal device based on the device identifier and the account information.
Optionally, the above processor can also execute the program code of the following steps: obtaining multiple sets of sample data, where the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, the sample status data included in the multiple sets of sample data includes at least one of the following: the time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, the master control device information of the sample terminal device, the bound device information of the sample terminal device, and the five-tuple information of the sample account corresponding to the sample terminal device, and the intent corresponding to the sample voice data includes the operation device and operation instruction corresponding to the sample voice data; and using the multiple sets of sample data for machine training to obtain the multi-classification model.
Optionally, the above processor can also execute the program code of the following step: the terminal device includes a smart speaker.
Optionally, the above processor can also execute the program code of the following steps: collecting voice data; reporting the voice data and the status data of the terminal device to a server, where the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model being trained on multiple sets of sample data that include sample voice data, sample status data, and the intents corresponding to the sample voice data; and receiving the intent recognition result returned by the server.
Optionally, the above processor can also execute the program code of the following steps: when the intent recognition result includes a target operation device corresponding to the voice data and a target operation instruction, sending the target operation instruction to the target operation device so that the target operation device executes the target operation instruction.
According to an embodiment of the present disclosure, a computer-readable storage medium is also provided; when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the speech intent recognition method of any one of the above embodiments.
The serial numbers of the above embodiments of the present disclosure are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present disclosure, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed technical content can be implemented in other ways. The apparatus embodiments described above are only illustrative; for example, the division of units may be a division by logical function, and there may be other division manners in actual implementation: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated units can be implemented in the form of hardware or in the form of software functional units.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, the part that contributes to the prior art, or all or part of the technical solution may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred implementations of the present disclosure. It should be pointed out that those of ordinary skill in the art can make several improvements and refinements without departing from the principles of the present disclosure, and these improvements and refinements should also be regarded as within the protection scope of the present disclosure.

Claims (14)

  1. A speech intent recognition method, comprising:
    obtaining voice data from a terminal device and status data of the terminal device;
    inputting the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, wherein the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data comprise sample voice data and sample status data, as well as intents corresponding to the sample voice data;
    returning the intent recognition result to the terminal device.
  2. The method according to claim 1, wherein obtaining the status data of the terminal device comprises:
    obtaining a device identifier of the terminal device and account information corresponding to the terminal device;
    matching out the status data of the terminal device based on the device identifier and the account information.
  3. The method according to claim 1, wherein, before inputting the voice data and the status data into the multi-classification model to obtain the intent recognition result corresponding to the voice data, the method further comprises:
    obtaining the multiple sets of sample data, wherein the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, the sample status data included in the multiple sets of sample data includes at least one of the following: time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, master control device information of the sample terminal device, bound device information of the sample terminal device, and five-tuple information of a sample account corresponding to the sample terminal device, and the intent corresponding to the sample voice data comprises an operation device and an operation instruction corresponding to the sample voice data;
    performing machine training with the multiple sets of sample data to obtain the multi-classification model.
  4. The method according to any one of claims 1 to 3, wherein the terminal device comprises a smart speaker.
  5. A speech intent recognition method, comprising:
    collecting voice data;
    reporting the voice data and status data of a terminal device to a server, wherein the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data comprise sample voice data and sample status data, as well as intents corresponding to the sample voice data;
    receiving the intent recognition result returned by the server.
  6. The method according to claim 5, wherein the method further comprises:
    when the intent recognition result comprises a target operation device corresponding to the voice data and a target operation instruction, sending the target operation instruction to the target operation device so that the target operation device executes the target operation instruction.
  7. A speech intent recognition apparatus, comprising:
    an obtaining module configured to obtain voice data from a terminal device and status data of the terminal device;
    a processing module configured to input the voice data and the status data into a multi-classification model to obtain an intent recognition result corresponding to the voice data, wherein the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data comprise sample voice data and sample status data, as well as intents corresponding to the sample voice data;
    a returning module configured to return the intent recognition result to the terminal device.
  8. The apparatus according to claim 7, wherein
    the obtaining module comprises: a first obtaining unit configured to obtain a device identifier of the terminal device and account information corresponding to the terminal device; and a matching unit configured to match out the status data of the terminal device based on the device identifier and the account information.
  9. The apparatus according to claim 7, wherein the apparatus further comprises:
    a second obtaining unit configured to obtain multiple sets of sample data, wherein the sample voice data included in the multiple sets of sample data includes keyword classifications of the speech, the sample status data included in the multiple sets of sample data includes at least one of the following: time information, spatial information, and environment information of the sample terminal device when receiving the sample voice data, master control device information of the sample terminal device, bound device information of the sample terminal device, and five-tuple information of a sample account corresponding to the sample terminal device, and the intent corresponding to the sample voice data comprises an operation device and an operation instruction corresponding to the sample voice data; and a training unit configured to perform machine training with the multiple sets of sample data to obtain the multi-classification model.
  10. The apparatus according to any one of claims 7 to 9, wherein the terminal device comprises a smart speaker.
  11. A speech intent recognition apparatus, comprising:
    a collecting module configured to collect voice data;
    a reporting module configured to report the voice data and status data of a terminal device to a server, wherein the server is configured to process the voice data and the status data using a multi-classification model to obtain an intent recognition result corresponding to the voice data, the multi-classification model is trained on multiple sets of sample data, and the multiple sets of sample data comprise sample voice data and sample status data, as well as intents corresponding to the sample voice data;
    a receiving module configured to receive the intent recognition result returned by the server.
  12. The apparatus according to claim 11, wherein the apparatus further comprises:
    an execution unit configured to, when the intent recognition result comprises a target operation device corresponding to the voice data and a target operation instruction, send the target operation instruction to the target operation device so that the target operation device executes the target operation instruction.
  13. An electronic device, comprising:
    a processor;
    a memory configured to store instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the speech intent recognition method according to any one of claims 1 to 6.
  14. A computer-readable storage medium, wherein, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the speech intent recognition method according to any one of claims 1 to 6.
PCT/CN2022/110081 2022-02-23 2022-08-03 Speech intent recognition method and apparatus, and electronic device WO2023159881A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210171068.0A CN114694644A (zh) 2022-02-23 2022-02-23 Speech intent recognition method and apparatus, and electronic device
CN202210171068.0 2022-02-23

Publications (1)

Publication Number Publication Date
WO2023159881A1 true WO2023159881A1 (zh) 2023-08-31

Family

ID=82136680

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110081 WO2023159881A1 (zh) 2022-02-23 2022-08-03 Speech intent recognition method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN114694644A (zh)
WO (1) WO2023159881A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114694644A (zh) Speech intent recognition method and apparatus, and electronic device
CN115356939A (zh) Method for sending a control instruction, control apparatus, storage medium, and electronic apparatus

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108646580A (zh) Method and apparatus for determining a control object, storage medium, and electronic apparatus
CN110286601A (zh) Method and apparatus for controlling smart home devices, control device, and storage medium
CN110501918A (zh) Smart home appliance control method and apparatus, electronic device, and storage medium
CN110942773A (zh) Method and apparatus for controlling smart home devices by voice
CN111128156A (zh) Voice control method and apparatus for smart home devices based on model training
CN113111186A (zh) Method for controlling home appliance devices, storage medium, and electronic device
US20210295203A1 (en) * 2020-03-18 2021-09-23 International Business Machines Corporation Precise chatbot-training system
CN114694644A (zh) Speech intent recognition method and apparatus, and electronic device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5974942B2 (ja) Server and network system
CN105137789A (zh) Control method and apparatus for a smart IoT home appliance, and related device
CN107908116B (zh) Voice control method, smart home system, storage medium, and computer device
CN109186037A (zh) Air conditioner control method and apparatus, storage medium, and air conditioner
CN111008532B (zh) Voice interaction method, vehicle, and computer-readable storage medium
CN112856727A (zh) Method and device for controlling an electronic apparatus


Also Published As

Publication number Publication date
CN114694644A (zh) 2022-07-01

Similar Documents

Publication Publication Date Title
CN111104495B (zh) Information interaction method, apparatus, device, and storage medium based on intent recognition
WO2023159881A1 (zh) Speech intent recognition method and apparatus, and electronic device
CN110704641B (zh) Ten-thousand-level intent classification method and apparatus, storage medium, and electronic device
KR102288249B1 (ko) Information processing method, terminal, and computer storage medium
CN107229684B (zh) Sentence classification method, system, electronic device, refrigerator, and storage medium
CN107908116A (zh) Voice control method, smart home system, storage medium, and computer device
WO2019218820A1 (zh) Method and apparatus for determining a control object, storage medium, and electronic apparatus
WO2020253064A1 (zh) Speech recognition method and apparatus, computer device, and storage medium
US11488599B2 (en) Session message processing with generating responses based on node relationships within knowledge graphs
CN113590850A (zh) Multimedia data search method, apparatus, device, and storage medium
US20140207716A1 (en) Natural language processing method and system
CN112631139B (zh) Real-time detection system and method for the rationality of smart home instructions
EP4198807A1 (en) Audio processing method and device
WO2021226840A1 (zh) Hot news intent recognition method, apparatus and device, and readable storage medium
CN113705315B (zh) Video processing method, apparatus, device, and storage medium
CN111161726B (zh) Intelligent voice interaction method, device, medium, and system
WO2021073179A1 (zh) Named-entity recognition method and device, and computer-readable storage medium
WO2023272616A1 (zh) Text understanding method, system, terminal device, and storage medium
CN110047484A (zh) Speech recognition interaction method, system, device, and storage medium
CN111353026A (zh) Intelligent legal-affairs lawyer-assistant customer service system
CN112669842A (zh) Human-machine dialogue control method, apparatus, computer device, and storage medium
RU2693332C1 (ру) Method and computer device for selecting a current context-dependent response for a current user request
CN109408658A (zh) Emoticon picture prompting method, apparatus, computer device, and storage medium
CN116821307B (zh) Content interaction method, apparatus, electronic device, and storage medium
WO2021042902A1 (зh) Method for recognizing user intent in multi-turn dialogue, and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928148

Country of ref document: EP

Kind code of ref document: A1