WO2019007247A1 - Human-machine conversation processing method and apparatus, and electronic device - Google Patents


Info

Publication number
WO2019007247A1
WO2019007247A1 (application PCT/CN2018/093225)
Authority
WO
WIPO (PCT)
Prior art keywords: session, voice, user, voice command, input
Application number: PCT/CN2018/093225
Other languages: French (fr), Chinese (zh)
Inventors: 刘广兴, 许毅
Original Assignee: 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2019007247A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command

Definitions

  • the present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and an electronic device for processing a human-machine session.
  • the device executes the voice command input by the user, such as increasing the volume, playing the video, etc.
  • The feedback text can also be converted to speech through TTS (Text To Speech) and played back to the user, for example "The volume has been increased" or "The video is already playing".
  • The device needs to be re-awakened. Re-waking the device causes considerable inconvenience in both time and procedure: the user has to re-enter the voice wake-up word, and waking the device takes a certain amount of time, which seriously affects the experience.
  • the invention provides a method, a device and an electronic device for processing a human-machine session.
  • With the present invention, the device does not need to be repeatedly woken up when the user wants to carry out a continuous conversation with it, which improves the user experience and the session efficiency.
  • a method for processing a human-machine session including:
  • After the device completes the previous voice command, the content of the previous voice command is identified to determine whether the user has a need to input a voice command again based on the previous voice command;
  • if it is determined that the user has the need to input the voice command again, voice activity detection is initiated; otherwise, the session is ended.
  • Another method for processing a human-machine session including:
  • identifying the content of the received voice instruction, determining whether the user has a need to input a voice instruction again, and performing the human-machine session operation according to the judgment result.
  • the third aspect provides a processing device for a human-machine session, including:
  • the instruction identification module is configured to identify, after the device completes the previous voice instruction, the content of the previous voice instruction, to determine whether the user has the requirement to input the voice instruction again based on the previous voice instruction;
  • the voice detection module is configured to initiate voice activity detection if it is determined that the user has the requirement to input the voice command again; otherwise, the session is ended.
  • a device for processing a human-machine session including:
  • a content identification module configured to identify content of the received voice instruction
  • a demand judging module configured to determine whether the user has a need to input a voice command again
  • an operation execution module configured to perform a human-machine session operation according to the judgment result.
  • an electronic device including:
  • a memory configured to store a program, and a processor coupled to the memory for executing the program to:
  • after the device completes the previous voice command, identify the content of the previous voice command to determine whether the user has a need to input a voice command again based on the previous voice command;
  • if it is determined that the user has the need to input the voice command again, initiate voice activity detection; otherwise, end the session.
  • another electronic device including:
  • a memory configured to store a program, and a processor coupled to the memory for executing the program to:
  • identify the content of the received voice instruction, determine whether the user has a need to input a voice instruction again, and perform the human-machine session operation according to the judgment result.
  • The method, apparatus and electronic device for processing a human-machine session provided by the present invention can improve the efficiency with which the device executes the user's continuous voice commands, and enhance the user experience, by predicting whether the user will input the next voice command after the device completes the previous one.
  • FIG. 1 is a schematic diagram 1 of a process of a human-machine session according to an embodiment of the present invention
  • FIG. 2 is a second schematic diagram of processing of a human-machine session according to an embodiment of the present invention.
  • FIG. 3 is a system structural diagram of processing of a human-machine session according to an embodiment of the present invention.
  • FIG. 4a is a first flowchart of a method for processing a human-machine session according to an embodiment of the present invention;
  • FIG. 4b is a second flowchart of a method for processing a human-machine session according to an embodiment of the present invention;
  • FIG. 5a is a first structural diagram of a processing apparatus for a human-machine session according to an embodiment of the present invention;
  • FIG. 5b is a second structural diagram of a device for processing a human-machine session according to an embodiment of the present invention.
  • FIG. 6 is a third structural diagram of a device for processing a human-machine session according to an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram 1 of an electronic device according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram 2 of an electronic device according to an embodiment of the present invention.
  • The machine performs the voice command input by the user, such as increasing the volume or playing a video; after the instruction is executed, feedback can also be given to the user through TTS, for example by playing "The volume has been increased".
  • The device can perform voice feedback, or respond to the voice command. For example, if the user inputs "tell a joke", the device will select a joke and answer through TTS; if the user inputs "how is the weather today", the device will broadcast the weather forecast through TTS.
  • The invention changes the prior-art processing flow of ending the session immediately after the device completes the voice instruction. The core idea is that after the device completes a voice instruction, the content of the voice command previously input by the user is first analyzed to determine whether the user will also input a next voice command. If it is determined that the user will input a next voice command, the flow enters the voice activity detection (VAD) process after the device completes the command; if it is determined that the user will not input a next voice command, the session is terminated, thereby increasing the efficiency of the user's continuous session.
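The core idea above can be sketched as a single routing step; this is a minimal illustration with hypothetical names (`predict_follow_up`, `start_vad`, `end_session` stand in for the real prediction and session subsystems), not the patent's actual implementation.

```python
# A minimal sketch of the core idea: after the device finishes a voice
# command, a prediction based on that command decides whether to enter
# VAD or to terminate the session. All names are illustrative.

def after_command(previous_command, predict_follow_up, start_vad, end_session):
    """Route the flow once the previous voice command has been executed."""
    if predict_follow_up(previous_command):
        start_vad()      # the user is expected to speak again
    else:
        end_session()    # no next command predicted
```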
  • FIG. 1 is a schematic diagram of a process of a human-machine session according to an embodiment of the present invention.
  • the basic flow of the human-machine session is user wake-up, VAD, voice input, ASR (Automatic Speech Recognition), semantic analysis, instruction execution, system feedback, and TTS.
  • In consideration of the continuous session scenario, the present application also addresses how each link in the above process judges whether the entire session should be terminated.
  • In the existing scheme, the completion of the entire process from user wake-up to the final TTS feedback is taken as the session-termination condition, and the continuous session scenario is basically not considered. Once an abnormal situation occurs in any link of the session flow, the error type is judged, abnormal feedback is given through TTS, and after the TTS broadcast the session unit is terminated. In some special cases, such as when the voice system actively asks the user a question, the session flow resumes from the VAD link after the TTS broadcast.
  • FIG. 2 is a processing logic diagram of terminating a session in a continuous session scenario.
  • In this logic there are generally five steps: domain judgment, VAD, ASR, semantic parsing, and execution of voice instructions.
  • After the VAD is started, if a voice signal is detected within the set time, the voice signal is sent to the ASR for speech analysis to form text; if no voice signal is detected, the session is ended.
  • The ASR performs text parsing on the speech signal. If text content is parsed out, the text content is semantically parsed; if no text content is parsed out, the session is terminated.
  • Semantic parsing analyzes the text content and judges whether the statement falls into a preset field. If the statement does not fall into any field, or falls into the field of stopping the session (terminating the session), the session is terminated; if the statement falls into an existing field, a voice instruction is formed according to that field.
  • Instruction execution controls the corresponding device to execute the determined voice instruction and gives feedback through TTS.
  • After execution, the flow returns to the first step; that is, the field of the previous instruction is judged again to determine whether the user needs a continuous session.
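The five steps above form a loop with a session-ending condition at each link; the sketch below is a hedged illustration in which the four callables stand in for the real VAD/ASR/parsing/execution subsystems, each returning `None` to model that link's termination condition.

```python
# Control-loop sketch of the five steps (domain judgment, VAD, ASR,
# semantic parsing, instruction execution). The callables are assumed
# stand-ins for the real subsystems.

def continuous_session(needs_follow_up, detect_speech, recognize, parse, execute,
                       max_turns=10):
    for _ in range(max_turns):
        if not needs_follow_up():          # step 1: domain judgment
            break
        audio = detect_speech()            # step 2: VAD (None = timeout)
        if audio is None:
            break
        text = recognize(audio)            # step 3: ASR (None = no text parsed)
        if text is None:
            break
        command = parse(text)              # step 4: semantic parsing
        if command is None:                # no field matched, or stop instruction
            break
        execute(command)                   # step 5: run the instruction, then loop
```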
  • the embodiment of the present invention provides a processing system for the human-machine session, which is used to improve the efficiency of the user's active continuous session in the human-machine session scenario.
  • the system includes: a device 310 and a server 320.
  • Apparatus 310 includes:
  • Human-computer interaction devices in the human-machine session such as microphones, stereos, etc., and operating devices that execute voice commands, such as media playback devices, air conditioners, televisions, refrigerators, and the like.
  • the device 310 is configured to interact with a person during a human-machine session, including voice signal collection, TTS feedback, and the like, and perform specific operations of the voice instruction.
  • The server 320 has logic processing functions for controlling VAD startup, ASR, semantic analysis, forming voice control commands, and feeding back to the device.
  • The server 320 specifically includes: a human-machine session processing device 321 and a domain library 322;
  • the processing device 321 of the human-machine session includes:
  • The instruction identification module is configured to: after the device 310 completes the previous voice instruction, identify the content of the previous voice instruction and determine whether the user has the requirement to input the voice instruction again based on the previous voice instruction; when the content of the voice instruction is recognized, the demand type of the voice instruction is judged against the domain library 322.
  • Intents belonging to a plurality of different fields are preset and stored in the domain library 322.
  • A domain is a function that implements a certain type of user need in a human-computer interaction system.
  • Domain identification is the process of determining which type of need the user's voice belongs to.
  • An intent is the function of realizing a single explicit requirement of a user within a certain domain in a human-computer interaction system.
  • Intent recognition is the process of determining which specific requirement in a certain domain the user's voice belongs to.
  • By identifying the content of the user's previous voice command against the domain library 322, it can be determined whether the user still has a need to input a voice command again.
  • the voice detection module is configured to start the voice activity detection VAD if it is determined that the user has the requirement to input the voice instruction again; otherwise, end the session.
  • The device for receiving the voice signal, such as the microphone, is always on, but only after the server 320 decides to start the VAD detection flow is the voice signal received by the microphone transmitted to the processing device 321 of the human-machine session.
  • Conventionally, the VAD detection process is only started at the beginning of each session to detect the voice signal; thereafter the VAD is automatically turned off until the user wakes the device again. Therefore, when the command recognition module determines that the user has a need to input the voice command again, the voice detection module is triggered to start the VAD.
  • The voice detection module is further configured to end the session if no voice signal is detected within the specified detection time after the VAD is started; otherwise, the voice recognition module is triggered to perform automatic speech recognition (ASR) on the detected voice signal.
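The gating just described (the microphone stays on, but its signal reaches the session processor only while the server has triggered VAD) can be illustrated with a small hypothetical gate; the class and method names are assumptions, not the patent's interface.

```python
# Hypothetical gate: the microphone keeps capturing, but frames are
# forwarded to the session processor only while the server has enabled
# VAD for an expected follow-up command.

class VadGate:
    def __init__(self):
        self.enabled = False

    def enable(self):        # follow-up predicted: start forwarding audio
        self.enabled = True

    def disable(self):       # session ended: stop forwarding audio
        self.enabled = False

    def forward(self, frame):
        """Return the frame for processing only while VAD is active."""
        return frame if self.enabled else None
```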
  • A time calculation module is configured to perform the following.
  • The VAD detection time has a great impact on the effect of the VAD.
  • The VAD detection time should be determined according to the user's pronunciation time in the continuous conversation scenario, and the pronunciation habits of different users differ. This requires dynamic adjustment of the VAD execution time, specifically by dynamically optimizing the VAD detection time based on the user's habits when using the speaker.
  • The times counted from these habits include the average time from when the user wakes the device to when the voice command is issued (the first average time), and the average time from the start of the VAD to the user's voice command (the second average time).
  • the time calculation module calculates the specified detection time according to the first average time and the second average time, and may include:
  • The specified detection time T4 is calculated as T4 = T3 + (T2 - T3)/2, wherein T2 is the sum of the first average time T1 and the preset redundancy time, and T3 is the second average time.
  • The speech recognition module is further configured to: after performing ASR on the detected speech signal, end the session if no text content is recognized; otherwise, trigger the semantic parsing module to perform semantic parsing on the recognized text content.
  • The semantic parsing module is configured to: after semantically parsing the recognized text content, end the session if the parsed semantics do not fall into any preset field, or if the parsed semantics explicitly end the session; otherwise, generate the voice instruction according to the field into which the parsed semantics fall, and trigger the instruction execution module to control the corresponding device to perform the operation according to the voice instruction.
  • In the processing system for the human-machine session, after the device completes the previous voice command, the content of the previous voice command is identified to determine whether the user has a requirement to input a voice command again based on it; if so, the voice activity detection VAD is started; otherwise, the session is ended. Further, during the continuous session, conditions for ending the session are set in advance for each link according to its execution status; once a condition is met, the session is ended, realizing the complete flow of a continuous session.
  • FIG. 4a is a first flowchart of the processing method of the human-machine session shown in the embodiment of the present invention; the execution body of the method is the processing device of the human-machine session. As shown in FIG. 4a, the processing method of the human-machine session includes the following steps:
  • After inputting a voice command, the user often wants to input a new voice command based on the content and execution result of the previous one. For example, if the previous voice input is "help me search for Transformers", then after the system search is completed the search result list is displayed on the screen, and at this time the user is likely to input the voice command "play the first one" for the search results.
  • The content of the previously completed voice command is identified to determine whether the user has a need to input a voice command again based on it.
  • Domain judgment: that is, determining whether there is a need to enter a continuous session state after a command in a given domain is completed.
  • the triggering of a continuous session needs to be determined according to the type of the user's previous voice command and the execution status.
  • A control-class instruction (e.g., "turn on the light") is basically a one-off user operation, and no continuous session is required in such a scenario; for instructions whose results invite a follow-up, such as the search instruction above, a continuous session is required. If a continuous session is required, the user is considered to have a need to input the voice command again based on the previous voice command.
  • If so, step S420 is performed; otherwise, step S430 is performed.
  • the device side can be controlled to open the VAD detection process to collect the voice command that the user may input, and upload it to the processing device of the human-machine session on the server side for identification processing.
  • In step S430, the session can be ended and the device is controlled to enter the standby state.
  • The identification of the content of the previous voice command and the determination of whether the user has the requirement to input the voice command again can be processed on the server side; that is, the server decides whether to open the continuous session process of the front-end device.
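The domain judgment of step S410 can be illustrated with a toy split between one-off control commands and result-producing commands; the domain sets below are assumptions for illustration, not the patent's actual domain library.

```python
# Illustrative domain judgment for step S410: control-class commands are
# one-off operations, while search/query commands invite a follow-up.
# Domain names are assumed for the example.

CONTROL_DOMAINS = {"light_control", "volume_control"}
FOLLOW_UP_DOMAINS = {"search", "media_browse", "weather_query"}

def needs_continuous_session(previous_domain):
    """True -> perform S420 (open VAD); False -> perform S430 (end session)."""
    if previous_domain in CONTROL_DOMAINS:
        return False
    return previous_domain in FOLLOW_UP_DOMAINS
```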
  • the above method further includes:
  • If no voice signal is detected within the specified detection time after the voice activity detection is started, the session is ended; otherwise, automatic speech recognition (ASR) is performed on the detected voice signal.
  • The server side controls the device side to start the VAD process and transmits the voice signal detected by the microphone to the server side for analysis. If no vocal input is detected within the specified detection time, the session is considered ended and the VAD process is closed; if vocal input is detected within the specified detection time, the detected sound is uploaded to the server for ASR processing.
  • Vocal detection needs to screen out noise interference.
  • Steady-state noise, such as air-conditioner noise and motor noise with stable frequencies, is easier to identify and screen out; dynamic noise, such as singing or television sound, whose frequency changes and which may contain recorded human voices, is harder to screen out. Therefore, the VAD detection time has a great influence on the effect of the VAD.
  • the detection time of the VAD should be determined according to the pronunciation time of the user in a continuous session, and the pronunciation habits of different users are different.
  • The dynamic adjustment of the VAD is based on the user's habits when using the speaker, so as to optimize the VAD detection time.
  • the optimization strategy is as follows:
  • In the scenario of normally waking up the device, the average time T1 (the first average time) from the wake-up of the device to the issuance of the command is calculated in real time.
  • The detection time of the initial continuous session defaults to T1, the average time in the user wake-up state, plus 3 seconds of redundancy time.
  • The VAD detection time in the initial continuous session is a long, fault-tolerant time; it needs to be converged according to the user's actual average pronunciation time in the continuous session state. During each of the user's sessions, the average time T3 (the second average time) from the start of the VAD to the user's voice command in the continuous session state is calculated in real time.
  • T2 is corrected by the user's actual average utterance time T3 in the continuous session state, so that a relatively reasonable specified detection time T4 can be obtained.
  • The processing step of calculating the specified detection time according to the first average time and the second average time may include:
  • T4 = T3 + (T2 - T3)/2 (1)
  • According to formula (1), the specified detection time T4 is calculated, wherein T2 is the sum of the first average time T1 and the preset redundancy time, and T3 is the second average time.
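Formula (1) can be checked numerically; the sketch below assumes the 3-second redundancy described above and uses hypothetical function and parameter names.

```python
# Numeric sketch of formula (1). T1 is the measured average from device
# wake-up to the command; T3 is the average from VAD start to the command
# in continuous sessions; T2 = T1 + redundancy (3 s by default), and T4
# converges the detection window toward the user's actual habit.

def specified_detection_time(t1, t3, redundancy=3.0):
    """Return T4 = T3 + (T2 - T3) / 2 with T2 = T1 + redundancy (seconds)."""
    t2 = t1 + redundancy
    return t3 + (t2 - t3) / 2.0
```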
  • the above method further includes:
  • If no text content is recognized after ASR is performed on the detected voice signal, the session is ended; otherwise, the recognized text content is semantically parsed.
  • If text content is recognized, it can be transmitted to the next processing step, namely semantic parsing.
  • the above method further includes:
  • If the parsed semantics do not fall into any preset field, or explicitly end the session, the session ends; otherwise, a voice instruction is generated according to the field into which the parsed semantics fall, and the corresponding device is controlled to perform the operation according to the voice instruction.
  • Semantic parsing can be divided into three parts: domain identification, intent identification, and execution logic judgment. Among them, the judgment of whether the continuous dialogue terminates can basically be completed in the domain identification link, which may include, but is not limited to, the following two judgment conditions:
  • The session terminates when the semantics of the text parsed by ASR do not fall into any field (multiple different fields are preset in the domain library). If interference noise is misjudged and enters the semantic analysis stage in a continuous session scenario, such messy, illogical statements are not easily understood as belonging to any domain, and the session is terminated.
  • Stop words: when the parsed semantics fall into the "stop words" field, it indicates that the user has explicitly used an instruction to stop the continuous session, and the session is terminated.
  • The corpus in the "stop words" field includes, for example, "good", "thank you", and "nothing is wrong".
  • If neither condition is met, the field instruction is executed normally; that is, the voice instruction is generated according to the specific intent within the matched field, and the device is then controlled to perform the corresponding operation according to the voice instruction.
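The two termination conditions plus the normal execution path can be sketched as a single decision function; the field names below ("stop_words" and the preset fields) are illustrative assumptions.

```python
# Hedged sketch of the domain-identification decision: condition 1 ends
# the session when no preset field matches, condition 2 when the user
# explicitly stops; otherwise the instruction is executed.

PRESET_FIELDS = {"music", "video", "weather", "search"}

def semantic_parse_decision(parsed_field):
    """parsed_field is None when the semantics match no field at all."""
    if parsed_field is None:             # condition 1: no field matched
        return "end_session"
    if parsed_field == "stop_words":     # condition 2: explicit stop instruction
        return "end_session"
    if parsed_field in PRESET_FIELDS:
        return "execute"                 # generate and run the voice instruction
    return "end_session"
```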
  • After the instruction is successfully executed, the process returns to the initial step S410 to judge the domain of the previous voice command, determine whether the user has a need to input a voice command again based on it, and start the VAD process.
  • Alternatively, step S410 may be skipped: the VAD is directly turned on, and the continuous dialogue scenario is entered from step S420.
  • In the processing method of the human-machine session provided by the embodiment of the present invention, after the device completes the previous voice instruction, the content of the previous voice command is identified to determine whether the user has a requirement to input a voice command again based on it; if so, voice activity detection is started; otherwise, the session is ended, thereby effectively improving session processing efficiency in the continuous session scenario.
  • By presetting the judgment conditions of the current session for each link, the session process can be ended once the conditions are satisfied, ensuring the integrity of the continuous session.
  • FIG. 4b illustrates another method for processing a human-machine session according to an embodiment of the present invention. The method is slightly changed based on the method shown in FIG. 4a. As shown in FIG. 4b, the processing method of the human-machine session includes the following steps:
  • After inputting a voice command, the user often wants to input a new voice command based on the content and execution result of the previous one. For example, if the previous voice input is "help me search for Transformers", then after the system search is completed the search result list is displayed on the screen, and at this time the user is likely to input the voice command "play the first one" for the search results.
  • the content of the voice command is first identified.
  • For the specific identification process, refer to the related content of step S410.
  • After identifying the content of the voice command, it is determined whether there is a need to input the voice command again based on it; for the specific judgment process, refer to the related content of step S410.
  • In addition, for the personalized habits of different users (distinguished, for example, by voiceprint recognition), historical data on whether the user performs voice input again may be collected, and the probability that the user inputs a voice instruction again is calculated from these statistics. If the obtained probability is greater than a preset probability threshold, it is determined that the user has a need to input the voice instruction again; otherwise, it is determined that the user will not input the voice command again.
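The personalized statistic just described can be sketched as follows; the function name, the boolean history encoding, and the 0.5 default threshold are assumptions for illustration.

```python
# Illustrative per-user statistic: from a user's history (True = the user
# did speak again after a command), estimate the follow-up probability
# and compare it with a preset threshold.

def user_needs_follow_up(history, threshold=0.5):
    if not history:
        return False                          # no data: do not assume a follow-up
    probability = sum(history) / len(history)
    return probability > threshold
```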
  • If it is determined that the user has the need, voice activity detection is started after the device completes the voice command, so as to implement the continuous session; otherwise, the session ends after the device completes the voice command.
  • The processing method of the human-machine session provided by the embodiment of the present invention identifies the content of the received voice command, determines whether the user has the requirement to input the voice command again, and then performs the human-machine session operation according to the judgment result, thereby effectively improving session processing efficiency in a continuous session scenario.
  • FIG. 5a is a first structural diagram of a processing device for a human-machine session according to an embodiment of the present invention.
  • the processing device of the human-machine session can be used to perform the method steps shown in FIG. 4a, including:
  • the instruction identification module 510 is configured to: after the device completes the previous voice instruction, identify the content of the previous voice instruction, and determine whether the user has the requirement to input the voice instruction again based on the previous voice instruction;
  • the voice detection module 520 is configured to start the voice activity detection VAD if it is determined that the user has the requirement to input the voice instruction again; otherwise, end the session.
  • the above apparatus further includes a voice recognition module 530;
  • The voice detection module 520 is further configured to: after the VAD is started, end the session if no voice signal is detected within the specified detection time; otherwise, trigger the voice recognition module 530 to perform automatic speech recognition on the detected voice signal.
  • The foregoing apparatus further includes a time calculation module 540, configured to: collect the first average time from when the user wakes the device to when the voice command is issued during each session, and the second average time from the start of the VAD to the user's voice command; the specified detection time is calculated according to the first average time and the second average time.
  • The time calculation module 540 calculates the specified detection time according to the first average time and the second average time, including:
  • according to formula (1), calculating the specified detection time T4, wherein T2 is the sum of the first average time T1 and the preset redundancy time, and T3 is the second average time.
  • the above apparatus further includes a semantic parsing module 550;
  • The speech recognition module 530 is further configured to: after performing ASR on the detected speech signal, end the current session if no text content is recognized; otherwise, trigger the semantic parsing module 550 to perform semantic parsing on the recognized text content.
  • the above apparatus further includes an instruction execution module 560;
  • The semantic parsing module 550 is further configured to: after semantically parsing the recognized text content, end the session if the parsed semantics do not fall into any preset domain or explicitly end the session; otherwise, generate the voice instruction according to the domain into which the parsed semantics fall, and trigger the instruction execution module 560 to control the corresponding device to perform the operation according to the voice instruction.
  • The processing device for the human-machine session provided by the embodiment of the present invention identifies the content of the previous voice command after the device completes it, and determines whether the user has the requirement to input the voice command again based on it; if so, the voice activity detection VAD is started; otherwise, the session is ended, thereby effectively improving session processing efficiency in the continuous session scenario.
  • By presetting the judgment conditions of the current session for each link, the session process can be ended once the conditions are satisfied, ensuring the integrity of the continuous session.
  • FIG. 6 is a structural diagram of a processing device for a human-machine session according to an embodiment of the present invention
  • the processing device of the human-machine session can be used to perform the method steps shown in FIG. 4b, including:
  • a content identification module 610 configured to identify content of the received voice instruction
  • the requirement judging module 620 is configured to determine whether the user has a requirement for inputting a voice instruction again;
  • the operation module 630 is configured to perform a human-machine session operation according to the determination result.
  • The processing device for the human-machine session provided by the embodiment of the present invention identifies the content of the received voice command, determines whether the user has the requirement to input the voice command again, and then performs the human-machine session operation according to the determination result, thereby effectively improving session processing efficiency in a continuous session scenario.
  • FIG. 7 is a schematic structural diagram of the electronic device according to the embodiment of the present invention, which specifically includes a memory 710 and a processor 720.
  • The memory 710 is configured to store a program.
  • The memory 710 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • The memory 710 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • The processor 720 is coupled to the memory 710 and configured to execute the program in the memory 710 to:
  • after the device completes the previous voice command, identify the content of the previous voice command to determine whether the user needs to input a voice command again based on the previous voice command; and
  • if it is determined that the user needs to input a voice command again, start voice activity detection (VAD); otherwise, end the session.
  • The electronic device may further include: a communication component 730, a power component 740, an audio component 750, a display 760, and the like. Only some of the components are schematically illustrated in FIG. 7; this does not mean that the electronic device includes only the components shown in FIG. 7.
  • The communication component 730 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • The communication component 730 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • The communication component 730 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • The power component 740 provides power to various components of the electronic device.
  • The power component 740 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
  • The audio component 750 is configured to output and/or input audio signals.
  • The audio component 750 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode.
  • The received audio signals may be further stored in the memory 710 or transmitted via the communication component 730.
  • The audio component 750 also includes a speaker for outputting audio signals.
  • The display 760 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it can be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
  • FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes a memory 810 and a processor 820.
  • The memory 810 is configured to store a program.
  • The memory 810 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on the electronic device, contact data, phone book data, messages, pictures, videos, and the like.
  • The memory 810 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk.
  • The processor 820 is coupled to the memory 810 and configured to execute the program in the memory 810 to:
  • identify the content of a received voice command, determine whether the user needs to input a voice command again, and perform a human-machine session operation according to the determination result.
  • The electronic device may further include: a communication component 830, a power component 840, an audio component 850, a display 860, and the like. Only some of the components are schematically illustrated in FIG. 8; this does not mean that the electronic device includes only the components shown in FIG. 8.
  • The communication component 830 is configured to facilitate wired or wireless communication between the electronic device and other devices.
  • The electronic device can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • The communication component 830 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel.
  • The communication component 830 also includes a near field communication (NFC) module to facilitate short-range communication.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
  • The power component 840 provides power to various components of the electronic device.
  • The power component 840 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
  • The audio component 850 is configured to output and/or input audio signals.
  • The audio component 850 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode.
  • The received audio signals may be further stored in the memory 810 or transmitted via the communication component 830.
  • The audio component 850 also includes a speaker for outputting audio signals.
  • The display 860 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it can be implemented as a touch screen to receive input signals from the user.
  • The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors can sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
  • The aforementioned program can be stored in a computer-readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.


Abstract

Provided are a human-machine conversation processing method and apparatus, and an electronic device. The method comprises: after a device completes the last voice command, identifying the content of the last voice command to determine whether a user needs to input a voice command again based on the last voice command; if it is determined that the user needs to input a voice command again, initiating voice activity detection; otherwise, ending the conversation. The solution of the embodiments of the present invention satisfies a user's need to actively hold a continuous conversation with a device without repeatedly waking up the device, thereby improving the user experience and the conversation efficiency.

Description

Human-machine session processing method, apparatus, and electronic device
This application claims priority to Chinese Patent Application No. 201710539395.6, filed on July 4, 2017 and entitled "Human-machine session processing method, apparatus, and electronic device", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, and an electronic device for processing a human-machine session.
Background
In a human-machine session scenario, after the user inputs a voice command to the device, the device executes the voice command input by the user, for example, increasing the volume or playing a video. After the command is executed, the device may also give the user feedback through TTS (Text To Speech), such as playing "The volume has been increased" or "The video is now open". When the device completes a voice command, it considers the entire session terminated and enters a sleep state.
However, in a continuous session scenario, if the user has a further voice command to input, the device needs to be woken up again. Re-waking the device causes considerable inconvenience in both time and procedure; for example, the user has to input the voice wake-up word again, and waking the device also takes a certain amount of time, which seriously affects the user experience.
Summary of the Invention
The present invention provides a human-machine session processing method, apparatus, and electronic device, which satisfy the user's need to actively hold a continuous session with the device without repeatedly waking up the device, improving the user experience and the session efficiency.
To achieve the above object, the embodiments of the present invention adopt the following technical solutions:
In a first aspect, a human-machine session processing method is provided, including:
after the device completes the previous voice command, identifying the content of the previous voice command to determine whether the user needs to input a voice command again based on the previous voice command; and
if it is determined that the user needs to input a voice command again, starting voice activity detection; otherwise, ending the session.
In a second aspect, another human-machine session processing method is provided, including:
identifying the content of a received voice command;
determining whether the user needs to input a voice command again; and
performing a human-machine session operation according to the determination result.
In a third aspect, a human-machine session processing apparatus is provided, including:
an instruction identification module, configured to identify, after the device completes the previous voice command, the content of the previous voice command to determine whether the user needs to input a voice command again based on the previous voice command; and
a voice detection module, configured to start voice activity detection if it is determined that the user needs to input a voice command again, and otherwise end the session.
In a fourth aspect, another human-machine session processing apparatus is provided, including:
a content identification module, configured to identify the content of a received voice command;
a requirement judging module, configured to determine whether the user needs to input a voice command again; and
an operation module, configured to perform a human-machine session operation according to the determination result.
In a fifth aspect, an electronic device is provided, including:
a memory, configured to store a program; and
a processor, coupled to the memory and configured to execute the program to:
after the device completes the previous voice command, identify the content of the previous voice command to determine whether the user needs to input a voice command again based on the previous voice command; and
if it is determined that the user needs to input a voice command again, start voice activity detection; otherwise, end the session.
In a sixth aspect, another electronic device is provided, including:
a memory, configured to store a program; and
a processor, coupled to the memory and configured to execute the program to:
identify the content of a received voice command;
determine whether the user needs to input a voice command again; and
perform a human-machine session operation according to the determination result.
With the human-machine session processing method, apparatus, and electronic device provided by the present invention, after the device completes the previous voice command, a predictive judgment is made on whether the user will input a next voice command, which improves the efficiency with which the device executes the user's consecutive voice commands and enhances the user experience.
The above description is only an overview of the technical solutions of the present application. To allow the technical means of the present application to be understood more clearly and implemented in accordance with the contents of the specification, and to make the above and other objects, features, and advantages of the present application more apparent, specific embodiments of the present application are set forth below.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present application. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
FIG. 1 is a first logical schematic diagram of human-machine session processing according to an embodiment of the present invention;
FIG. 2 is a second logical schematic diagram of human-machine session processing according to an embodiment of the present invention;
FIG. 3 is a system structural diagram of human-machine session processing according to an embodiment of the present invention;
FIG. 4a is a first flowchart of a human-machine session processing method according to an embodiment of the present invention;
FIG. 4b is a second flowchart of a human-machine session processing method according to an embodiment of the present invention;
FIG. 5a is a first structural diagram of a human-machine session processing apparatus according to an embodiment of the present invention;
FIG. 5b is a second structural diagram of a human-machine session processing apparatus according to an embodiment of the present invention;
FIG. 6 is a third structural diagram of a human-machine session processing apparatus according to an embodiment of the present invention;
FIG. 7 is a first schematic structural diagram of an electronic device according to an embodiment of the present invention;
FIG. 8 is a second schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure will be more thoroughly understood and its scope fully conveyed to those skilled in the art.
In an existing human-machine session scenario, after the user inputs a voice command to the machine, the machine executes the voice command input by the user, for example, increasing the volume or playing a video. After the command is executed, the machine may also give the user feedback through TTS, such as playing "The volume has been increased". Of course, giving the user feedback through TTS is not a required operation; the device may perform voice feedback, or the device may answer the voice command. For example, if the user inputs "tell a joke", the device selects a joke and answers through TTS; if the user inputs "what is the weather today", the device broadcasts the weather forecast through TTS. For convenience of description, whether the device answers the user's question through TTS, gives feedback, or only performs the operation corresponding to the voice command without feedback, these cases are collectively referred to as "the device completes the voice command".
The present invention changes the prior-art processing flow in which the session ends immediately after the device completes a voice command. Its core idea is that, after the device completes a voice command, the content of the voice command previously input by the user is first analyzed to determine whether the user will input a next voice command. If it is determined that the user will input a next voice command, the voice activity detection (VAD) process is entered after the device completes the voice command; if it is determined that the user will not input a next voice command, the session is terminated, thereby improving the efficiency of the user's continuous sessions.
FIG. 1 is a logical schematic diagram of human-machine session processing according to an embodiment of the present invention. In this diagram, the basic flow of a human-machine session is, in order: user wake-up, VAD, voice input, ASR (Automatic Speech Recognition), semantic parsing, command execution, system feedback, and TTS; this flow forms a closed loop. If, after the content of the previous voice command is analyzed, the user is determined to need to issue voice commands continuously, VAD can be restarted after TTS to perform voice detection, repeating the original session flow.
In addition, the present application also addresses the problem of determining, at each link of the above flow, when the entire session terminates in a continuous session scenario.
The existing scheme for determining session termination takes the completion of the entire flow, from user wake-up to the final TTS feedback, as the termination condition, and basically does not consider continuous session scenarios. Once an abnormal situation occurs at any link in the session flow, the error type is determined and abnormal feedback is given through TTS; after the TTS broadcast, the session unit is considered terminated. In some special cases, such as when the voice system actively asks the user a question, the session flow restarts from the VAD link after the TTS broadcast.
FIG. 2 is a processing logic diagram of terminating a session in a continuous session scenario. This logic generally includes five steps: domain judgment, VAD, ASR, semantic parsing, and voice command execution.
Domain judgment: determine whether the continuous session state needs to be entered after the previous voice command is completed. If it is determined, according to the content of the previous voice command, that the user needs to issue a further voice command, VAD is started; otherwise, the session is ended.
VAD: after VAD is started, if a voice signal is detected within the set time, the voice signal is sent to ASR for speech recognition to form text; if no voice signal is detected, the session is ended.
ASR: perform text recognition on the voice signal. If text content is obtained, the text content is semantically parsed; if no text content is obtained after recognition, the session is terminated.
Semantic parsing: semantically parse the text content and determine whether the statement in the text falls into a preset domain. If the statement does not fall into any domain, or the statement falls into the stop-listening (terminate session) domain, the session is terminated; if the statement falls into an existing domain, a voice command is formed according to that domain.
Command execution: according to the determined voice command, control the corresponding device to execute the voice command and give feedback through TTS.
After the voice command is executed, the flow returns to the first step, that is, the domain of the previous command continues to be judged to determine whether the user needs a continuous session.
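The five-step loop described above (domain judgment, VAD, ASR, semantic parsing, and command execution) can be sketched as follows; every function here is a hypothetical stand-in for the real component, not part of the application:

```python
def run_continuous_session(detect_voice, recognize, parse_domain,
                           execute, needs_followup, first_command):
    """Hypothetical driver for the FIG. 2 loop. Each argument is a
    stand-in callable for one component of the session flow."""
    command = first_command
    while True:
        execute(command)                     # execute command + TTS feedback
        if not needs_followup(command):      # domain judgment
            return "session_ended"
        signal = detect_voice()              # VAD within the set time
        if signal is None:                   # nothing heard: end session
            return "session_ended"
        text = recognize(signal)             # ASR
        if not text:                         # no text recognized
            return "session_ended"
        domain = parse_domain(text)          # semantic parsing
        if domain is None or domain == "stop_listening":
            return "session_ended"
        command = {"domain": domain, "text": text}

# Illustrative run with stubbed components: a "search" command is
# followed by one spoken follow-up that lands in the "volume" domain.
signals = iter(["louder please", None])
result = run_continuous_session(
    detect_voice=lambda: next(signals),
    recognize=lambda s: s,
    parse_domain=lambda t: "volume",
    execute=lambda c: None,
    needs_followup=lambda c: c["domain"] == "search",
    first_command={"domain": "search", "text": "find jazz"},
)
print(result)  # session_ended
```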
Based on the logical schematic diagrams of the human-machine session processing method shown in FIG. 1 and FIG. 2, an embodiment of the present invention provides a human-machine session processing system to improve the efficiency of the user's active continuous sessions in human-machine session scenarios. As shown in FIG. 3, the system includes: a device 310 and a server 320.
The device 310 includes:
human-computer interaction devices in the human-machine session, such as microphones and speakers, and operating devices that execute voice commands, such as media playback devices, air conditioners, televisions, and refrigerators.
The device 310 is configured to interact with the user during the human-machine session, including voice signal collection and TTS feedback, and to perform the specific operations of voice commands.
The server 320 has the logic processing functions of controlling VAD startup, ASR, semantic parsing, forming voice control commands, and feeding them back to the device.
As shown in FIG. 3, the server 320 specifically includes: a human-machine session processing apparatus 321 and a domain library 322.
The human-machine session processing apparatus 321 includes:
an instruction identification module, configured to identify, after the device 310 completes the previous voice command, the content of the previous voice command to determine whether the user needs to input a voice command again based on the previous voice command. When identifying the content of the voice command, the requirement type of the voice command needs to be judged against the domain library 322. The domain library 322 stores a plurality of preset intents in a plurality of different domains.
A domain is a function in a human-computer interaction system that implements a certain class of user requirements. Domain identification is the process of determining which class of requirements a user's utterance belongs to.
An intent is a function in a human-computer interaction system that implements a single, specific user requirement within a certain domain. Intent identification is the process of determining which specific requirement within a domain a user's utterance belongs to.
By identifying and judging the content of the user's previous voice command against the domain library 322, it can be determined whether the user still needs to input a voice command again.
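To illustrate the two-level domain/intent classification described above, here is a minimal sketch; the keyword rules, domain names, and intent names are invented for this example and are not taken from the application:

```python
# Hypothetical two-level classification: first the domain (class of
# requirement), then the intent (single specific requirement within it).
DOMAIN_INTENTS = {
    "weather": {"query_forecast": ["forecast", "weather"]},
    "media":   {"play": ["play"], "pause": ["pause"]},
}

def classify(utterance):
    text = utterance.lower()
    for domain, intents in DOMAIN_INTENTS.items():
        for intent, keywords in intents.items():
            if any(k in text for k in keywords):
                return domain, intent
    return None, None  # falls into no preset domain

print(classify("What's the weather today?"))  # ('weather', 'query_forecast')
print(classify("Please play some jazz"))      # ('media', 'play')
print(classify("Tell me a story"))            # (None, None)
```

A production system would replace the keyword matching with a trained semantic parser, but the routing decision — domain first, then intent — stays the same.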
a voice detection module, configured to start voice activity detection (VAD) if it is determined that the user needs to input a voice command again, and otherwise end the session.
On the device 310 side, the device that receives voice signals, such as the microphone, is always on, but the voice signals received by the microphone are transmitted to the human-machine session processing apparatus 321, as voice command signals in the human-machine session flow, only after the server 320 determines to start the VAD detection process. The VAD detection process is started only at the beginning of each session to detect voice signals; when the system considers that there is no voice input, VAD is automatically turned off and is not started again until the user wakes up the device again. Therefore, when the above instruction identification module determines that the user needs to input a voice command again, it triggers the voice detection module to start VAD.
The voice detection module is further configured to, after VAD is started, end the session if no voice signal is detected within a specified detection time, and otherwise trigger the voice recognition module to perform automatic speech recognition (ASR) on the detected voice signal.
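As an illustrative sketch of this timeout behavior (the function names, polling interval, and detection time are assumptions for this example, not values from the application):

```python
import time

def detect_voice(has_voice, detection_time_s=3.0, poll_s=0.05):
    """Hypothetical VAD step: poll for a voice signal until the
    specified detection time elapses. `has_voice` is a callable
    standing in for the real voice-activity detector."""
    deadline = time.monotonic() + detection_time_s
    while time.monotonic() < deadline:
        if has_voice():
            return "send_to_asr"   # voice detected: hand off to ASR
        time.sleep(poll_s)
    return "end_session"           # nothing heard within the set time

# A stub detector that "hears" voice on the third poll.
calls = {"n": 0}
def stub_detector():
    calls["n"] += 1
    return calls["n"] >= 3

print(detect_voice(stub_detector, detection_time_s=0.5))  # send_to_asr
print(detect_voice(lambda: False, detection_time_s=0.2))  # end_session
```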
a time calculation module, configured to:
collect statistics on a first average time, over the user's sessions, from the device being successfully woken up to the user issuing a voice command; collect statistics on a second average time, over the user's sessions, from VAD being started to the user issuing a voice command; and calculate the specified detection time from the first average time and the second average time.
The VAD detection time has a relatively large influence on the effect of VAD. The detection time should be determined according to the user's speaking time in continuous session scenarios; different users have different speaking habits, which requires dynamically adjusting the VAD detection time, specifically, dynamically optimizing it based on the user's habits when using the speaker. The times collected from these habits include the average time from the device being woken up to the user issuing a voice command, and the average time from VAD being started to the user issuing a voice command.
Further, the time calculation module calculating the specified detection time from the first average time and the second average time may include:
calculating according to T4 = T3 + (T2 - T3) / 2
to obtain the specified detection time T4, where T2 is the sum of the first average time T1 and a preset redundancy time, and T3 is the second average time.
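As a worked example of this formula (the numeric values are illustrative only): if T1 = 2.0 s, the preset redundancy time is 0.5 s, and T3 = 1.5 s, then T2 = 2.5 s and T4 = 1.5 + (2.5 - 1.5) / 2 = 2.0 s.

```python
def detection_time(t1_avg_wake_to_command, redundancy, t3_avg_vad_to_command):
    """Compute the VAD detection time T4 = T3 + (T2 - T3) / 2,
    where T2 = T1 + a preset redundancy time."""
    t2 = t1_avg_wake_to_command + redundancy
    return t3_avg_vad_to_command + (t2 - t3_avg_vad_to_command) / 2

print(detection_time(2.0, 0.5, 1.5))  # 2.0
```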
The voice recognition module is further configured to, after performing ASR on the detected voice signal, end the session if no text content is recognized, and otherwise trigger the semantic parsing module to semantically parse the recognized text content.
The semantic parsing module is configured to, after semantically parsing the recognized text content, end the session if the parsed semantics do not fall into any preset domain, or if the parsed semantics explicitly indicate ending the session; otherwise, generate a voice command according to the domain the parsed semantics fall into, and trigger the instruction execution module to control the corresponding device to perform the operation according to the voice command.
本发明实施例提供的人机会话的处理***,在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求;如果确定用户有再次输入语音指令的需求,则启动语音活动检测VAD;否则,结束本次会话。进一步地,在连续会话过程中,根据各环节执行情况,预先设置结束会话的条件,并在判定条件形成后,结束本次会话,实现连续会话的完整流程。The processing system for the human-machine session provided by the embodiment of the present invention identifies the content of the previous voice command after the device completes the previous voice command, and determines whether the user has the requirement to input the voice command again based on the previous voice command; If it is determined that the user has the need to input the voice command again, the voice activity detection VAD is started; otherwise, the session is ended. Further, in the continuous session process, according to the execution status of each link, the conditions for ending the session are set in advance, and after the determination condition is formed, the session is ended, and the complete process of the continuous session is realized.
下面通过多个实施例来进一步说明本申请的技术方案。The technical solutions of the present application are further described below through various embodiments.
实施例一Embodiment 1
基于上述根据语音指令内容判定是否进行连续会话的方案思想，如图4a所示，其为本发明实施例示出的人机会话的处理方法流程图一，该方法的执行主体为图3中所示的人机会话的处理装置。如图4a所示，该人机会话的处理方法包括如下步骤：Based on the above idea of determining whether to conduct a continuous session according to the content of a voice instruction, FIG. 4a shows a first flowchart of a human-machine session processing method according to an embodiment of the present invention; the method is executed by the human-machine session processing apparatus shown in FIG. 3. As shown in FIG. 4a, the processing method of the human-machine session includes the following steps:
S410,在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求。S410: After the device completes the previous voice command, identify the content of the previous voice command, and determine whether the user has the requirement to input the voice command again based on the previous voice command.
在现有的人机会话场景中，用户在输入一条语音指令后，往往还想基于该语音指令的内容以及执行结果，再次输入新的语音指令，例如，上一条语音输入是“帮我搜索变形金刚”，***搜索完毕后，通过屏幕展示搜索结果列表，此时用户很可能会再输入“播放第一个”搜索结果的语音指令。In an existing human-machine session scenario, after inputting a voice command, the user often wants to input a new voice command based on the content and execution result of that command. For example, if the previous voice input is "Help me search for Transformers", the system displays the search result list on the screen after the search completes, and the user is then likely to input another voice command such as "play the first one".
为了在满足用户连续会话需求的前提下，提高人机会话效率，本实施例中在设备每完成一条语音指令后，会对上一条已完成的语音指令的内容进行识别，确定用户是否有基于上一条语音指令而再次输入语音指令的需求。从而确定是否要启动连续会话的流程。To improve human-machine session efficiency while satisfying the user's need for continuous sessions, in this embodiment, each time the device completes a voice command, the content of the completed command is identified to determine whether the user needs to input another voice command based on it, thereby deciding whether to start the continuous session flow.
领域判断，即判断某一领域指令完成后是否需要进入连续会话状态。连续会话的触发需要根据用户上一个语音指令的需求类型及执行情况进行判定。Domain judgment means determining whether to enter the continuous session state after a command in a certain domain is completed. Triggering a continuous session is decided according to the demand type and execution status of the user's previous voice command.
判断连续会话的依据来自于但不局限于以下几个原则:The basis for judging a continuous conversation comes from but is not limited to the following principles:
1.是否明确不需要连续会话；如控制类指令（“我要把灯打开”）基本上是用户的一次操作，这类场景下不需要连续会话；1. Whether a continuous session is explicitly unnecessary: control-type instructions ("I want to turn the light on") are basically one-off user operations, and no continuous session is needed in such scenarios;
2.是否经常有上下文诉求：如天气领域（“今天天气怎么样？”——“明天呢”）；2. Whether there are frequent contextual follow-ups: e.g., the weather domain ("How is the weather today?" — "What about tomorrow?");
3.是否是一个连续多指令需求：如电影查询（“帮我搜索变形金刚”——“播放第一个”——“全屏播放”）。3. Whether it is a continuous multi-instruction demand: e.g., a movie query ("Help me search for Transformers" — "play the first one" — "play full screen").
基于上述原则来判定是否需要进行连续会话,如果需要进行连续会话,即认为用户有基于上一条语音指令而再次输入语音指令的需求。Based on the above principles, it is determined whether a continuous session is required. If a continuous session is required, the user is considered to have a need to input the voice command again based on the previous voice command.
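As a minimal sketch of the domain-based judgment above (the domain names and the continuation table are illustrative assumptions, not part of the original text), the decision could look like:

```python
# Domains whose commands are one-off operations (principle 1): no follow-up.
ONE_SHOT_DOMAINS = {"device_control"}            # e.g. "I want to turn the light on"
# Domains with frequent contextual or multi-instruction follow-ups (principles 2, 3).
FOLLOW_UP_DOMAINS = {"weather", "movie_search"}

def needs_continuous_session(last_domain):
    """Decide whether to start VAD (continuous session, step S420) after a
    completed command in `last_domain`, or to end the session (step S430)."""
    if last_domain in ONE_SHOT_DOMAINS:
        return False                      # explicitly no follow-up needed
    return last_domain in FOLLOW_UP_DOMAINS
```

In practice the continuation table would be derived from the demand type and execution status of each domain's commands, as the text describes.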
如果确定用户有再次输入语音指令的需求,则执行步骤S420,否则执行S430。If it is determined that the user has a need to input the voice instruction again, step S420 is performed, otherwise, S430 is performed.
S420,启动语音活动检测VAD;S420, starting a voice activity detection VAD;
S430,结束本次会话。S430, end this session.
当确定用户有再次输入语音指令的需求时，可以控制设备侧开启VAD的检测流程，以采集用户可能输入的语音指令，并上传给服务器侧的人机会话的处理装置进行识别处理。When it is determined that the user needs to input a voice command again, the device side can be controlled to start the VAD detection flow to collect the voice command the user may input, and upload it to the server-side human-machine session processing apparatus for recognition.
当确定用户没有再次输入语音指令的需求时，可以结束本次会话，控制设备进入待机状态。When it is determined that the user does not need to input a voice command again, the session can be ended and the device controlled to enter the standby state.
本实施例中的对上一条语音指令的内容进行识别，确定用户是否有基于上一条语音指令而再次输入语音指令的需求的判定过程，可在服务器侧处理完成，即由服务器控制前端设备是否开启连续会话流程。In this embodiment, the process of identifying the content of the previous voice command and determining whether the user needs to input a voice command again based on it can be completed on the server side; that is, the server controls whether the front-end device starts the continuous session flow.
进一步地,上述方法还包括:Further, the above method further includes:
在启动VAD后,如果在指定的检测时间内没有检测到语音信号,则结束本次会话;否则,对检测到的语音信号进行自动语音识别ASR。After the VAD is started, if no voice signal is detected within the specified detection time, the session is ended; otherwise, the detected voice signal is automatically voice-recognized ASR.
当确定用户有再次输入语音指令的需求时,服务器侧控制设备侧启动VAD流程,并将麦克风检测的语音信号传送至服务器侧进行解析。如果在指定的检测时间内没有检测到人声输入,则认为会话结束,VAD流程关闭;如在指定的检测时间内检测到人声输入,则将检测到的声音上传到服务器端进行ASR处理。When it is determined that the user has the requirement to input the voice instruction again, the server side controls the device side to start the VAD process, and transmits the voice signal detected by the microphone to the server side for analysis. If the vocal input is not detected within the specified detection time, the session is considered to be ended and the VAD process is closed. If the vocal input is detected within the specified detection time, the detected sound is uploaded to the server for ASR processing.
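A minimal sketch of this VAD stage, assuming a hypothetical `detect_speech` callback that returns the captured audio or `None` if no human voice appears within the window:

```python
def run_vad_stage(detect_speech, detection_time_s):
    """VAD stage as described above: wait up to `detection_time_s` seconds
    for a human voice; on timeout the session ends and VAD closes, otherwise
    the detected sound is forwarded for server-side ASR processing."""
    audio = detect_speech(timeout=detection_time_s)
    if audio is None:
        return ("end_session", None)   # no voice within the window
    return ("do_asr", audio)           # upload the sound for ASR
```

For example, `run_vad_stage(lambda timeout: None, 5.0)` models a timeout and yields the end-of-session outcome.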
人声检测需要屏蔽掉噪音干扰，稳态的噪音比较好识别和屏蔽，如频率稳定不变的空调噪音，电机噪音；但是动态的噪音比较难屏蔽，如歌声，电视机噪音等频率变化较大且包含人声录音的噪音。因此VAD的检测时间对VAD的效果影响比较大。Human voice detection needs to screen out noise interference. Steady-state noise with a stable frequency, such as air-conditioner or motor noise, is relatively easy to identify and screen out; dynamic noise, such as singing or television audio, whose frequency varies widely and which contains recorded human voices, is much harder to screen out. The VAD detection time therefore has a considerable impact on the effectiveness of VAD.
VAD的检测时间应该根据用户在连续会话下的发音时间来决定，不同的用户发音习惯不同。动态的调整VAD是根据用户使用音箱的习惯动态的优化VAD检测的时间，优化的策略如下：The VAD detection time should be determined according to the user's utterance timing in continuous sessions, and pronunciation habits differ between users. Dynamically adjusting VAD means optimizing the VAD detection time according to the user's habits when using the speaker. The optimization strategy is as follows:
统计用户在各次会话过程中，用户从设备唤醒成功到发出语音指令的第一平均时间；统计用户在各次会话过程中，从启动VAD到用户发出语音指令的第二平均时间；根据第一平均时间和第二平均时间计算得到指定的检测时间。Across the user's sessions, count the first average time from a successful device wake-up to the user issuing a voice command; count the second average time from starting VAD to the user issuing a voice command; and calculate the specified detection time from the first average time and the second average time.
具体的,在用户的各次会话过程中,实时计算用户在正常唤醒设备的场景下,从设备唤醒成功到发出指令发出的平均时间T1(第一平均时间)。Specifically, during the session of the user, the average time T1 (first average time) from the wake-up of the device to the issuance of the command is calculated in real time in the scenario of the normal wake-up device.
设置初始连续会话下的VAD的检测时间T2=T1+3(s)，由于连续会话状态下，用户的发音一般会较慢于设备正常唤醒下的发音，故初始的连续会话的检测时间默认为在用户唤醒设备状态下的评估时间基础上增加3秒（冗余时间）。The VAD detection time for the initial continuous session is set to T2 = T1 + 3 (s). Since in the continuous session state the user generally begins speaking later than after a normal device wake-up, the initial continuous-session detection time defaults to the evaluation time measured in the wake-up state plus 3 seconds (redundancy time).
设置的初始连续会话下的VAD的检测时间为一个较长的容错时间，需要根据用户在连续会话状态的实际平均发音时间进行收敛，故在用户的各次会话过程中，还需要实时计算连续会话状态下，从启动VAD到用户发出语音指令的平均时间T3（第二平均时间）。The VAD detection time set for the initial continuous session is a long fault-tolerance window and needs to converge toward the user's actual average utterance time in the continuous session state; therefore, during each of the user's sessions, the average time T3 (second average time) from starting VAD to the user issuing a voice command in the continuous session state also needs to be calculated in real time.
在初始连续会话下的VAD的检测时间T2基础上，利用用户在连续会话状态下的实际平均发音时间T3对T2进行修正，从而可得到比较合理的指定的检测时间T4。Based on the detection time T2 of the VAD in the initial continuous session, T2 is corrected by the actual average utterance time T3 of the user in the continuous session state, so that a relatively reasonable specified detection time T4 can be obtained.
进一步地,根据第一平均时间和第二平均时间计算得到指定的检测时间的处理步骤,可包括:Further, the processing step of calculating the specified detection time according to the first average time and the second average time may include:
通过如下公式:By the following formula:
T4=T3+(T2-T3)/2         (1)T4=T3+(T2-T3)/2 (1)
计算得到上述的指定的检测时间T4;其中,T2为第一平均时间T1和预设冗余时间之和,T3为所述第二平均时间。The specified detection time T4 is calculated; wherein T2 is the sum of the first average time T1 and the preset redundancy time, and T3 is the second average time.
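Formula (1) can be sketched directly; the 3-second redundancy is the default stated in the text, and the numeric inputs below are illustrative:

```python
REDUNDANCY_S = 3.0  # preset redundancy added to T1 (3 seconds per the text)

def vad_detection_time(t1, t3):
    """Formula (1): T4 = T3 + (T2 - T3) / 2, where T2 = T1 + redundancy.
    `t1` is the average wake-up-to-command time, `t3` the average
    VAD-start-to-command time in continuous sessions (seconds)."""
    t2 = t1 + REDUNDANCY_S
    return t3 + (t2 - t3) / 2.0
```

With T1 = 2.0 s and T3 = 4.0 s, T2 = 5.0 s and T4 = 4.5 s; T4 always lies halfway between T3 and T2, converging the initial fault-tolerance window toward the user's observed speaking pace.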
进一步地,上述方法还包括:Further, the above method further includes:
在对检测到的语音信号进行ASR后,如果未识别出文本内容,则结束本次会话;否则,对识别出的文本内容进行语义解析。After the ASR is performed on the detected speech signal, if the text content is not recognized, the session is ended; otherwise, the recognized text content is semantically parsed.
在ASR阶段依然有判断连续会话终止的条件。当有噪音在VAD环节被误判为人声输入时，ASR可能无法识别此段声音，此时识别结果可能返回为空，故可以认为在连续会话场景下，ASR操作结果返回为空时此次会话结束。但也存在用户的语音指令由于客观因素影响导致ASR无法识别返回为空的情况，故在ASR环节终止本次会话的同时，可通过TTS提示用户设备没有听清，有需要请再次唤醒。There are still conditions for judging continuous-session termination at the ASR stage. When noise is misjudged as human voice input at the VAD stage, ASR may fail to recognize the sound and return an empty result; thus, in a continuous session scenario, the session can be considered ended when the ASR result is empty. However, objective factors may also cause ASR to return empty for a genuine user voice command, so when the session is terminated at the ASR stage, the user can be prompted via TTS that the device did not hear clearly and should be woken again if needed.
当ASR执行完成后,识别出文本内容,可将文本内容继续传输至下一处理环节,即进行语义解析。When the execution of the ASR is completed, the text content is recognized, and the text content can be further transmitted to the next processing step, that is, semantic analysis.
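The ASR-stage termination condition can be sketched as follows; the TTS prompt string is a placeholder assumption standing in for the "device did not hear clearly, please wake again" message:

```python
def handle_asr_result(text):
    """ASR-stage decision described above: an empty recognition result ends
    the session (with a TTS prompt asking the user to wake the device again
    if needed); otherwise the text moves on to semantic parsing."""
    if not text:
        return ("end_session", "tts:did_not_hear_please_wake_again")
    return ("semantic_parse", text)
```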
进一步地,上述方法还包括:Further, the above method further includes:
在对识别出的文本内容进行语义解析后，如果解析得到的语义未进入任一预置的领域，或者解析得到的语义明确为结束本次会话，则结束本次会话；否则，根据解析得到的语义所进入的领域生成语音指令，并根据语音指令控制相应设备执行操作。After semantic parsing is performed on the recognized text content, if the parsed semantics do not fall into any preset domain, or explicitly indicate ending the session, the session is ended; otherwise, a voice instruction is generated according to the domain the parsed semantics fall into, and the corresponding device is controlled to perform an operation according to the voice instruction.
通过理解用户语音指令的实际语义可进行连续会话终止的判断。The judgment of continuous session termination can be made by understanding the actual semantics of the user's voice instructions.
语义解析可分为三个环节，领域识别，意图识别，执行逻辑判断，其中，用以判断连续对话是否终止的判断基本上在领域识别环节就可完成，具体可包括但不限于如下两种判断条件：Semantic parsing can be divided into three stages: domain recognition, intent recognition, and execution logic judgment. The judgment of whether a continuous dialogue terminates can basically be completed at the domain recognition stage, and may include, but is not limited to, the following two conditions:
当经ASR解析的文本中的语义未落到任何一个领域（领域库中预置有多个不同的领域）时，会话终止。如果连续会话场景下，干扰噪音被误判进入语义解析阶段，杂乱没有逻辑的语句不容易被语义理解到某一领域中，此时会话终止；The session terminates when the semantics of the text parsed by ASR do not fall into any domain (multiple different domains are preset in the domain library). In a continuous session scenario, if interfering noise is misjudged and enters the semantic parsing stage, a jumbled, illogical utterance is unlikely to be semantically mapped into any domain, and the session terminates;
当解析得到的语义落入到“停止词”领域时，表示用户明确使用指令来停止连续会话的场景，此时会话终止。“停止词”领域中的语料如“好的”、“谢谢”、“没事儿了”……。When the parsed semantics fall into the "stop word" domain, the user is explicitly using an instruction to stop the continuous session, and the session terminates. The corpus of the "stop word" domain includes utterances such as "OK", "thanks", "never mind", and so on.
当语义落入除“停止词”以外的其他领域,正常执行该领域指令,即根据落入到的领域中的具体意图,生成语音指令,然后根据语音指令控制设备执行相应操作。When the semantics falls into other fields than the "stop word", the field instruction is normally executed, that is, the voice instruction is generated according to the specific intention in the field that falls into, and then the corresponding operation is performed according to the voice instruction control device.
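A minimal sketch of this domain-recognition decision (the upstream classifier and its `None`/`"stop_words"` outputs are assumptions for illustration; the sample corpus is taken from the text):

```python
STOP_WORD_CORPUS = {"好的", "谢谢", "没事儿了"}  # sample "stop word" corpus

def handle_domain_result(domain, text):
    """Apply the two termination conditions above, else execute the command.
    `domain` is the hypothetical output of a domain classifier; None means
    the text fell into no preset domain."""
    if domain is None:
        return "end_session"          # noise / illogical text: terminate
    if domain == "stop_words" or text in STOP_WORD_CORPUS:
        return "end_session"          # user explicitly stops the session
    return "execute_instruction"      # generate and execute the instruction
```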
指令执行成功后,可重新回到起始步骤S410进行上一条语音指令的领域判断,确定用户是否有基于上一条语音指令而再次输入语音指令的需求,以及启动VAD流程。After the instruction is successfully executed, the process returns to the initial step S410 to determine the domain of the previous voice command, determine whether the user has the need to input the voice command again based on the previous voice command, and start the VAD process.
指令执行失败,则可跳过步骤S410,直接开启VAD,从步骤S420开始进入连续对话情景。If the instruction execution fails, step S410 may be skipped, the VAD is directly turned on, and the continuous dialogue scenario is entered from step S420.
本发明实施例提供的人机会话的处理方法,在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求;如果确定用户有再次输入语音指令的需求,则启动语音活动检测;否则,结束本次会话,从而有效的提高在连续会话场景下的会话处理效率。The processing method of the human-machine session provided by the embodiment of the present invention, after the device completes the previous voice instruction, identifies the content of the previous voice command, and determines whether the user has the requirement to input the voice command again based on the previous voice command; If it is determined that the user has the requirement to input the voice instruction again, the voice activity detection is started; otherwise, the session is ended, thereby effectively improving the session processing efficiency in the continuous session scenario.
进一步地,在连续会话执行过程中,通过对各个环节预制结束本次会话的判断条件,可在条件满足时,结束本次会话流程,保证连续会话的完整性。Further, in the continuous session execution process, by pre-making the judgment conditions of the current session for each link, the session process can be ended when the conditions are satisfied, and the integrity of the continuous session is ensured.
实施例二Embodiment 2
图4b为本发明实施例提供的另一种人机会话的处理方法,该方法在图4a所示方法的基础上,进行了少许改变。如图4b所示,该人机会话的处理方法包括如下步骤:FIG. 4b illustrates another method for processing a human-machine session according to an embodiment of the present invention. The method is slightly changed based on the method shown in FIG. 4a. As shown in FIG. 4b, the processing method of the human-machine session includes the following steps:
S440,对所接收的语音指令的内容进行识别。S440, identifying the content of the received voice command.
在现有的人机会话场景中，用户在输入一条语音指令后，往往还想基于该语音指令的内容以及执行结果，再次输入新的语音指令，例如，上一条语音输入是“帮我搜索变形金刚”，***搜索完毕后，通过屏幕展示搜索结果列表，此时用户很可能会再输入“播放第一个”搜索结果的语音指令。In an existing human-machine session scenario, after inputting a voice command, the user often wants to input a new voice command based on the content and execution result of that command. For example, if the previous voice input is "Help me search for Transformers", the system displays the search result list on the screen after the search completes, and the user is then likely to input another voice command such as "play the first one".
为了在满足用户连续会话需求的前提下，提高人机会话效率，本实施例中在设备每收到语音指令后，先会对该语音指令的内容进行识别，具体识别判断过程可参见步骤S410的相关内容。To improve human-machine session efficiency while satisfying the user's need for continuous sessions, in this embodiment, each time the device receives a voice command, the content of the command is first identified; for the specific identification and judgment process, refer to the relevant content of step S410.
S450,判断用户是否有再次输入语音指令的需求;S450, determining whether the user has a need to input a voice instruction again;
在对语音指令的内容进行识别后，判断用户是否存在基于该语音指令再次输入语音指令的需求。具体判断过程可参见步骤S410的相关内容。After the content of the voice command is identified, it is determined whether the user needs to input another voice command based on this command. For the specific judgment process, refer to the relevant content of step S410.
例如，可针对不同用户（声纹识别）的个性化习惯对用户是否再次进行语音输入的结果进行历史数据的统计，根据统计结果计算用户再次输入语音指令的概率。如果得到的概率大于预设的概率阈值，则判定用户有再次输入语音指令的需求；否则，判定用户没有再次输入语音指令的需求。For example, for the personalized habits of different users (identified by voiceprint), historical data on whether the user input speech again can be collected, and the probability that the user will input a voice command again can be calculated from the statistics. If the resulting probability is greater than a preset probability threshold, it is determined that the user needs to input a voice command again; otherwise, it is determined that the user does not.
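The probability judgment described here might be sketched as follows; the history encoding and the 0.5 threshold are illustrative assumptions:

```python
def reinput_probability(history):
    """`history` is a hypothetical per-user (voiceprint-keyed) record:
    True where the user did speak again after this kind of command."""
    return sum(history) / len(history) if history else 0.0

def wants_reinput(history, threshold=0.5):
    """Per the text: a probability above the preset threshold means the
    user is judged to need another voice input."""
    return reinput_probability(history) > threshold
```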
S460,根据判断结果,执行人机会话操作。S460. Perform a human-machine session operation according to the judgment result.
通过用户输入的语音指令的内容,判断用户是否有再次输入语音的需求,并根据判断结果指导后续的处理操作。例如,如果确定用户有再次输入语音指令的需求,则在设备完成本次语音指令后执行启动语音活动检测,以实现连续会话;否则,在设备完成本次语音指令后结束本次会话。Through the content of the voice command input by the user, it is determined whether the user has a need to input the voice again, and the subsequent processing operation is guided according to the judgment result. For example, if it is determined that the user has the requirement to input the voice instruction again, the start voice activity detection is performed after the device completes the voice command to implement the continuous session; otherwise, the session ends after the device completes the voice command.
另外,需要说明的是,实施例一中所示的方法中的步骤也可在本实施例中的方法步骤中执行,在此对步骤原理不做赘述。In addition, it should be noted that the steps in the method shown in the first embodiment can also be performed in the method steps in this embodiment, and the step principle is not described herein.
本发明实施例提供的人机会话的处理方法,对所接收的语音指令的内容进行识别,判断用户是否有再次输入语音指令的需求;然后根据判断结果,执行人机会话操作,从而有效的提高在连续会话场景下的会话处理效率。The processing method of the human-machine session provided by the embodiment of the present invention identifies the content of the received voice command, determines whether the user has the requirement to input the voice command again; and then performs the human-machine session operation according to the judgment result, thereby effectively improving Session processing efficiency in a continuous session scenario.
实施例三Embodiment 3
如图5a所示，为本发明实施例的人机会话的处理装置结构图一，该人机会话的处理装置可用于执行如图4a所示的方法步骤，其包括：As shown in FIG. 5a, which is a first structural diagram of a human-machine session processing apparatus according to an embodiment of the present invention, the apparatus can be used to perform the method steps shown in FIG. 4a, and includes:
指令识别模块510,用于在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求;The instruction identification module 510 is configured to: after the device completes the previous voice instruction, identify the content of the previous voice instruction, and determine whether the user has the requirement to input the voice instruction again based on the previous voice instruction;
语音检测模块520,用于如果确定用户有再次输入语音指令的需求,则启动语音活动检测VAD;否则,结束本次会话。The voice detection module 520 is configured to start the voice activity detection VAD if it is determined that the user has the requirement to input the voice instruction again; otherwise, end the session.
进一步地,如图5b所示,上述装置还包括语音识别模块530;Further, as shown in Figure 5b, the above apparatus further includes a voice recognition module 530;
语音检测模块520，还用于在启动VAD后，如果在指定的检测时间内没有检测到语音信号，则结束本次会话；否则，触发语音识别模块530对检测到的语音信号进行自动语音识别ASR。The voice detection module 520 is further configured to, after VAD is started, end the session if no voice signal is detected within the specified detection time; otherwise, trigger the voice recognition module 530 to perform automatic speech recognition (ASR) on the detected voice signal.
进一步地，如图5b所示，上述装置还包括时间计算模块540，用于：统计用户在各次会话过程中，用户从设备唤醒成功到发出语音指令的第一平均时间；统计用户在各次会话过程中，从启动VAD到用户发出语音指令的第二平均时间；根据第一平均时间和第二平均时间计算得到上述指定的检测时间。Further, as shown in FIG. 5b, the apparatus further includes a time calculation module 540, configured to: count, across the user's sessions, the first average time from a successful device wake-up to the user issuing a voice command; count the second average time from starting VAD to the user issuing a voice command; and calculate the above specified detection time from the first average time and the second average time.
进一步地,上述时间计算模块540根据第一平均时间和第二平均时间计算得到指定的检测时间,包括:Further, the time calculation module 540 calculates the specified detection time according to the first average time and the second average time, including:
根据T4=T3+(T2-T3)/2According to T4=T3+(T2-T3)/2
计算得到指定的检测时间T4;其中,T2为第一平均时间T1和预设冗余时间之和,T3为第二平均时间。The specified detection time T4 is calculated; wherein T2 is the sum of the first average time T1 and the preset redundancy time, and T3 is the second average time.
进一步地,如图5b所示,上述装置还包括语义解析模块550;Further, as shown in Figure 5b, the above apparatus further includes a semantic parsing module 550;
语音识别模块530，还用于在对检测到的语音信号进行ASR后，如果未识别出文本内容，则结束本次会话；否则，触发语义解析模块550对识别出的文本内容进行语义解析。The speech recognition module 530 is further configured to, after performing ASR on the detected voice signal, end the session if no text content is recognized; otherwise, trigger the semantic parsing module 550 to perform semantic parsing on the recognized text content.
进一步地,如图5b所示,上述装置还包括指令执行模块560;Further, as shown in Figure 5b, the above apparatus further includes an instruction execution module 560;
语义解析模块550，还用于在对识别出的文本内容进行语义解析后，如果解析得到的语义未进入任一预置的领域，或者解析得到的语义明确为结束本次会话，则结束本次会话；否则，根据解析得到的语义所进入的领域生成语音指令，并触发指令执行模块560根据语音指令控制相应设备执行操作。The semantic parsing module 550 is further configured to, after performing semantic parsing on the recognized text content, end the session if the parsed semantics do not fall into any preset domain, or explicitly indicate ending the session; otherwise, generate a voice instruction according to the domain the parsed semantics fall into, and trigger the instruction execution module 560 to control the corresponding device to perform an operation according to the voice instruction.
本发明实施例提供的人机会话的处理装置,在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求;如果确定用户有再次输入语音指令的需求,则启动语音活动检测VAD;否则,结束本次会话,从而有效的提高在连续会话场景下的会话处理效率。The processing device for the human-machine session provided by the embodiment of the present invention identifies the content of the previous voice command after the device completes the previous voice command, and determines whether the user has the requirement to input the voice command again based on the previous voice command; If it is determined that the user has the requirement to input the voice command again, the voice activity detection VAD is started; otherwise, the session is ended, thereby effectively improving the session processing efficiency in the continuous session scenario.
进一步地,在连续会话执行过程中,通过对各个环节预制结束本次会话的判断条件,可在条件满足时,结束本次会话流程,保证连续会话的完整性。Further, in the continuous session execution process, by pre-making the judgment conditions of the current session for each link, the session process can be ended when the conditions are satisfied, and the integrity of the continuous session is ensured.
实施例四Embodiment 4
如图6所示,为本发明实施例提供的人机会话的处理装置结构图二,该人机会话的处理装置可用于执行如图4b所示的方法步骤,其包括:As shown in FIG. 6, which is a structural diagram of a processing device for a human-machine session according to an embodiment of the present invention, the processing device of the human-machine session can be used to perform the method steps shown in FIG. 4b, including:
内容识别模块610,用于对所接收的语音指令的内容进行识别;a content identification module 610, configured to identify content of the received voice instruction;
需求判断模块620,用于判断用户是否有再次输入语音指令的需求;The requirement judging module 620 is configured to determine whether the user has a requirement for inputting a voice instruction again;
执行操作模块630,用于根据判断结果,执行人机会话操作。The operation module 630 is configured to perform a human-machine session operation according to the determination result.
本发明实施例提供的人机会话的处理装置,对所接收的语音指令的内容进行识别,判断用户是否有再次输入语音指令的需求;然后根据判断结果,执行人机会话操作,从而有效的提高在连续会话场景下的会话处理效率。The processing device for the human-machine session provided by the embodiment of the present invention identifies the content of the received voice command, determines whether the user has the requirement to input the voice command again, and then performs the human-machine session operation according to the determination result, thereby effectively improving Session processing efficiency in a continuous session scenario.
实施例五Embodiment 5
前面实施例三描述了人机会话的处理装置的整体架构，该装置的功能可借助一种电子设备实现完成，如图7所示，其为本发明实施例的电子设备的结构示意图，具体包括：存储器710和处理器720。The foregoing Embodiment 3 described the overall architecture of the human-machine session processing apparatus; the functions of the apparatus can be implemented by an electronic device. FIG. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes a memory 710 and a processor 720.
存储器710,用于存储程序。The memory 710 is configured to store a program.
除上述程序之外,存储器710还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。In addition to the above described procedures, memory 710 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器710可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。 Memory 710 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Disk or Optical Disk.
处理器720,耦合至存储器710,用于执行存储器710中的程序,以用于:The processor 720 is coupled to the memory 710 for executing a program in the memory 710 for:
在设备完成上一条语音指令后,对上一条语音指令的内容进行识别,确定用户是否有基于上一条语音指令而再次输入语音指令的需求;After the device completes the previous voice command, it identifies the content of the previous voice command to determine whether the user has the need to input the voice command again based on the previous voice command;
如果确定用户有再次输入语音指令的需求,则启动语音活动检测VAD;否则,结束本次会话。If it is determined that the user has a need to input a voice command again, the voice activity detection VAD is initiated; otherwise, the session is ended.
上述的具体处理操作已经在前面实施例中进行了详细说明,在此不再赘述。The specific processing operations described above have been described in detail in the foregoing embodiments, and are not described herein again.
进一步,如图7所示,电子设备还可以包括:通信组件730、电源组件740、音频组件750、显示器760等其它组件。图7中仅示意性给出部分组件,并不意味着电子设备只包括图7所示组件。Further, as shown in FIG. 7, the electronic device may further include: a communication component 730, a power component 740, an audio component 750, a display 760, and the like. Only some of the components are schematically illustrated in FIG. 7, and it is not meant that the electronic device includes only the components shown in FIG.
通信组件730被配置为便于电子设备和其他设备之间有线或无线方式的通信。电子设备可以接入基于通信标准的无线网络,如WiFi,2G或3G,或它们的组合。在一个示例性实施例中,通信组件730经由广播信道接收来自外部广播管理***的广播信号或广播相关信息。在一个示例性实施例中,通信组件730还包括近场通信(NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(RFID)技术,红外数据协会(IrDA)技术,超宽带(UWB)技术,蓝牙(BT)技术和其他技术来实现。 Communication component 730 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device can access a wireless network based on a communication standard such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, communication component 730 receives broadcast signals or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, communication component 730 also includes a near field communication (NFC) module to facilitate short range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
电源组件740,为电子设备的各种组件提供电力。电源组件740可以包括电源管理***,一个或多个电源,及其他与为电子设备生成、管理和分配电力相关联的组件。A power component 740 provides power to various components of the electronic device. Power component 740 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device.
音频组件750被配置为输出和/或输入音频信号。例如，音频组件750包括一个麦克风（MIC），当电子设备处于操作模式，如呼叫模式、记录模式和语音识别模式时，麦克风被配置为接收外部音频信号。所接收的音频信号可以被进一步存储在存储器710或经由通信组件730发送。在一些实施例中，音频组件750还包括一个扬声器，用于输出音频信号。The audio component 750 is configured to output and/or input an audio signal. For example, the audio component 750 includes a microphone (MIC) that is configured to receive an external audio signal when the electronic device is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 710 or transmitted via the communication component 730. In some embodiments, the audio component 750 also includes a speaker for outputting audio signals.
显示器760包括屏幕,其屏幕可以包括液晶显示器(LCD)和触摸面板(TP)。如果屏幕包括触摸面板,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与触摸或滑动操作相关的持续时间和压力。The display 760 includes a screen whose screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor can sense not only the boundaries of the touch or sliding action, but also the duration and pressure associated with the touch or slide operation.
实施例六Embodiment 6
前面实施例四描述了人机会话的处理装置的整体架构，该装置的功能可借助一种电子设备实现完成，如图8所示，其为本发明实施例的电子设备的结构示意图，具体包括：存储器810和处理器820。The foregoing Embodiment 4 described the overall architecture of the human-machine session processing apparatus; the functions of the apparatus can be implemented by an electronic device. FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, which specifically includes a memory 810 and a processor 820.
存储器810,用于存储程序。The memory 810 is configured to store a program.
除上述程序之外,存储器810还可被配置为存储其它各种数据以支持在电子设备上的操作。这些数据的示例包括用于在电子设备上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。In addition to the above described procedures, memory 810 can also be configured to store various other data to support operations on the electronic device. Examples of such data include instructions for any application or method operating on an electronic device, contact data, phone book data, messages, pictures, videos, and the like.
存储器810可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(SRAM),电可擦除可编程只读存储器(EEPROM),可擦除可编程只读存储器(EPROM),可编程只读存储器(PROM),只读存储器(ROM),磁存储器,快闪存储器,磁盘或光盘。The memory 810 can be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable. Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Disk or Optical Disk.
处理器820,耦合至存储器810,用于执行存储器810中的程序,以用于:The processor 820 is coupled to the memory 810 for executing a program in the memory 810 for:
对所接收的语音指令的内容进行识别;Identifying the content of the received voice command;
判断用户是否有再次输入语音指令的需求;Determining whether the user has a need to input a voice command again;
根据判断结果,执行人机会话操作。According to the judgment result, the human-machine session operation is performed.
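The three processor steps above can be sketched as a small control-flow function. This is a minimal illustration, not the application's implementation: the recognizer and the need-judgment model are passed in as stand-in callables, and the example predicate (treating playback requests as likely to get follow-ups) is a hypothetical assumption.

```python
from enum import Enum, auto

class SessionAction(Enum):
    START_VAD = auto()    # start voice activity detection; keep the session open
    END_SESSION = auto()  # close the current session

def process_voice_command(command_text, recognize, needs_reinput):
    """Run the three processor steps in order:
    1) recognize the content of the received voice command,
    2) judge whether the user needs to input a voice command again,
    3) perform the session operation chosen by that judgment."""
    content = recognize(command_text)
    if needs_reinput(content):
        return SessionAction.START_VAD
    return SessionAction.END_SESSION

# Toy stand-ins for the recognizer and the need-judgment model:
result = process_voice_command(
    "play some jazz",
    recognize=lambda text: text.lower(),
    needs_reinput=lambda content: "play" in content,  # assumed heuristic
)
```

Here `result` is `SessionAction.START_VAD`, so the device would keep listening rather than ending the session.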
The specific processing operations above have been described in detail in the foregoing embodiments and are not repeated here.
Further, as shown in FIG. 8, the electronic device may also include other components such as a communication component 830, a power component 840, an audio component 850, and a display 860. Only some of the components are schematically shown in FIG. 8; this does not mean that the electronic device includes only the components shown in FIG. 8.
The communication component 830 is configured to facilitate wired or wireless communication between the electronic device and other devices. The electronic device may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 830 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 830 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power component 840 supplies power to the various components of the electronic device. The power component 840 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device.
The audio component 850 is configured to output and/or input audio signals. For example, the audio component 850 includes a microphone (MIC) that is configured to receive external audio signals when the electronic device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. A received audio signal may further be stored in the memory 810 or transmitted via the communication component 830. In some embodiments, the audio component 850 also includes a speaker for outputting audio signals.
The display 860 includes a screen, which may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation.
Those of ordinary skill in the art will appreciate that all or some of the steps of the above method embodiments may be accomplished by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

  1. A method for processing a human-machine session, comprising:
    after a device completes a previous voice command, recognizing the content of the previous voice command and determining whether the user needs to input a voice command again based on the previous voice command;
    if it is determined that the user needs to input a voice command again, starting voice activity detection; otherwise, ending the current session.
  2. The method according to claim 1, further comprising:
    after the voice activity detection is started, if no voice signal is detected within a specified detection time, ending the current session; otherwise, performing automatic speech recognition on the detected voice signal.
  3. The method according to claim 2, further comprising:
    counting a first average time, over the user's sessions, from a successful wake-up of the device to the user issuing a voice command;
    counting a second average time, over the user's sessions, from the start of the voice activity detection to the user issuing a voice command;
    calculating the specified detection time according to the first average time and the second average time.
  4. The method according to claim 3, wherein calculating the specified detection time according to the first average time and the second average time comprises:
    calculating the specified detection time T4 according to T4 = T3 + (T2 - T3) / 2, wherein T2 is the sum of the first average time T1 and a preset redundancy time, and T3 is the second average time.
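The timing rule of claims 3 and 4 can be made concrete with a short sketch. The 0.5-second default redundancy time and the sample averages below are illustrative assumptions only; the application does not fix any of these values.

```python
def specified_detection_time(t1, t3, redundancy=0.5):
    """Claim 4's rule: T2 = T1 + redundancy, then T4 = T3 + (T2 - T3) / 2.

    t1: first average time (device wake-up success to voice command), seconds
    t3: second average time (VAD start to voice command), seconds
    redundancy: preset redundancy time, seconds (0.5 is an assumed default)
    """
    t2 = t1 + redundancy
    return t3 + (t2 - t3) / 2

# With assumed averages T1 = 2.0 s and T3 = 1.0 s:
# T2 = 2.5 s, so T4 = 1.0 + (2.5 - 1.0) / 2 = 1.75 s
print(specified_detection_time(2.0, 1.0))  # → 1.75
```

The midpoint between T3 and T2 gives the user slightly longer than the observed VAD-to-speech average to start speaking, without waiting for the full wake-up-to-speech interval.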
  5. The method according to claim 2, further comprising:
    after the automatic speech recognition is performed on the detected voice signal, if no text content is recognized, ending the current session; otherwise, performing semantic parsing on the recognized text content.
  6. The method according to claim 5, further comprising:
    after the semantic parsing is performed on the recognized text content, if the parsed semantics do not fall within any preset domain, or the parsed semantics explicitly indicate ending the current session, ending the current session; otherwise, generating a voice instruction according to the domain within which the parsed semantics fall, and controlling a corresponding device to perform an operation according to the voice instruction.
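Claim 6's branching can be sketched as a small decision function. The domain set and the string return values are hypothetical placeholders, not part of the claimed method:

```python
PRESET_DOMAINS = {"music", "weather", "alarm"}  # assumed example domains

def next_action(parsed_domain, explicit_end):
    """End the session when the parsed semantics fall outside every preset
    domain or explicitly request ending it; otherwise generate a command
    for the matched domain (claim 6's control flow)."""
    if explicit_end or parsed_domain not in PRESET_DOMAINS:
        return "end_session"
    return f"execute:{parsed_domain}"
```

For example, parsed semantics in the "music" domain yield `"execute:music"`, while chit-chat outside all preset domains ends the session.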
  7. A method for processing a human-machine session, comprising:
    recognizing the content of a received voice command;
    determining whether the user needs to input a voice command again;
    performing a human-machine session operation according to the determination result.
  8. The method according to claim 7, wherein performing the human-machine session operation according to the determination result comprises:
    if it is determined that the user needs to input a voice command again, starting voice activity detection after the device completes the current voice command; otherwise, ending the current session after the device completes the current voice command.
  9. The method according to claim 7, wherein determining whether the user needs to input a voice command again comprises:
    calculating the probability that the user will input a voice command again;
    if the probability is greater than a preset probability threshold, determining that the user needs to input a voice command again; otherwise, determining that the user does not need to input a voice command again.
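Claim 9's threshold test reduces to a one-line comparison. The 0.6 default threshold is an assumed value for illustration; the application leaves the threshold as a preset parameter:

```python
def user_needs_reinput(probability, threshold=0.6):
    """Claim 9's decision: affirm the need for another voice command only
    when the computed probability strictly exceeds the preset threshold
    (0.6 is an assumed default, not a value fixed by the application)."""
    return probability > threshold
```

Note the strict inequality: a probability exactly equal to the threshold does not trigger another listening round under this reading.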
  10. An apparatus for processing a human-machine session, comprising:
    an instruction recognition module, configured to recognize, after a device completes a previous voice command, the content of the previous voice command and determine whether the user needs to input a voice command again based on the previous voice command;
    a voice detection module, configured to start voice activity detection if it is determined that the user needs to input a voice command again, and otherwise end the current session.
  11. An apparatus for processing a human-machine session, comprising:
    a content recognition module, configured to recognize the content of a received voice command;
    a need determination module, configured to determine whether the user needs to input a voice command again;
    an operation execution module, configured to perform a human-machine session operation according to the determination result.
  12. An electronic device, comprising:
    a memory, configured to store a program; and
    a processor, coupled to the memory and configured to execute the program in order to:
    after the device completes a previous voice command, recognize the content of the previous voice command and determine whether the user needs to input a voice command again based on the previous voice command; and
    if it is determined that the user needs to input a voice command again, start voice activity detection; otherwise, end the current session.
  13. An electronic device, comprising:
    a memory, configured to store a program; and
    a processor, coupled to the memory and configured to execute the program in order to:
    recognize the content of a received voice command;
    determine whether the user needs to input a voice command again; and
    perform a human-machine session operation according to the determination result.
PCT/CN2018/093225 2017-07-04 2018-06-28 Human-machine conversation processing method and apparatus, and electronic device WO2019007247A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710539395.6 2017-07-04
CN201710539395.6A CN109215642A (en) 2017-07-04 2017-07-04 Processing method, device and the electronic equipment of man-machine conversation

Publications (1)

Publication Number Publication Date
WO2019007247A1

Family

ID=64949726

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/093225 WO2019007247A1 (en) 2017-07-04 2018-06-28 Human-machine conversation processing method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN109215642A (en)
WO (1) WO2019007247A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110012166B (en) * 2019-03-31 2021-02-19 联想(北京)有限公司 Information processing method and device
CN110223697B (en) 2019-06-13 2022-04-22 思必驰科技股份有限公司 Man-machine conversation method and system
CN110619873A (en) * 2019-08-16 2019-12-27 北京小米移动软件有限公司 Audio processing method, device and storage medium
CN111192597A (en) * 2019-12-27 2020-05-22 浪潮金融信息技术有限公司 Processing method of continuous voice conversation in noisy environment

Citations (5)

Publication number Priority date Publication date Assignee Title
US20140278435A1 (en) * 2013-03-12 2014-09-18 Nuance Communications, Inc. Methods and apparatus for detecting a voice command
CN104505093A (en) * 2014-12-16 2015-04-08 佛山市顺德区美的电热电器制造有限公司 Household appliance and voice interaction method thereof
CN106205612A (en) * 2016-07-08 2016-12-07 北京光年无限科技有限公司 Information processing method and system towards intelligent robot
CN106205615A (en) * 2016-08-26 2016-12-07 王峥嵘 A kind of control method based on interactive voice and system
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
CN102880649B (en) * 2012-08-27 2016-03-02 北京搜狗信息服务有限公司 A kind of customized information disposal route and system
US9640183B2 (en) * 2014-04-07 2017-05-02 Samsung Electronics Co., Ltd. Speech recognition using electronic device and server
US10614799B2 (en) * 2014-11-26 2020-04-07 Voicebox Technologies Corporation System and method of providing intent predictions for an utterance prior to a system detection of an end of the utterance
CN104539516B (en) * 2014-12-18 2018-07-06 北京奇虎科技有限公司 A kind of customer service method and a kind of customer care server
CN104699784B (en) * 2015-03-13 2017-12-19 苏州思必驰信息科技有限公司 A kind of data search method and device based on interactive mode input
CN105159996B (en) * 2015-09-07 2018-09-07 百度在线网络技术(北京)有限公司 Depth question and answer service providing method based on artificial intelligence and device
CN105912111B (en) * 2016-04-06 2018-11-09 北京地平线机器人技术研发有限公司 The method and speech recognition equipment of end voice dialogue in human-computer interaction
CN106250474B (en) * 2016-07-29 2020-06-23 Tcl科技集团股份有限公司 Voice control processing method and system


Also Published As

Publication number Publication date
CN109215642A (en) 2019-01-15


Legal Events

121 (EP): the EPO has been informed by WIPO that EP was designated in this application (ref document number: 18828551; country of ref document: EP; kind code of ref document: A1)
NENP: non-entry into the national phase (ref country code: DE)
122 (EP): PCT application non-entry in European phase (ref document number: 18828551; country of ref document: EP; kind code of ref document: A1)