WO2023093280A1 - Voice control method and apparatus, electronic device, and storage medium - Google Patents

Voice control method and apparatus, electronic device, and storage medium

Info

Publication number
WO2023093280A1
WO2023093280A1 PCT/CN2022/121695 CN2022121695W WO2023093280A1 WO 2023093280 A1 WO2023093280 A1 WO 2023093280A1 CN 2022121695 W CN2022121695 W CN 2022121695W WO 2023093280 A1 WO2023093280 A1 WO 2023093280A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
voice
target
trigger condition
information
Prior art date
Application number
PCT/CN2022/121695
Other languages
English (en)
French (fr)
Inventor
陈科鑫
冉茂松
张晓帆
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Publication of WO2023093280A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72433 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for voice messaging, e.g. dictaphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72448 User interfaces specially adapted for cordless or mobile telephones with means for adapting the functionality of the device according to specific conditions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72484 User interfaces specially adapted for cordless or mobile telephones wherein functions are triggered by incoming communication events
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2201/00 Electronic components, circuits, software, systems or apparatus used in telephone systems
    • H04M2201/42 Graphical user interfaces

Definitions

  • the present application relates to the technical field of electronic devices and, more specifically, to a voice control method and apparatus, an electronic device, and a storage medium.
  • voice recognition and natural language processing technologies can be combined so that an electronic device can receive a user's voice commands through the auditory channel and complete the corresponding interactive tasks.
  • the user can complete interface interaction operations through voice input.
  • the user may need a corresponding interface operation to be performed only when a certain condition is met, but related technologies cannot handle this type of non-immediately triggered instruction well, which degrades the user experience.
  • the present application proposes a voice control method, device, electronic equipment and storage medium.
  • the embodiment of the present application provides a voice control method, the method comprising: acquiring a voice command; identifying the command type of the voice command; when the command type is a non-immediate type, acquiring the trigger condition information and target execution information corresponding to the voice command; and, when the trigger condition corresponding to the trigger condition information is satisfied, executing the target operation on the target operation interface corresponding to the target execution information.
  • the embodiment of the present application provides a voice control device, the device including: an instruction acquisition module, an information acquisition module, and an operation execution module, wherein the instruction acquisition module is used to acquire voice instructions; the information acquisition module is used to obtain the trigger condition information and target execution information corresponding to the voice command when the command type of the voice command is non-immediate; and the operation execution module is used to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is satisfied.
  • the embodiment of the present application provides an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to execute the voice control method provided in the first aspect above.
  • the embodiment of the present application provides a computer-readable storage medium, where program code is stored in the computer-readable storage medium, and the program code can be invoked by a processor to execute the voice control method provided in the first aspect above.
  • an embodiment of the present application provides a computer program product, including computer programs/instructions, wherein the computer program/instructions implement the voice control method provided in the first aspect when executed by a processor.
  • FIG. 1 shows a schematic diagram of a scenario provided by an embodiment of the present application.
  • FIG. 2 shows a schematic diagram of another scenario provided by the embodiment of the present application.
  • FIG. 3 shows a schematic diagram of an application environment provided by an embodiment of the present application.
  • FIG. 4 shows another schematic diagram of the application environment provided by the embodiment of the present application.
  • Fig. 5 shows a flowchart of a voice control method according to an embodiment of the present application.
  • FIG. 6 shows another schematic diagram of a scenario provided by an embodiment of the present application.
  • Fig. 7 shows a flowchart of a voice control method according to another embodiment of the present application.
  • Fig. 8 shows a flowchart of a voice control method according to another embodiment of the present application.
  • Fig. 9 shows a flowchart of a voice control method according to yet another embodiment of the present application.
  • FIG. 10 shows a schematic diagram of the principle of the instruction recognition model provided by the embodiment of the present application.
  • Fig. 11 shows a flowchart of a voice control method according to yet another embodiment of the present application.
  • FIG. 12 shows a schematic structural diagram of an instruction type identification model provided by an embodiment of the present application.
  • Fig. 13 shows a flowchart of a voice control method according to still another embodiment of the present application.
  • Fig. 14 shows a block diagram of a voice control device according to an embodiment of the present application.
  • Fig. 15 is a block diagram of an electronic device for executing the voice control method according to the embodiment of the present application.
  • Fig. 16 is a storage unit for storing or carrying program code for realizing the voice control method according to the embodiment of the present application.
  • GUI: graphical user interface; UI: user interface.
  • existing voice GUI (VGUI) solutions focus on user instructions that are executed immediately, and cannot support non-immediate instructions well.
  • as shown in Figure 1, in one voice control scenario the user inputs "send a text message to mom" by voice, and the instruction is executed immediately. In the voice control scenario shown in Figure 2, the user inputs "when I get home, send a text message to my mom" by voice; here the user has added a trigger condition, "when I get home", which turns the instruction into a non-immediate voice-controlled graphical-interface instruction that should be triggered only when the condition is met.
  • in related technologies, however, the smart terminal may not wait for the trigger condition to be met but instead executes the instruction immediately, so non-immediate voice commands cannot be realized well, which inconveniences the user.
  • in view of this, the inventors propose the voice control method, apparatus, electronic device, and storage medium provided by the embodiments of the present application, which recognize the non-immediate voice commands input by the user, identify the trigger condition information, and then perform the corresponding interface operations according to that information, so that non-immediate voice commands can be completed better, improving user experience.
  • the specific voice control method will be described in detail in the subsequent embodiments.
  • the voice control method provided in the embodiment of the present application may be executed by an electronic device.
  • all the steps in the voice control method provided in the embodiment of the present application may be executed by the electronic device.
  • voice commands can be collected by the voice collection device of the electronic device 100, and then the collected voice commands and the current user interface are transmitted to the processor, so that the processor recognizes the command type of the voice command and then executes the steps involved in the voice control method provided by the present application according to the recognized instruction type.
  • the voice control method provided in the embodiment of the present application may also be executed by a server (cloud).
  • the electronic device can collect voice commands and send the collected voice commands and the current user interface to the server synchronously; the server then recognizes the voice command and triggers the electronic device to execute the target operation.
  • for example, the electronic device 100 may obtain a voice command and pass it to the server 200, which identifies the command type of the voice command; when the command type is non-immediate, the server identifies the target execution information corresponding to the voice command and the trigger condition information corresponding to the target execution information, and returns them to the electronic device 100; the electronic device 100 then executes the target operation on the target interface corresponding to the target execution information according to the trigger condition information.
  • the steps performed by the electronic device and the server respectively are not limited to the method described in the above examples.
  • the steps performed by the electronic device and by the server, respectively, can be adjusted dynamically according to the actual situation.
  • the electronic device 100 can also be an in-vehicle device, a wearable device, a tablet computer, a notebook computer, a smart speaker, and the like.
  • the server 120 may be an independent physical server, a server cluster or a distributed system formed by multiple physical servers, or a cloud server, etc., which is not limited herein.
  • FIG. 5 shows a schematic flowchart of a voice control method provided by an embodiment of the present application.
  • the voice control method is applied to a voice control device 400 as shown in FIG. 14 and an electronic device 100 ( FIG. 15 ) configured with the voice control device 400 .
  • the following takes an electronic device as an example to illustrate the specific process of this embodiment.
  • the electronic device used in this embodiment can be a smartphone, a tablet computer, a smart watch, smart glasses, a notebook computer, etc., which is not limited here.
  • the flow shown in Figure 5 will be described in detail below, and the voice control method may specifically include the following steps:
  • Step S110 Acquiring voice instructions.
  • the user can express his control intention by inputting voice into the electronic device.
  • the electronic device may use the voice uttered by the user as a voice instruction.
  • the electronic device can collect the voice input by the user through the audio collection device, so as to obtain the voice instruction.
  • the audio collection device is used for audio signal collection.
  • the audio collection device may include one or more audio collection devices, and the audio collection device may be a microphone.
  • the electronic device can detect the voice command input by the user when the voice control graphical interface function is turned on, and execute the steps involved in the voice control method provided in the present application according to the detected voice command.
  • the electronic device can collect voice instructions input by the user through the voice collection device. For example, when the voice assistant of the electronic device is turned on, the user inputs the voice command "Xiaoou, help me open the NFC access control when I go home", and the electronic device can collect the voice command.
  • when the electronic device detects a trigger operation for voice control, it may collect the voice input by the user, so as to obtain the voice instruction input by the user.
  • the screen of the electronic device may display a control for controlling the graphical interface by voice.
  • when a trigger operation on the control is detected, voice collection may be started in response to the trigger operation, and the voice collection device may collect the voice commands input by the user. The trigger operation may be a click operation, a press operation, a slide operation, etc., which is not limited here.
  • the electronic device may also collect the voice input by the user when detecting the operation of the specified physical key, so as to obtain the voice instruction input by the user.
  • the specific manner for the electronic device to obtain the voice instruction may not be limited.
  • Step S120 when the instruction type of the voice instruction is non-immediate, acquire trigger condition information and target execution information corresponding to the voice instruction.
  • after the electronic device acquires the voice command, it can identify the command type of the voice command, so as to recognize the cases in which the user requires non-immediate interface control and then accurately realize the control the user requires.
  • the instruction types of the voice instructions can include immediate types and non-immediate types.
  • the immediate type refers to the type of instruction that needs to be executed immediately after the electronic device obtains the voice instruction; the non-immediate type indicates that the electronic device does not execute the instruction immediately, but only when the corresponding condition is met.
  • the electronic device can recognize the acquired voice instruction to obtain the text content corresponding to the voice instruction. After the text content corresponding to the voice command is obtained, semantic recognition may be performed on the text content in a pre-configured manner to identify whether the command type of the voice command is the immediate type or the non-immediate type. The electronic device can convert the voice instruction into the corresponding text content based on a pre-configured automatic speech recognition (ASR) method.
  • the electronic device may recognize the text content corresponding to the voice command based on a pre-trained command type recognition model, so as to obtain the command type corresponding to the voice command.
  • the instruction type recognition model can be trained based on text sample data marked with instruction types.
  • after the electronic device obtains the text content corresponding to the voice command, it can also perform word segmentation on the text content and obtain the keywords in the text content from the segmentation result; the electronic device then matches the recognized keywords against preset keywords, where the preset keywords are those corresponding to preset non-immediate voice commands. If any recognized keyword matches a preset keyword, the instruction type of the voice instruction is determined to be the non-immediate type; if no recognized keyword matches a preset keyword, the instruction type of the voice instruction is determined to be the immediate type.
  • the preset keywords include keywords related to trigger conditions, for example, "at", "when", "if", "once", and the like.
  • for example, if the text content corresponding to the voice command input by the user is "when the battery level is 10%, turn off the mobile data network", the keyword "when" in the text content matches a preset keyword, so the command type of the voice command is determined to be the non-immediate type, as sketched below.
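As an illustration of this keyword-matching heuristic, here is a minimal Python sketch; the keyword set and tokenization are assumptions for illustration, not part of the patent:

```python
import re

# Illustrative preset keywords related to trigger conditions (assumed set;
# the patent lists condition words such as "at", "when", "if").
TRIGGER_KEYWORDS = {"at", "when", "if", "once", "after"}

def is_non_immediate(text: str) -> bool:
    """Return True if any word of the instruction matches a preset keyword."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude word segmentation
    return any(token in TRIGGER_KEYWORDS for token in tokens)

# "when" matches a preset keyword, so the command type is non-immediate:
print(is_non_immediate("When the battery level is 10%, turn off the mobile data network"))  # True
print(is_non_immediate("Turn off the mobile data network"))  # False
```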
  • the specific manner of identifying the instruction type of the voice instruction by the electronic device may not be limited.
  • the electronic device may execute subsequent voice control steps according to the instruction type of the voice instruction.
  • the electronic device can determine whether the command type of the voice command is the non-immediate type; if it is, the electronic device can identify the trigger condition information and target execution information corresponding to the voice command, so as to complete the voice control required by the user according to the trigger condition information and the target execution information.
  • the target execution information can be understood as the control information obtained after the electronic device converts the voice command, which is used to represent the user's control intention on the interface;
  • the trigger condition information can be understood as the execution condition corresponding to the target execution information, that is, the condition under which the operation corresponding to the control information is executed, so as to complete the control required by the user.
  • the electronic device may obtain the instruction text corresponding to the voice instruction, and then obtain the trigger condition information and target execution information included in the instruction text.
  • the electronic device may perform semantic recognition on the text content converted from the voice command in a pre-configured manner, and then determine trigger condition information and target execution information according to the semantic recognition result.
  • for example, the control intention, control object, object auxiliary information, and trigger condition in the text content can be extracted based on natural language understanding (NLU) and integrated into a quadruple of the form {action, object, information, condition}; the result of semantic recognition is then this quadruple.
  • action represents the control intention, or can be understood as the control purpose
  • object represents the control object
  • information represents the object's auxiliary information
  • condition represents the trigger condition.
  • the control intention, the control object, and object attachment information can be used as the target execution information
  • the trigger condition is the above-mentioned trigger condition information.
  • for example, the text content obtained by converting the voice command is "when mom calls, reply to mom with a text message saying I am temporarily unavailable".
  • the user's intention is “send a text message”
  • the control object is "mother”
  • the object's auxiliary information is “temporarily unavailable”
  • the control condition is "when mother calls”
  • the four-tuple is recorded as: ⁇ (send text messages), (mom), (temporarily unavailable), (when mom calls) ⁇ .
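One possible in-memory representation of the {action, object, information, condition} quadruple, as a hedged Python sketch (the field names are illustrative; the patent only specifies the four slots):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedCommand:
    action: str                 # control intention, e.g. "send text message"
    obj: str                    # control object, e.g. "mom"
    information: Optional[str]  # object auxiliary information, e.g. "temporarily unavailable"
    condition: Optional[str]    # trigger condition; None for immediate commands

cmd = ParsedCommand(
    action="send text message",
    obj="mom",
    information="temporarily unavailable",
    condition="when mom calls",
)
# A command is non-immediate exactly when it carries a trigger condition.
assert cmd.condition is not None
```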
  • the electronic device can also recognize the text content corresponding to the voice command through a pre-trained command recognition model, so as to obtain the target execution information corresponding to the voice command and the trigger condition information corresponding to the target execution information.
  • the instruction recognition model can be trained based on text samples pre-marked with trigger condition information and target execution information.
  • the specific manner of the electronic device specifically recognizing the target execution information and the trigger condition information corresponding to the voice instruction may not be limited.
  • Step S130 When the trigger condition corresponding to the trigger condition information is satisfied, execute the target operation on the target operation interface corresponding to the target execution information.
  • when the above-mentioned voice command is of the non-immediate type and the electronic device has recognized the trigger condition information and target execution information corresponding to the voice command, the electronic device can execute the target execution information according to the trigger condition information.
  • the electronic device may execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is satisfied.
  • for example, the electronic device can determine from positioning information whether the device location is the home location and, when it is, set the switch control corresponding to the flight mode to the on state in the device mode setting interface.
  • for another example, if the obtained quadruple is {(send text message), (mom), (temporarily unavailable), (when mom calls)}, then when the electronic device receives a call from mom, it can switch to the sending interface of the SMS application, write "Mom" in the recipient edit box, write "temporarily unavailable" in the text edit box, and then send the SMS.
  • when the electronic device executes the target operation on the target operation interface corresponding to the target execution information, it can generate the control instructions for the target interface through system injection (an operation mode supported by Android) or by simulating screen clicks. For example, when the trigger condition information is satisfied, a user click can be simulated to switch to the target interface corresponding to the target execution information, and another click can be simulated to actuate the target control corresponding to the target execution information, so that the target operation is executed on the target interface, as sketched below.
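To make the simulated-click execution concrete, here is a sketch (building on the ParsedCommand dataclass above) that composes the target operation as a sequence of simulated UI actions; the action vocabulary and element names are assumptions, and a real implementation would drive Android system injection or accessibility APIs:

```python
from dataclasses import dataclass

@dataclass
class UiAction:
    kind: str       # "open", "type", or "tap"
    target: str     # interface or control identifier (illustrative names)
    text: str = ""  # text payload for "type" actions

def build_actions(cmd: ParsedCommand) -> list[UiAction]:
    # Hard-coded for the SMS example above; a real system would derive this
    # from the interface operable elements matched to the target execution info.
    return [
        UiAction("open", "sms.send_interface"),
        UiAction("type", "recipient_edit_box", cmd.obj),
        UiAction("type", "text_edit_box", cmd.information or ""),
        UiAction("tap", "send_control"),
    ]
```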
  • when the command type of the voice command is the immediate type, the target operation information corresponding to the voice instruction may be identified, so as to complete the real-time voice control required by the user.
  • recognizing the target operation information corresponding to the voice command may mean recognizing the target execution information corresponding to the voice command, and the manner of doing so may refer to the manner of recognizing the target execution information in the foregoing embodiments, which is not repeated here. It should be noted that, when the command type of the voice command is the immediate type, real-time voice control may be performed on the current user interface or on other interfaces of the electronic device.
  • after the electronic device recognizes the target operation information, it can match the target operation information with the interface operable elements, so as to obtain the matched operable elements in the interface and execute the actions corresponding to those elements in the interface.
  • real-time voice control may be performed on the current user interface.
  • the user may speak casually out of personal pronunciation habit, and the voice instruction corresponding to such casual speech may not allow the electronic device to accurately determine the user's control intention.
  • for example, the spoken phrase "next one" may be ambiguous: it may mean playing the next item, such as the next song, or it may mean downloading one, such as downloading an application.
  • therefore, the target operation information corresponding to the voice command can be updated according to the task scene corresponding to the current user interface to obtain a scene control command; the scene control command is then matched with the interface operable elements of the current user interface, so as to determine the target operable elements from them.
  • for example, the target operation information "next music" can be updated to "play the next song", and the target operation information "download a music" can be updated to "download a music player app", as sketched below.
  • in this way, the voice control method provided by the embodiments of the present application can recognize the command type of the voice command input by the user and, for a non-immediate command, perform the corresponding interface operations according to the trigger condition information, so that non-immediate voice commands can be completed better, improving user experience.
  • FIG. 7 shows a schematic flowchart of a voice control method provided by another embodiment of the present application.
  • the voice control method is applied to the above-mentioned electronic equipment, and the flow shown in FIG. 7 will be described in detail below.
  • the voice control method may specifically include the following steps:
  • Step S210 Acquiring voice instructions.
  • Step S220 when the instruction type of the voice instruction is non-immediate, acquire trigger condition information and target execution information corresponding to the voice instruction.
  • for step S210 and step S220, reference may be made to the contents of other embodiments, and details are not repeated here.
  • Step S230 When the trigger condition corresponding to the trigger condition information is satisfied, match the target execution information with the interface operable elements to obtain the target operation interface, and execute the target operation on the target operation interface.
  • when the electronic device executes voice control according to the trigger condition information and the target execution information, it can, once the trigger condition corresponding to the trigger condition information is met, match the target execution information with the interface operable elements to obtain the target operation interface, and perform the target operation on the target operation interface.
  • the electronic device may pre-identify operable interface elements of various interfaces in the electronic device.
  • the interface refers to an interface that can be run and displayed by the electronic device, which may include a system interface, an interface corresponding to each installed application program, etc., which is not limited here.
  • the electronic device may match the above-identified target execution information with the pre-identified interface operable elements, take the matched operable elements in the target operation interface as the target operable elements, and then execute the operations corresponding to the target operable elements on the target interface.
  • for example, the SMS sending interface of the short message application includes a recipient edit box, a text edit box, and a send control, so its interface operable elements are those three elements. When the electronic device executes the operations corresponding to the target operable elements on the target interface, it can write "Mom" in the recipient edit box, write "temporarily unavailable" in the text edit box, and then trigger the send control to send the SMS.
  • an interface may include various user-operable interface operable elements.
  • the operable elements of the interface may include a certain control in the interface, or may be aimed at the entire interface. For example, if the user's intention is to slide the page (for example, slide up, down, left, and right), or the intention is to switch the interface, or exit a certain interface, then the operable elements of the interface are the entire interface. For another example, if the user's intention is to click a certain position in the interface, then the operable element of the interface may be for a certain control in the interface. In the case that there may be multiple operations implemented on the interface, there may also be multiple interface operable elements corresponding to the interface.
  • the electronic device may identify the operable elements of an interface using at least one of the following methods: identifying the interface based on code analysis; identifying the interface based on image-text recognition; and identifying the interface based on a control classification model.
  • identifying the interface based on code analysis to obtain the interface operable elements corresponding to the interface can be understood as identifying the components included in the interface through code analysis; the obtained interface operable elements may include the identifiers and description information of the identifiable components. In other words, code analysis yields the components included in the interface and the description information corresponding to those components.
  • the descriptive information may include information such as name, function, and trigger operation.
  • for example, the interface can be identified through code parsing based on the Google accessibility service (AccessibilityService).
  • identifying the interface based on image-text recognition may use OCR (Optical Character Recognition) to recognize the components, controls, icons, etc. in the interface and obtain their description information.
  • for example, the positions of components, controls, and icons in the user interface can be identified through OCR, all such elements in the user interface can then be obtained by traversal, and their description information can be determined by analyzing the image content, as sketched below.
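As an illustration of the OCR route, the following sketch uses pytesseract, one possible OCR library (the patent names the technique, not a tool), to locate text-bearing elements in a UI screenshot:

```python
from PIL import Image
import pytesseract

def ocr_elements(screenshot_path: str) -> list[dict]:
    """Locate text-bearing components/controls in a screenshot via OCR."""
    image = Image.open(screenshot_path)
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    elements = []
    for i, text in enumerate(data["text"]):
        if text.strip():  # keep only boxes that actually contain text
            elements.append({
                "text": text,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return elements
```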
  • the training process of the control classification model includes: obtaining a user interface; obtaining the labeled controls classified from the user interface; and training the neural network to be trained with the classified controls to obtain the control classification model.
  • the electronic device can store the pre-recognized interface operable elements of various interfaces, so that when voice control is executed they can be matched with the target execution information corresponding to the voice command, yielding the operable elements in the corresponding target operation interface as the target operable elements.
  • the electronic device may instead identify the interface operable elements of its various interfaces only when the trigger condition corresponding to the trigger condition information is satisfied, then match the target execution information with the interface operable elements to obtain the target operation interface and execute the target operation on it.
  • for the manner in which the electronic device identifies the operable elements of the interface, reference may be made to the foregoing implementations, and details are not repeated here.
  • the trigger condition information may include a condition field and condition parameters
  • the target execution information may include the execution field and execution parameters.
  • when the electronic device matches the target execution information with the interface operable elements, it can match the execution field and execution parameters with the interface operable elements, so as to obtain the target operable elements in the target operation interface that match the target execution information.
  • the electronic device can determine the corresponding interface according to the execution field, and then match the interface operable elements of that interface according to the execution parameters, so as to obtain the target operable elements matching the target execution information.
  • for example, if the execution field indicates the control interface of the smart air conditioner in the smart home application, and the execution parameter is to lower the temperature of the smart air conditioner, then the matched interface operable element can be determined to be the control for lowering the smart air conditioner's temperature.
  • in this way, dividing the target execution information into the execution field and execution parameters allows the electronic device to match interface operable elements more accurately and improves matching efficiency; dividing the trigger condition information into the condition field and condition parameters allows the electronic device to ensure more reliably that the target operation is performed on the target operation interface when the trigger condition information is met, improving the accuracy of voice control. A matching sketch follows.
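A sketch of matching the execution field and execution parameter against a catalog of pre-identified interface operable elements; the catalog and naming scheme are illustrative assumptions:

```python
# Hypothetical catalog: execution field -> operable elements of that interface.
ELEMENT_CATALOG = {
    "smart_home.ac_control": ["temp_down_control", "temp_up_control", "power_switch"],
    "sms.send_interface": ["recipient_edit_box", "text_edit_box", "send_control"],
}

def match_target(execution_field: str, execution_param: str) -> tuple[str, str]:
    """Return (target operation interface, target operable element)."""
    for element in ELEMENT_CATALOG.get(execution_field, []):
        # Naive token overlap between the spoken parameter and element names.
        if any(token in element for token in execution_param.lower().split()):
            return execution_field, element
    raise LookupError(f"no operable element matches {execution_param!r}")

print(match_target("smart_home.ac_control", "lower the temp"))
# ('smart_home.ac_control', 'temp_down_control')
```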
  • in the voice control method provided by this embodiment, the target execution information is matched with the operable elements of the interface, so that the interface operation required by the user can be determined quickly and accurately and then executed, allowing non-immediate voice commands to be completed better and improving user experience.
  • FIG. 8 shows a schematic flowchart of a voice control method provided in another embodiment of the present application.
  • the voice control method is applied to the above-mentioned electronic equipment, and the flow shown in FIG. 8 will be described in detail below.
  • the voice control method may specifically include the following steps:
  • Step S310 Acquiring voice instructions.
  • Step S320 when the instruction type of the voice instruction is non-immediate, acquire trigger condition information and target execution information corresponding to the voice instruction.
  • for step S310 and step S320, reference may be made to the contents of other embodiments, and details are not repeated here.
  • Step S330 Match the target execution information with interface operable elements to obtain the target operation interface.
  • after the electronic device obtains the trigger condition information and target execution information corresponding to the voice command, it can match the target execution information with the operable elements of the interface to obtain the target operation interface, so that when the trigger condition corresponding to the trigger condition information is satisfied, the target operation is executed on the target operation interface corresponding to the target execution information.
  • the manner in which the electronic device matches the target execution information with the operable elements of the interface can refer to the content of the previous embodiment, which will not be repeated here.
  • Step S340 Generate corresponding control instructions according to the trigger condition information and the target operation interface.
  • after the electronic device recognizes the trigger condition information and the target operation interface, it can synthesize the control instructions from the trigger condition information and the target operation interface, and pass the synthesized control instructions to the graphical interface for execution.
  • the electronic device can generate corresponding control instructions according to the trigger condition information and the matched interface operable elements in the target operation interface.
  • for example, an IFTTT (if-this-then-that) style instruction generation method may be used to generate the corresponding control instructions from the trigger condition information and the target operation interface.
  • Step S350 Execute the control instruction; the control instruction is used to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
  • since the above-mentioned control instruction is a non-immediate instruction, it resides in the application background together with its trigger condition information and the interface operable elements of the target operation interface. The state of the trigger condition information is monitored in real time, for example by monitoring the actual state of the condition field and condition parameters described in the foregoing embodiments, while the corresponding execution module (which executes the target execution information) stays in a standby state in memory. When it is detected that the trigger condition information is satisfied, the execution module immediately executes the operations corresponding to the operable elements of the target operation interface, completing the non-immediate voice control process, as sketched below.
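The background monitor-and-execute pattern can be sketched as follows; check_condition and execute_target are assumed callbacks, and a production system would subscribe to system events (location, battery, incoming calls) rather than poll:

```python
import time
from typing import Callable

def run_non_immediate(check_condition: Callable[[], bool],
                      execute_target: Callable[[], None],
                      poll_seconds: float = 1.0) -> None:
    """Keep the execution module on standby until the trigger condition holds,
    then immediately execute the target operation."""
    while not check_condition():
        time.sleep(poll_seconds)  # trigger condition not met yet
    execute_target()

# e.g. run_non_immediate(lambda: battery_level() <= 10, turn_off_mobile_data)
# where battery_level / turn_off_mobile_data are assumed platform hooks.
```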
  • the voice control method provided by this embodiment, after identifying the trigger condition information and target execution information corresponding to the non-immediate voice command input by the user, matches the target execution information with the operable elements of the interface, so as to determine the interface operation required by the user quickly and accurately and then execute it according to the trigger condition information; in this way, non-immediate voice commands can be completed better, improving user experience.
  • FIG. 9 shows a schematic flowchart of a voice control method provided in another embodiment of the present application.
  • the voice control method is applied to the above-mentioned electronic equipment, and the flow shown in FIG. 9 will be described in detail below.
  • the voice control method may specifically include the following steps:
  • Step S410 Obtain voice instructions.
  • Step S420 When the instruction type of the voice instruction is non-immediate, acquire the instruction text corresponding to the voice instruction.
  • Step S430 Input the instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain the trigger condition information and target execution information contained in the instruction text; the instruction recognition model is trained based on hierarchical reinforcement learning.
  • when the electronic device determines that the instruction type of the voice instruction is non-immediate and recognizes the trigger condition information and target execution information corresponding to the voice instruction, it can input the instruction text corresponding to the voice instruction into the instruction recognition model trained in advance based on hierarchical reinforcement learning, obtaining the trigger condition information and target execution information contained in the instruction text. Since recognition involves several tasks covering trigger condition information and target execution information, training the instruction recognition model with hierarchical reinforcement learning can improve recognition accuracy.
  • the instruction recognition model includes a first submodule corresponding to the trigger condition information, a second submodule corresponding to the target execution information, and a cooperative control module. The first submodule handles the task of recognizing the trigger condition information, and the second submodule handles the task of recognizing the target execution information; the cooperative control module decides the actions of the recognition tasks, specifically deciding the execution order of the submodules' work tasks and assigning work tasks to the submodules. In this way, the hierarchical reinforcement learning model can recognize the trigger condition information and target execution information more accurately.
  • the training process of the instruction recognition model includes: creating a first submodule for the task of recognizing trigger condition information, a second submodule for the task of recognizing target execution information, and a cooperative control module for coordinating the recognition tasks; the first submodule and the second submodule decide the actions of their respective recognition tasks, and the decision priority of the cooperative control module is higher than that of the first and second submodules. Based on text samples annotated with trigger condition information, target execution information, and the recognition order of the recognition tasks, deep reinforcement learning training is performed on the cooperative control module, the first submodule, and the second submodule to obtain the trained instruction recognition model.
  • specifically, this may include: inputting a text sample into the cooperative control module, the first submodule, and the second submodule to obtain the output results of the two submodules and the execution order coordinated by the cooperative control module; determining the first reward corresponding to the first submodule based on its output result and the trigger condition information annotated in the text sample; determining the second reward corresponding to the second submodule based on its output result and the annotated target execution information; determining the third reward corresponding to the cooperative control module based on the coordinated execution order and the annotated recognition order; and then performing deep reinforcement learning training on the first submodule based on the first reward, on the second submodule based on the second reward, and on the cooperative control module based on the third reward.
  • the algorithm used for the reinforcement learning training is not limited; for example, it can be the Advantage Actor-Critic (A2C) algorithm, the Asynchronous Advantage Actor-Critic (A3C) algorithm, or the Deep Q-Network (DQN) algorithm.
  • with the trigger condition information including the condition field and condition parameter, and the target execution information including the execution field and execution parameter, the condition field, condition parameter, execution field, and execution parameter are defined as four slots.
  • the instruction recognition model is shown in Figure 10, where Agent refers to the agent in the concept of reinforcement learning, which needs to be trained through training data.
  • in the instruction recognition model there is one top-level Agent and four bottom-level Agents, each bottom-level Agent being responsible for recognizing one of the four slots to be recognized; the top-level Agent judges, according to its own state, which bottom-level Agent should be selected for recognition, that is, it controls the working order of the bottom-level Agents.
  • the top-level Agent can be understood as the above-mentioned coordination control module
  • the bottom-level Agents can be understood as the above-mentioned first submodule and second submodule, with the first submodule and the second submodule each corresponding to two of the bottom-level Agents.
  • the state of the top-level Agent reflects its current work progress. Since the top-level Agent is responsible for assigning work to the bottom-level Agents, its state can be expressed as s_H = (s_t1, s_t2, s_t3, s_t4), where s_H represents the state of the top-level Agent and s_t1, s_t2, s_t3, s_t4 represent the states of the four bottom-level Agents. The state of a bottom-level Agent stores the confidence of the slot recognition it is responsible for, the recognized content, the Actions (the execution steps of a reinforcement learning Agent) taken in its history, and so on.
  • the training goal of a bottom-level Agent is to recognize the information it is responsible for (condition field, condition parameter, execution field, or execution parameter) as accurately as possible. Therefore, during training, the training Reward (the reinforcement learning feedback on the Agent's execution result) of a bottom-level Agent is defined piecewise: when the execution state s_t of the bottom-level Agent equals the information g_t annotated in the sample text, the Reward is 1; when the execution state is empty, the Reward is -p, where p is a configured empty penalty; in all other cases, the Reward is -1.
  • each time the underlying Agent takes an action it will calculate a training Reward.
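Transcribed directly into code, the bottom-level Agent's piecewise Reward looks like this (a sketch of the definition above; the empty penalty p is a configured constant):

```python
from typing import Optional

def bottom_agent_reward(state: Optional[str], gold: str, empty_penalty: float) -> float:
    """Reward = 1 when the execution state matches the annotated information,
    -p when the execution state is empty, and -1 otherwise."""
    if state == gold:
        return 1.0
    if not state:  # empty execution state
        return -empty_penalty
    return -1.0
```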
  • the training purpose of the top-level Agent is to allocate the execution order of the bottom-level Agent as reasonably and efficiently as possible, and to prioritize the identification of slots containing rich information.
  • the training Reward of the top-level Agent is defined as the sum of the valid accumulated Rewards of the bottom-level Agents up to the i-th step, looking back N steps; a step counts as a valid accumulation only when the bottom-level Agent selected by the top-level Agent at that step is the same as the annotated bottom-level Agent (that is, the execution order matches the annotated execution order); otherwise the Reward of that step is recorded as 0.
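Under the same reading, the top-level Agent's Reward (sum of the bottom-level step Rewards over the last N steps up to step i, with mismatched steps counted as 0) can be sketched as:

```python
def top_agent_reward(chosen: list[str], annotated: list[str],
                     step_rewards: list[float], i: int, n: int) -> float:
    """Sum bottom-level Rewards over the last n steps up to step i, counting a
    step only when the chosen bottom-level Agent equals the annotated one."""
    start = max(0, i - n + 1)
    return sum(
        reward
        for step, reward in enumerate(step_rewards[start:i + 1], start=start)
        if chosen[step] == annotated[step]
    )
```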
  • the instruction recognition model may also include an information specification module. Each time a bottom-level Agent executes, it updates the current slot recognition state, and the information specification module integrates those states. If the recognition result probability of some bottom-level Agent is lower than a set threshold, the information specification module can match the slot information in the recognition result against the interface operable elements to verify whether the result is reasonable. For example, if the result is "set NetEase Cloud Music to vibration mode", matching against the interface operable elements can determine that NetEase Cloud Music is a software application rather than a device terminal, so this operation cannot be performed; the information specification module then marks the recognition result of this step as incomplete and in need of further processing, and feeds it back to the top-level Agent. In this way, recognition accuracy can be improved.
  • Step S440 When the trigger condition corresponding to the trigger condition information is met, execute the target operation on the target operation interface corresponding to the target execution information.
  • for step S440, reference may be made to the contents of other embodiments, which are not repeated here.
  • in the voice control method provided by this embodiment, the instruction recognition model trained in advance with hierarchical reinforcement learning recognizes the corresponding trigger condition information and target execution information; then, when the trigger condition corresponding to the trigger condition information is satisfied, the target operation is executed on the target operation interface corresponding to the target execution information. In this way, the interface operation required by the user can be determined quickly and accurately and executed according to the trigger condition information, so that non-immediate voice commands can be completed better, improving user experience.
  • moreover, since the instruction recognition model is trained based on hierarchical reinforcement learning, the accuracy of recognizing the trigger condition information and the target execution information is ensured, which improves the accuracy of voice control.
  • FIG. 11 shows a schematic flowchart of a voice control method provided in yet another embodiment of the present application.
  • the voice control method is applied to the above-mentioned electronic equipment, and the flow shown in FIG. 11 will be described in detail below.
  • the voice control method may specifically include the following steps:
  • Step S510 Obtain voice instructions.
  • Step S520 Input the instruction text corresponding to the voice instruction into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction; the instruction type recognition model is used to identify the instruction type corresponding to the input instruction text as the immediate type or the non-immediate type.
  • when recognizing the instruction type of the voice instruction, the recognition may be performed based on a pre-trained instruction type recognition model.
  • the text vector of the instruction text corresponding to the voice instruction can be input into the pre-trained instruction type recognition model, so as to obtain the instruction type of the voice instruction.
  • the instruction type recognition model can be a deep learning network for binary classification of text intent (non-immediate type or not) based on the BERT model (deep bidirectional Transformers for language understanding).
  • the instruction type recognition model is composed of an encoding module, a decoding module, and a BERT text classification model, where the BERT model may be a publicly available pre-trained semantic model.
  • after the instruction text is input into the instruction type recognition model, it is first converted by the encoding module into the input format accepted by the BERT text classification model; the encoded text vector is then input into the BERT network for classification; finally, the classification result vector is decoded by the decoding module to obtain the classification result, which is either the immediate type or the non-immediate type. A sketch follows.
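A sketch of such a binary classifier using the Hugging Face transformers library (one possible realization; the patent does not prescribe a framework, the checkpoint name is illustrative, and the classification head would still need fine-tuning on instruction texts labeled immediate/non-immediate):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)  # label 0: immediate, label 1: non-immediate

def classify(instruction_text: str) -> str:
    inputs = tokenizer(instruction_text, return_tensors="pt")  # encoding module
    with torch.no_grad():
        logits = model(**inputs).logits  # BERT text classification
    return "non-immediate" if logits.argmax(-1).item() == 1 else "immediate"  # decoding
```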
  • before the instruction text corresponding to the voice instruction is input into the above instruction type recognition model, the electronic device can also apply preset correction processing to the instruction text; the corrected instruction text is then input into the above instruction type recognition model to obtain the instruction type of the voice instruction.
  • the preset correction processing may include vocabulary calibration based on edit distance, correction of common vocabulary based on a Bayesian method, and so on; the specific correction processing is not limited here. An edit-distance sketch follows.
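For the edit-distance calibration, a standard-library sketch using difflib (the patent names the technique, not a library; the vocabulary is illustrative):

```python
import difflib

VOCABULARY = ["battery", "network", "message", "airplane mode"]  # assumed lexicon

def calibrate(word: str, cutoff: float = 0.8) -> str:
    """Replace a possibly mis-transcribed word with its closest vocabulary entry."""
    matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(calibrate("batery"))  # -> "battery"
```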
  • Step S530 When the instruction type of the voice instruction is non-immediate, acquire trigger condition information and target execution information corresponding to the voice instruction.
  • Step S540 When the trigger condition corresponding to the trigger condition information is met, execute the target operation on the target operation interface corresponding to the target execution information.
  • for step S530 and step S540, reference may be made to the contents of the foregoing embodiments, and details are not repeated here.
  • in the voice control method provided by this embodiment, the pre-trained instruction type recognition model is used to recognize the command type of the voice command; for a non-immediate voice command input by the user, the trigger condition information and target execution information are identified, and the target execution information is executed according to the trigger condition information. In this way, the type of the voice command can be recognized better, and both immediate and non-immediate voice commands can be completed, improving user experience.
  • As shown in FIG. 13, after the electronic device obtains a voice instruction, it parses the voice instruction to obtain the corresponding voice text and then judges whether the instruction type is non-immediate. If the instruction is recognized as non-immediate, the trigger condition information and target execution information are identified through instruction recognition, an instruction is synthesized from the target execution information and the trigger condition information, and the instruction is handed to the graphical interface for execution. If the instruction is recognized as immediate, it is directly matched against the operable elements of the interface to obtain the corresponding target operable elements, which are then synthesized into an instruction and executed by the graphical interface.
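  • The branching just described can be condensed into the following sketch; every function here is an illustrative stub standing in for a pipeline component, not an API defined by this application.

    # All helpers are toy stubs so the flow runs end to end.
    def classify(text: str) -> str:              # instruction type recognition
        return "non-immediate" if "when" in text.lower() else "immediate"

    def recognize(text: str) -> dict:             # trigger condition + target execution info
        return {"condition": "at_home", "action": "send_sms"}

    def match_elements(text: str):                # interface operable element matching
        return ("sms_send_screen", ["recipient_box", "text_box", "send_button"])

    def execute_on_gui(target) -> None:           # simulated click / system injection
        print("executing on", target)

    def handle_voice_instruction(text: str) -> None:
        if classify(text) == "non-immediate":
            info = recognize(text)
            # The synthesized command waits in the background until the condition holds.
            print("deferred until", info["condition"], "->", info["action"])
        else:
            execute_on_gui(match_elements(text))

    handle_voice_instruction("When I get home, text Mom that I'm on my way")
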
  • FIG. 14 shows a structural block diagram of a voice control device 400 provided by an embodiment of the present application.
  • The voice control device 400 is applied to the above-mentioned electronic device, and the voice control device 400 includes: an instruction acquisition module 410, an information acquisition module 420, and an operation execution module 430.
  • the instruction acquisition module 410 is used to acquire voice instructions.
  • the information acquisition module 420 is used to acquire trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-immediate type.
  • the operation execution module 430 is configured to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
  • the operation execution module 430 may be specifically configured to: when the trigger condition corresponding to the trigger condition information is satisfied, match the target execution information with the operable interface elements to obtain the target operation interface, Execute a target operation on the target operation interface.
  • the operation execution module 430 may be specifically configured to: match the target execution information with interface operable elements to obtain the matched interface operable elements in the target operation interface as target operable elements; and execute, on the target interface, the operations corresponding to the target operable elements.
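  • As a toy illustration of this matching step, assume a registry that maps actions to interfaces and the operable elements they expose; the registry, action names, and element names below are all invented for this sketch.

    # Hypothetical registry of interfaces and the operable elements they expose.
    OPERABLE_ELEMENTS = {
        "sms_send_screen": {"send_sms": ["recipient_box", "text_box", "send_button"]},
        "device_settings": {"set_airplane_mode": ["airplane_mode_switch"]},
    }

    def match_operable_element(action: str):
        # Return the first interface exposing elements for the requested action.
        for interface, actions in OPERABLE_ELEMENTS.items():
            if action in actions:
                return interface, actions[action]
        return None, []

    print(match_operable_element("send_sms"))
    # -> ('sms_send_screen', ['recipient_box', 'text_box', 'send_button'])
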
  • the voice control device 400 may also include an interface recognition module.
  • The interface recognition module is used to: before the target operation is executed on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met, match the target execution information with interface operable elements to obtain the target operation interface.
  • the operation execution module 430 may be specifically configured to: generate a corresponding control instruction according to the trigger condition information and the target operation interface; and execute the control instruction, where the control instruction is used to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
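  • A deferred, IFTTT-style control instruction of this kind might be represented as below; the structure and field names are assumptions made for illustration, with the condition split into a domain (which service the trigger belongs to) and a parameter (the concrete triggering value).

    from dataclasses import dataclass, field

    @dataclass
    class ControlInstruction:
        condition_domain: str   # e.g. "location" -- the service the trigger belongs to
        condition_param: str    # e.g. "home" -- the concrete triggering value
        interface: str          # the matched target operation interface
        elements: list = field(default_factory=list)  # operable elements to drive

    def synthesize_control_instruction(condition: dict, interface: str, elements: list):
        # Bundle the trigger condition and target interface into one deferred command
        # that a background monitor can fire once the condition is observed.
        return ControlInstruction(condition["domain"], condition["param"], interface, elements)

    cmd = synthesize_control_instruction({"domain": "location", "param": "home"},
                                         "device_settings", ["airplane_mode_switch"])
    print(cmd)
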
  • the information acquisition module 420 may be specifically configured to: acquire the instruction text corresponding to the voice instruction; and acquire the trigger condition information and target execution information included in the instruction text.
  • the information acquisition module 420 may be specifically configured to: input the instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain the trigger condition information and target execution information contained in the instruction text, where the instruction recognition model is trained based on hierarchical reinforcement learning.
  • the instruction recognition model includes a first submodule corresponding to trigger condition information, a second submodule corresponding to target execution information, and a cooperative control module.
  • the voice control device 400 may also include a model training module.
  • the model training module can be used to: create a first sub-module corresponding to the recognition task of identifying trigger condition information, a second sub-module corresponding to the recognition task of identifying target execution information, and a cooperative control module for coordinating the recognition tasks, where the first sub-module and the second sub-module are used to decide the actions of their corresponding recognition tasks, and the decision-making priority of the cooperative control module is higher than the decision-making priorities of the first sub-module and the second sub-module; and perform deep reinforcement learning training on the cooperative control module, the first sub-module, and the second sub-module based on text samples annotated with trigger condition information, target execution information, and the recognition order of the recognition tasks, to obtain the trained instruction recognition model.
  • the model training module may be specifically configured to: input the text samples into the cooperative control module, the first sub-module, and the second sub-module to obtain the output results of the first sub-module and the second sub-module as well as the execution order coordinated by the cooperative control module; determine the first reward corresponding to the first sub-module based on the output result of the first sub-module and the trigger condition information annotated on the text samples; determine the second reward corresponding to the second sub-module based on the output result of the second sub-module and the target execution information annotated on the text samples; determine the third reward corresponding to the cooperative control module based on the execution order coordinated by the cooperative control module and the recognition order annotated on the text samples; and perform deep reinforcement learning training on the first sub-module based on the first reward, on the second sub-module based on the second reward, and on the cooperative control module based on the third reward, until a preset termination condition is met, to obtain the trained instruction recognition model.
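  • To make the reward structure concrete, here is a hedged sketch of the two reward levels; the empty-slot penalty p and the N-step look-back window are training hyperparameters whose values below are illustrative assumptions.

    def low_level_reward(predicted, gold, p: float = 0.5) -> float:
        # +1 for an exact slot match, -p for leaving the slot empty, -1 otherwise.
        if predicted == gold:
            return 1.0
        if predicted is None:
            return -p
        return -1.0

    def top_level_reward(chosen_agents, gold_order, low_rewards, step: int, n: int = 4) -> float:
        # Accumulate low-level rewards over the last N steps, counting a step only
        # when the scheduled sub-agent matches the annotated recognition order.
        total = 0.0
        for k in range(max(0, step - n + 1), step + 1):
            if chosen_agents[k] == gold_order[k]:
                total += low_rewards[k]
        return total

    # Toy demo: the scheduler matched the annotated order on steps 0 and 2.
    print(top_level_reward([0, 1, 2], [0, 2, 2], [1.0, -1.0, 1.0], step=2))  # -> 2.0
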
  • the voice control device 400 may further include: a type recognition module.
  • The type recognition module is used to: before the trigger condition information and target execution information corresponding to the voice instruction are acquired when the instruction type of the voice instruction is the non-immediate type, input the instruction text corresponding to the voice instruction into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, where the instruction type recognition model is used to identify whether the instruction type corresponding to the input instruction text is an immediate type or a non-immediate type.
  • the type recognition module may also be specifically configured to: perform preset correction processing on the instruction text corresponding to the voice instruction; and input the corrected instruction text into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction.
  • the information acquisition module 420 can also be used to: after the voice instruction is acquired, if the instruction type of the voice instruction is the immediate type, identify the target operation information corresponding to the voice instruction; the operation execution module 430 can also be used to: in response to the target operation information, execute the operation corresponding to the target operation information on the interface corresponding to the target operation information.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application may be integrated into one processing module, each module may exist separately physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or in the form of software function modules.
  • In summary, the solution provided by this application acquires a voice instruction; when the instruction type of the voice instruction is the non-immediate type, it acquires the trigger condition information and target execution information corresponding to the voice instruction; and when the trigger condition corresponding to the trigger condition information is met, it executes the target operation on the target operation interface corresponding to the target execution information.
  • the electronic device 100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, a smart watch, smart glasses, and a notebook computer.
  • The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to perform the methods described in the foregoing method embodiments.
  • Processor 110 may include one or more processing cores.
  • The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and calling the data stored in the memory 120.
  • the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 110 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like.
  • the CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing the displayed content; and the modem handles wireless communication. It can be understood that the above-mentioned modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
  • the memory 120 may include Random Access Memory (RAM) and may also include Read-Only Memory (ROM).
  • the memory 120 may be used to store instructions, programs, codes, sets of codes, or sets of instructions.
  • the memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, and an image playback function), instructions for implementing the following method embodiments, and the like.
  • The data storage area can also store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat record data).
  • FIG. 16 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • Program codes are stored in the computer-readable medium 800, and the program codes can be invoked by a processor to execute the methods described in the foregoing method embodiments.
  • the computer readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 800 has a storage space for program code 810 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • Program code 810 may, for example, be compressed in a suitable form.
  • An embodiment of the present application also provides a computer program product, including computer programs/instructions, wherein, when the computer program/instructions are executed by a processor, the voice control method provided in the foregoing embodiments is implemented.


Abstract

A voice control method and apparatus, an electronic device, and a storage medium. The voice control method includes: acquiring a voice instruction (S110); identifying the instruction type of the voice instruction; when the instruction type of the voice instruction is a non-immediate type, acquiring trigger condition information and target execution information corresponding to the voice instruction (S120); and when the trigger condition corresponding to the trigger condition information is met, executing a target operation on the target operation interface corresponding to the target execution information (S130). The method enables non-immediate instructions when a graphical interface is controlled by voice, improving the user experience.

Claims (20)

  1. A voice control method, characterized in that the method comprises:
    acquiring a voice instruction;
    when the instruction type of the voice instruction is a non-immediate type, acquiring trigger condition information and target execution information corresponding to the voice instruction; and
    when a trigger condition corresponding to the trigger condition information is met, executing a target operation on a target operation interface corresponding to the target execution information.
  2. The method according to claim 1, characterized in that executing the target operation on the target interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met comprises:
    when the trigger condition corresponding to the trigger condition information is met, matching the target execution information with interface operable elements to obtain the target operation interface, and executing the target operation on the target operation interface.
  3. The method according to claim 2, characterized in that matching the target execution information with interface operable elements to obtain the target operation interface and executing the target operation on the target operation interface comprises:
    matching the target execution information with interface operable elements to obtain the matched interface operable elements in the target operation interface as target operable elements; and
    executing, on the target interface, the operations corresponding to the target operable elements.
  4. The method according to claim 1, characterized in that, before executing the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met, the method further comprises:
    matching the target execution information with interface operable elements to obtain the target operation interface.
  5. The method according to claim 4, characterized in that executing the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met comprises:
    generating a corresponding control instruction according to the trigger condition information and the target operation interface; and
    executing the control instruction, the control instruction being used to execute the target operation on the target operation interface corresponding to the target execution information when the trigger condition corresponding to the trigger condition information is met.
  6. The method according to any one of claims 1-5, characterized in that acquiring the trigger condition information and target execution information corresponding to the voice instruction comprises:
    acquiring instruction text corresponding to the voice instruction; and
    acquiring the trigger condition information and target execution information contained in the instruction text.
  7. The method according to claim 6, characterized in that acquiring the trigger condition information and target execution information contained in the instruction text comprises:
    inputting the instruction text corresponding to the voice instruction into a pre-trained instruction recognition model to obtain the trigger condition information and target execution information contained in the instruction text, the instruction recognition model being trained based on hierarchical reinforcement learning.
  8. The method according to claim 7, characterized in that the instruction recognition model comprises a first sub-module corresponding to trigger condition information, a second sub-module corresponding to target execution information, and a cooperative control module, and the training process of the instruction recognition model comprises:
    creating a first sub-module corresponding to a recognition task of identifying trigger condition information, a second sub-module corresponding to a recognition task of identifying target execution information, and a cooperative control module for coordinating the recognition tasks, the first sub-module and the second sub-module being used to decide the actions of their corresponding recognition tasks, and the decision-making priority of the cooperative control module being higher than the decision-making priorities of the first sub-module and the second sub-module; and
    performing deep reinforcement learning training on the cooperative control module, the first sub-module, and the second sub-module based on text samples annotated with trigger condition information, target execution information, and the recognition order of the recognition tasks, to obtain the trained instruction recognition model.
  9. The method according to claim 8, characterized in that performing deep reinforcement learning training on the cooperative control module, the first sub-module, and the second sub-module based on the text samples annotated with trigger condition information, target execution information, and the recognition order of the recognition tasks, to obtain the trained instruction recognition model, comprises:
    inputting the text samples into the cooperative control module, the first sub-module, and the second sub-module to obtain the output results of the first sub-module and the second sub-module, as well as the execution order coordinated by the cooperative control module;
    determining a first reward corresponding to the first sub-module based on the output result of the first sub-module and the trigger condition information annotated on the text samples;
    determining a second reward corresponding to the second sub-module based on the output result of the second sub-module and the target execution information annotated on the text samples;
    determining a third reward corresponding to the cooperative control module based on the execution order coordinated by the cooperative control module and the recognition order annotated on the text samples; and
    performing deep reinforcement learning training on the first sub-module based on the first reward, on the second sub-module based on the second reward, and on the cooperative control module based on the third reward, until a preset termination condition is met, to obtain the trained instruction recognition model.
  10. The method according to any one of claims 1-5, characterized in that acquiring the trigger condition information and target execution information corresponding to the voice instruction comprises:
    performing semantic recognition on text content corresponding to the voice instruction to obtain a semantic recognition result; and
    determining the trigger condition information and the target execution information according to the semantic recognition result.
  11. The method according to any one of claims 1-10, characterized in that, before acquiring the trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is the non-immediate type, the method further comprises:
    inputting instruction text corresponding to the voice instruction into a pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, the instruction type recognition model being used to identify whether the instruction type corresponding to the input instruction text is an immediate type or a non-immediate type.
  12. The method according to claim 11, characterized in that, before inputting the instruction text corresponding to the voice instruction into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction, the method further comprises:
    performing preset correction processing on the instruction text corresponding to the voice instruction; and
    inputting the instruction text corresponding to the voice instruction into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction comprises:
    inputting the instruction text after the preset correction processing into the pre-trained instruction type recognition model to obtain the instruction type of the voice instruction.
  13. The method according to any one of claims 1-10, characterized in that, before acquiring the trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is the non-immediate type, the method further comprises:
    performing word segmentation on text content corresponding to the voice instruction to obtain a word segmentation result;
    obtaining keywords in the text content according to the word segmentation result;
    matching the keywords in the text content with preset keywords, the preset keywords being preset keywords corresponding to non-immediate voice instructions;
    if any keyword in the text content matches a preset keyword, determining that the instruction type of the voice instruction is the non-immediate type; and
    if none of the keywords in the text content matches a preset keyword, determining that the instruction type of the voice instruction is the immediate type.
  14. The method according to any one of claims 1-13, characterized in that, after acquiring the voice instruction, the method further comprises:
    if the instruction type of the voice instruction is the immediate type, identifying target operation information corresponding to the voice instruction; and
    in response to the target operation information, executing the operation corresponding to the target operation information on the interface corresponding to the target operation information.
  15. The method according to any one of claims 1-14, characterized in that acquiring the voice instruction comprises:
    collecting, by a voice collection apparatus, the voice instruction input by the user when a voice assistant is enabled.
  16. The method according to any one of claims 1-14, characterized in that acquiring the voice instruction comprises:
    displaying a control for voice control of a graphical interface; and
    in response to a trigger operation on the control, enabling a voice collection apparatus and collecting, by the voice collection apparatus, the voice instruction input by the user.
  17. A voice control apparatus, characterized in that the apparatus comprises: an instruction acquisition module, an information acquisition module, and an operation execution module, wherein
    the instruction acquisition module is used to acquire a voice instruction;
    the information acquisition module is used to acquire trigger condition information and target execution information corresponding to the voice instruction when the instruction type of the voice instruction is a non-immediate type; and
    the operation execution module is used to execute a target operation on a target operation interface corresponding to the target execution information when a trigger condition corresponding to the trigger condition information is met.
  18. An electronic device, characterized by comprising:
    one or more processors;
    a memory; and
    one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs being configured to perform the method according to any one of claims 1-16.
  19. A computer-readable storage medium, characterized in that program code is stored in the computer-readable storage medium, the program code being callable by a processor to perform the method according to any one of claims 1-16.
  20. A computer program product, comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-16.
PCT/CN2022/121695 2021-11-29 2022-09-27 Voice control method and apparatus, electronic device, and storage medium WO2023093280A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111433111.8 2021-11-29
CN202111433111.8A CN114121005A (zh) Voice control method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023093280A1 true WO2023093280A1 (zh) 2023-06-01

Family

ID=80371572

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/121695 WO2023093280A1 (zh) Voice control method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114121005A (zh)
WO (1) WO2023093280A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114121005A (zh) 2021-11-29 2022-03-01 Voice control method and apparatus, electronic device, and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008170806A (ja) * 2007-01-12 2008-07-24 Sound signal processing device and program for specifying sound production periods
CN103973870A (zh) * 2013-01-28 2014-08-06 Information processing device and information processing method
CN105049963A (zh) * 2015-07-31 2015-11-11 Terminal control method and apparatus, and terminal
CN108597499A (zh) * 2018-04-02 2018-09-28 Voice processing method and voice processing apparatus
CN110459222A (zh) * 2019-09-06 2019-11-15 Voice control method, voice control apparatus, and terminal device
CN112415908A (zh) * 2020-11-26 2021-02-26 Smart device control method and apparatus, readable storage medium, and computer device
CN113641114A (zh) * 2020-04-27 2021-11-12 Environment control method and system for a smart wake-up scenario
CN114121005A (zh) * 2021-11-29 2022-03-01 Voice control method and apparatus, electronic device, and storage medium


Also Published As

Publication number Publication date
CN114121005A (zh) 2022-03-01

Similar Documents

Publication Publication Date Title
CN107210033B (zh) Updating language understanding classifier models for a digital personal assistant based on crowdsourcing
CN110634483B (zh) Human-computer interaction method and apparatus, electronic device, and storage medium
CN112513833A (zh) Electronic device and method for providing an artificial intelligence service based on pre-synthesized dialogue
CN112970059B (zh) Electronic device for processing user utterance and control method thereof
CN111261151B (zh) Voice processing method and apparatus, electronic device, and storage medium
US20200160863A1 (en) Electronic apparatus for processing user utterance and controlling method thereof
US20200051560A1 (en) System for processing user voice utterance and method for operating same
CN108882101B (zh) Playback control method and apparatus for a smart speaker, device, and storage medium
WO2020024620A1 (zh) Voice information processing method and apparatus, device, and storage medium
WO2020119541A1 (zh) Voice data recognition method, apparatus, and system
WO2023082703A1 (zh) Voice control method and apparatus, electronic device, and readable storage medium
CN111640429A (zh) Method for providing a voice recognition service and electronic device therefor
WO2023093280A1 (zh) Voice control method and apparatus, electronic device, and storage medium
KR20210001082A (ko) Electronic device for processing user utterance and operating method thereof
WO2023103917A1 (zh) Voice control method and apparatus, electronic device, and storage medium
WO2023103918A1 (zh) Voice control method and apparatus, electronic device, and storage medium
KR20210066651A (ko) Electronic device and control method thereof
US20220270604A1 (en) Electronic device and operation method thereof
KR20200033140A (ko) System and method for providing a voice assistant service
CN114822598A (зh) Server and voice emotion recognition method
US11893996B1 (en) Supplemental content output
US20220262359A1 (en) Electronic device and operation method thereof
US20220028385A1 (en) Electronic device for processing user utterance and method for operating thereof
CN115438625A (zh) Text error correction server, terminal device, and text error correction method
KR20220129927A (ko) Electronic device and method for providing voice recognition service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897366

Country of ref document: EP

Kind code of ref document: A1