EP3933570A1 - Procédé et appareil pour commander un assistant vocal, et support de stockage lisible sur ordinateur - Google Patents

Procédé et appareil pour commander un assistant vocal, et support de stockage lisible sur ordinateur Download PDF

Info

Publication number
EP3933570A1
EP3933570A1 (application EP21158588.0A)
Authority
EP
European Patent Office
Prior art keywords
speech
speech data
target
control instruction
interface
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP21158588.0A
Other languages
German (de)
English (en)
Inventor
Can ZHOU
Meng Wen
Xiaochuang LU
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Publication of EP3933570A1
Legal status: Pending (current)

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/06 Decision making techniques; Pattern matching strategies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present disclosure relates to the technical field of artificial intelligence, and more particularly, to a method and apparatus for speech assistant control, and a computer-readable storage medium.
  • More and more intelligent devices employ a speech assistant so that users can control the devices by speech.
  • a user can make a terminal device perform a corresponding operation by transmitting speech to a speech assistant.
  • a terminal device can usually receive speech data only within a short time window after a conversation between the user and the device.
  • the speech assistant needs to be woken up again once this time window is exceeded.
  • the terminal device usually exits the speech assistant directly when jumping from the speech assistant to the interface of an application other than the speech assistant. That is, the user cannot control the device through the speech assistant once another application is activated.
  • the present disclosure provides a method and apparatus for speech assistant control, and a computer-readable storage medium.
  • a method for speech assistant control which includes:
  • the displaying an interface corresponding to the target control instruction may include: displaying a window interface in the target interface in response to there being a window interface corresponding to the target control instruction.
  • the method may further include: closing the window interface in response to a display duration of the window interface reaching a target duration.
  • the determining whether a target control instruction to be executed is included in received second speech data based on the second speech data may include:
  • the instruction execution condition may include at least one of the following conditions:
  • the method may further include: in response to the target control instruction being included in the second speech data, displaying text information corresponding to the second speech data at a position corresponding to the speech reception identifier.
  • the method may further include:
  • Determining that the speech assistant meets the sleep state may be based on at least one of the following situations:
  • the method may further include:
  • the determining whether the received second speech data is speech data sent by the user to the terminal based on the detection information may include:
  • an apparatus for speech assistant control which includes:
  • the second display module may be configured to: display a window interface in the target interface in response to there being a window interface corresponding to the target control instruction.
  • the apparatus may further include: a closing module, configured to close the window interface in response to a display duration of the window interface reaching a target duration.
  • the first determination module may further include:
  • the instruction execution condition may include at least one of the following conditions:
  • the apparatus may further include: a third display module, configured to, in response to the target control instruction being included in the second speech data, display text information corresponding to the second speech data at a position corresponding to the speech reception identifier.
  • the apparatus may further include:
  • Determining that the speech assistant meets the sleep state may be based on at least one of the following situations:
  • the apparatus may further include:
  • the first determination module may be configured to determine, based on the received second speech data, whether the target control instruction to be executed is included in the second speech data, in response to determining that the second speech data is speech data sent by the user to the terminal.
  • the second determination module may include:
  • an apparatus for speech assistant control which includes:
  • the processor may be configured to:
  • a computer-readable storage medium stores computer program instructions that, when executed by a processor, implement the operations of the method for speech assistant control provided by the first aspect of the present disclosure.
  • a target interface corresponding to the control instruction is displayed.
  • a speech reception identifier is displayed in the target interface, and speech data is controlled to be continuously received.
  • interfaces of other applications can be displayed during interaction with the speech assistant, and the speech assistant can continuously receive speech data in the process of displaying the interfaces of other applications, so that corresponding operations can be executed in the interfaces of other applications through the speech assistant.
  • speech data can be continuously received in the displaying process of the target interface, so that repeated waking-up operations are not needed for a user.
  • operations can be carried out through the speech assistant when the target interface is displayed, so that comprehensive control based on both a graphical user interface and a speech user interface can be realized, the execution path of user operations can be effectively shortened, and the user's operations can be simplified.
  • FIG. 1 is a flowchart of a method for speech assistant control according to an exemplary embodiment. As shown in FIG. 1 , the method may include the following operations.
  • a target interface corresponding to a control instruction corresponding to received speech data is displayed according to the control instruction.
  • a speech assistant may be woken up by an existing wake-up word detection technology, for example by pre-recording speech data containing a wake-up word and training a wake-up word detection model, so that speech from a user can be detected in real time by the wake-up word detection model.
  • the speech assistant can be woken up when it is determined that the speech from the user includes the wake-up word.
  • the speech assistant may also be woken up by clicking a speech assistant icon or button, which is not limited in the present disclosure. A minimal wake-word gate is sketched below.
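  • The following is a minimal sketch (not taken from the patent) of the wake-word gate described above, assuming an ASR engine that yields streaming transcripts; the wake-word list and function names are illustrative assumptions.

```python
# Minimal wake-word gate over streaming ASR transcripts; all names are hypothetical.
WAKE_WORDS = {"hello assistant", "hi assistant"}

def contains_wake_word(transcript: str) -> bool:
    """Return True if any configured wake-up word appears in the transcript."""
    text = transcript.lower().strip()
    return any(word in text for word in WAKE_WORDS)

def listen_for_wake_word(transcript_stream) -> bool:
    """Consume streaming transcripts and report when the assistant should wake up."""
    for transcript in transcript_stream:
        if contains_wake_word(transcript):
            return True
    return False

# Example: a fake transcript stream standing in for a real detector.
stream = iter(["play music", "hello assistant, open the weather app"])
print(listen_for_wake_word(stream))  # True
```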
  • the speech assistant may receive speech from a user, so that speech data received by the speech assistant may be analyzed to determine a corresponding control instruction.
  • the method provided by the present disclosure may be applied to a terminal device with a display interface.
  • speech data sent by the user may be received while an interface corresponding to the speech assistant is displayed, so that speech recognition may be performed on the speech data, text information corresponding to the speech data may be obtained, and a control instruction included in the text information may be determined. Therefore, a target interface corresponding to the control instruction may be displayed.
  • a user sends a speech "Please open application A to reserve an airline ticket from B city to C city tomorrow" in an interface corresponding to a speech assistant, and in response to a control instruction corresponding to the speech data, a target interface, namely an inquiry interface for the airline ticket from B city to C city tomorrow in the application A, may be displayed.
  • the date corresponding to tomorrow may be calculated by obtaining the current time of the terminal.
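  • As a small worked example of this date resolution, the sketch below computes tomorrow's date from the terminal's current date (the ISO output format is an assumption).

```python
from datetime import date, timedelta
from typing import Optional

def resolve_tomorrow(today: Optional[date] = None) -> str:
    """Resolve the relative date 'tomorrow' from the device's current date."""
    today = today or date.today()
    return (today + timedelta(days=1)).isoformat()

print(resolve_tomorrow(date(2020, 6, 30)))  # "2020-07-01"
```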
  • In FIG. 2, a schematic diagram of a target interface is shown.
  • a speech reception identifier is displayed in the target interface, and speech data is controlled to be continuously received.
  • the speech data may be continuously received based on a full duplex technology.
  • in the related art, a terminal device usually exits the speech assistant directly when jumping from the speech assistant to the interface of an application other than the speech assistant. That is, a user cannot control the device through the speech assistant while other applications are activated.
  • in the present disclosure, by contrast, a speech reception identifier may be displayed in the target interface and continuous reception of speech data may be controlled, i.e., the speech assistant may be kept continuously in an operating state.
  • the speech reception identifier may be displayed in the lower portion of the target interface.
  • the transparency of the speech reception identifier may be adjusted by a user to meet the user's requirements for page displaying.
  • the speech reception identifier may be a static or dynamic picture identifier, and as shown at P in FIG. 3 , may be displayed at the borderline in the lower portion of the target interface, i.e., the speech reception identifier coincides with the lower boundary of the target interface.
  • FIG. 3 merely shows an exemplary display mode; for example, the display position and size of the speech reception identifier may be set according to an actual application scene or a setting instruction of the user. The present disclosure is not limited thereto.
  • displaying the speech reception identifier in the target interface may prompt the user that speech can still be sent for corresponding control by the speech assistant, and continuously receiving speech data may prevent the speech assistant from having to be repeatedly woken up by the user.
  • an operation instruction of the user may be received in the target interface, for example, an operation that the user views flights in the target interface in a sliding mode, so that the inquired flight information may be displayed in the target interface in a sliding mode in response to the operation instruction of the user.
  • speech data may be continuously received while the target interface is displayed, to complete the interaction between the user and the speech assistant. Therefore, the received second speech data may include ambient sound data, such as speech data from the user's conversation with another user, or speech data from other users.
  • whether a target control instruction to be executed is included in the second speech data may be determined by analyzing the received second speech data, so that the impact of ambient sound data can be removed from the received speech data, and the accuracy of the method for speech assistant control is improved.
  • an interface corresponding to the target control instruction is displayed in response to the target control instruction being included in the second speech data.
  • a target interface corresponding to the control instruction may be displayed.
  • a speech reception identifier may be displayed in the target interface and speech data may be controlled to be continuously received. Then, based on received second speech data in a displaying process of the target interface, it may be determined whether a target control instruction to be executed is included in the second speech data, and an interface corresponding to the target control instruction may be displayed in response to the target control instruction being included in the second speech data.
  • an interface of another application can be displayed during interaction between the user and the speech assistant, and the speech assistant may continuously receive speech data in the process of displaying the interface of another application, so that corresponding operations can be executed in the interface of another application through the speech assistant.
  • speech data may be continuously received in the displaying process of the target interface, so that the user does not need to perform repeated wake-up operations, which improves the convenience of using the speech assistant and the user experience.
  • operations can be carried out through the speech assistant in the displaying process of the target interface, so that comprehensive control based on both a graphical user interface and a speech user interface can be realized, the execution path of the user's operations can be effectively shortened, and the user's operations can be simplified.
  • an exemplary implementation manner of determining, based on received second speech data, whether the target control instruction to be executed is included in the second speech data is as follows.
  • the operation may include:
  • speech recognition is performed on the second speech data to obtain text information corresponding to the second speech data
  • the text information is matched against instructions in an instruction library.
  • the text information may be obtained through an automatic speech recognition (ASR) technology.
  • fuzzy matching can be carried out between the text information and the instructions in the instruction library. Matching can be carried out in an instruction matching mode commonly used in the related art, and a detailed description is not repeated here.
  • parameters corresponding to the target instruction may be determined by analyzing the text information, so that a target control instruction can be determined. For example, when speech data sent by a user is "play ABC song", the target instruction determined by matching with instructions in the instruction library may be playing song, and then a parameter corresponding to the target instruction may be determined to be ABC song by analyzing the text information, thereby generating a target control instruction to play the ABC song.
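  • The bullet above describes matching recognized text to an instruction template and then extracting its parameters. Below is a minimal sketch of that idea (not the patent's implementation): the instruction library, the fuzzy-matching cutoff and the regular-expression slot extraction are all illustrative assumptions.

```python
import difflib
import re
from typing import Optional, Tuple

# Hypothetical instruction library: template name -> pattern that extracts parameters.
INSTRUCTION_LIBRARY = {
    "play song": r"play (?P<song>.+?)(?: song)?$",
    "query weather": r"query the weather(?: condition)? of (?P<city>.+)$",
}

def match_target_instruction(text: str, cutoff: float = 0.4) -> Optional[Tuple[str, dict]]:
    """Fuzzy-match recognized text to an instruction template, then extract parameters."""
    text = text.lower().strip()
    candidates = difflib.get_close_matches(text, list(INSTRUCTION_LIBRARY), n=1, cutoff=cutoff)
    # Fall back to a keyword check, since a full utterance rarely equals the template.
    template = candidates[0] if candidates else next(
        (name for name in INSTRUCTION_LIBRARY if name.split()[0] in text), None)
    if template is None:
        return None  # no target instruction found in this utterance
    slots = re.search(INSTRUCTION_LIBRARY[template], text)
    return template, (slots.groupdict() if slots else {})

print(match_target_instruction("play ABC song"))  # ('play song', {'song': 'abc'})
```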
  • the continuous reception of speech data can be controlled during the displaying of the target interface, and therefore, an instruction actually to be executed needs to be determined from the received speech data. Therefore, according to the above technical solution, a target instruction corresponding to second speech data may be determined by analyzing the second speech data, meanwhile, whether the target instruction is an instruction actually needed to be executed may be determined by determining whether the text information meets the instruction execution condition, so as to provide data support for accurately determining the target control instruction. Meanwhile, the impact of an instruction in the ambient sound data on the accuracy of the method for speech assistant control can be effectively avoided, thereby ensuring the accuracy of the method for speech assistant control.
  • the instruction execution condition may include at least one of the following conditions.
  • the last speech data is the speech data corresponding to the last control instruction executed by the speech assistant.
  • the voiceprint features of the second speech data may be extracted when the second speech data is received, so that they may be compared with the voiceprint features of the last speech data. If the voiceprint features match those of the last speech data, the second speech data and the last speech data were sent by the same user. In this case, it may be determined that the text information meets the instruction execution condition, so that the impact of speech sent by other users can be avoided.
  • voiceprint features corresponding to the text information are voiceprint features of a target user.
  • the target user may be a master user of a terminal device, or may be the master user and other preset legal users.
  • the voiceprint features of the target user may be pre-recorded and extracted to store the voiceprint features of the target user.
  • the voiceprint features of the second speech data may be directly extracted and compared with the voiceprint features of the target user. If voiceprint features matching those of the second speech data exist among the voiceprint features of the target user, it indicates that the second speech data was sent by the target user. In this case, it may be determined that the text information meets the instruction execution condition, so that the impact of speech sent by other users can be avoided.
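  • A minimal sketch of such a voiceprint check is shown below, assuming a speaker-embedding model that maps an utterance to a fixed-length vector; the embeddings and the similarity threshold are illustrative assumptions.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def voiceprint_matches(current: List[float], enrolled: List[List[float]],
                       threshold: float = 0.8) -> bool:
    """True if the current utterance matches any pre-recorded target-user voiceprint."""
    return any(cosine_similarity(current, reference) >= threshold for reference in enrolled)

# Example with toy 3-dimensional "embeddings" standing in for real speaker vectors.
enrolled_users = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]
print(voiceprint_matches([0.88, 0.12, 0.19], enrolled_users))  # True
```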
  • a user usually interacts with a speech assistant within the same scene, i.e., speech from a user usually carries continuous contextual information. Therefore, in this embodiment, a semantic feature judgment model may be trained in advance, through a natural language processing (NLP) method, on training sentences with continuous semantic features, so that after the text information corresponding to the second speech data is determined, this text information and the text information corresponding to the last speech data may be input into the semantic feature judgment model. It can thereby be determined whether the semantic features of the two pieces of text information are continuous.
  • if the semantic features are determined to be continuous, it indicates that the text information corresponding to the second speech data follows on from the text information of the last speech data, and that the target instruction corresponding to the second speech data is an actual instruction sent by the user to the speech assistant.
  • the text information may be determined to meet the instruction execution condition so as to ensure the accuracy of the determined target control instruction.
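  • The patent relies on a trained semantic judgment model for this continuity check. Purely as an illustrative stand-in (an assumption, not the described model), the sketch below scores continuity by content-word overlap between consecutive utterances.

```python
# Stand-in heuristic for the semantic-continuity check; a real system would use a
# trained NLP judgment model as described in the text.
STOP_WORDS = {"the", "a", "an", "to", "of", "for", "please", "from"}

def content_words(text: str) -> set:
    return {word for word in text.lower().split() if word not in STOP_WORDS}

def semantically_continuous(previous_text: str, current_text: str,
                            min_overlap: int = 1) -> bool:
    """Treat two utterances as continuous if they share enough content words."""
    return len(content_words(previous_text) & content_words(current_text)) >= min_overlap

print(semantically_continuous(
    "reserve an airline ticket from B city to C city tomorrow",
    "query the weather condition of C city"))  # True (shares "c" and "city")
```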
  • the multiple instruction execution conditions described above may be determined comprehensively.
  • An instruction execution condition may be determined to be satisfied when the instruction execution condition includes multiple conditions and the multiple conditions are simultaneously satisfied.
  • whether the target instruction should be executed can be verified through the voiceprint features corresponding to the text information or the semantic features corresponding to the text information, so that the target instruction determined from the second speech data can be further verified, the real target operation instruction for speech assistant control can be determined, the accuracy of speech assistant control can be further guaranteed, and the user experience is improved.
  • the present disclosure also provides the following embodiments.
  • the method may further include:
  • if the second speech data is not speech data sent by the user to the terminal, it indicates that the second speech data is ambient sound data rather than data intended for interaction with the speech assistant. In this case, it may not be necessary to parse the second speech data.
  • the operation 13 of determining, based on the received second speech data, whether the target control instruction to be executed is included in the second speech data is executed.
  • the second speech data may be preliminarily judged in advance.
  • only when the second speech data is determined to be speech data sent by the user to the terminal, namely data used for interacting with the speech assistant, is the second speech data analyzed. In this way, the volume of speech data to be processed by the speech assistant can be effectively reduced, resource waste caused by analyzing ambient sound data is avoided, and the accuracy and real-time responsiveness of subsequent operations of the speech assistant can be guaranteed.
  • an exemplary implementation manner of determining whether the received second speech data is speech data sent by the user to the terminal based on the detection information is as follows.
  • the operation may include: when the detection information is rotation angle information of the terminal, determining that the second speech data is speech data sent by the user to the terminal in response to determining that a distance between a microphone array of the terminal and a speech data source is reduced based on the rotation angle information of the terminal.
  • an angular velocity at which the terminal rotates can be detected by a gyroscope, and then the rotation angle information of the terminal can be determined by integrating the angular velocity.
  • Whether the second speech data is speech data sent by the user to the terminal may be determined by determining a distance between a microphone array of the terminal and a source of speech data during this rotation.
  • the change in the distance between the microphone array and the speech data source to which the speech data corresponds may be determined based on the rotation angle and the position of the microphone array in the terminal.
  • the second speech data may be determined to be the speech data sent by the user to the terminal.
  • if the distance between the microphone array and the speech data source becomes larger, it indicates that the user has rotated the terminal away from the user, that is, the user no longer interacts with the speech assistant on the terminal.
  • movement information of the terminal may also be acquired using an accelerometer.
  • in response to determining, based on the movement information, that the distance between the microphone array and the speech data source is reduced, the second speech data is determined to be the speech data sent by the user to the terminal. A minimal sketch of the rotation-based check is given below.
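  • The sketch below illustrates the motion check under simplified assumptions (a microphone at a fixed offset from the rotation axis and a fixed source position in the plane); it is not the patent's geometry, only an illustration of integrating angular velocity and then comparing distances.

```python
import math
from typing import List, Tuple

def integrate_rotation(angular_velocity: List[float], dt: float) -> float:
    """Integrate gyroscope angular-velocity samples (rad/s) over time step dt to an angle (rad)."""
    return sum(w * dt for w in angular_velocity)

def mic_source_distance(angle: float, mic_offset: float, source: Tuple[float, float]) -> float:
    """Distance from a microphone at `mic_offset` from the rotation axis to the speech source."""
    mic = (mic_offset * math.cos(angle), mic_offset * math.sin(angle))
    return math.hypot(source[0] - mic[0], source[1] - mic[1])

def addressed_to_terminal(angular_velocity: List[float], dt: float,
                          mic_offset: float, source: Tuple[float, float]) -> bool:
    """True if the detected rotation brought the microphone closer to the speech source."""
    before = mic_source_distance(0.0, mic_offset, source)
    after = mic_source_distance(integrate_rotation(angular_velocity, dt), mic_offset, source)
    return after < before

# Example: ~0.5 rad of rotation toward a nearby speech source.
print(addressed_to_terminal([0.5, 0.5], dt=0.5, mic_offset=0.07, source=(0.3, 0.2)))  # True
```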
  • gaze estimation may be performed based on the face image information, and it may be determined that the second speech data is speech data sent by the user to the terminal in response to determining that a gaze point corresponding to the face image information is at the terminal based on the gaze estimation.
  • the face image information may be acquired through a camera device in the terminal, then face recognition and face key point extraction may be carried out, and then a gaze point corresponding to a face in the face image information may be determined through a gaze estimation technology.
  • if the gaze point is at the terminal, it indicates that the user is looking at the terminal. In this case, it may be determined that the second speech data is the speech data sent by the user to the terminal.
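  • After gaze estimation, the final decision reduces to checking whether the estimated gaze point falls on the terminal. The sketch below assumes a gaze estimator already producing a point in screen coordinates; the screen size is an illustrative assumption.

```python
from typing import Tuple

def gaze_on_terminal(gaze_point: Tuple[float, float],
                     screen_size: Tuple[float, float]) -> bool:
    """True if the estimated gaze point (x, y) lies inside the screen rectangle."""
    x, y = gaze_point
    width, height = screen_size
    return 0.0 <= x <= width and 0.0 <= y <= height

# Example: a gaze point estimated on a 1080 x 2340 display.
print(gaze_on_terminal((540.0, 1200.0), (1080.0, 2340.0)))  # True
```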
  • in the above technical solution, whether the second speech data is speech data sent by a user to the terminal is determined by acquiring detection information of the terminal, so that data actually sent to the speech assistant can be determined directly and quickly.
  • technical support can thus be provided for subsequently reducing the volume of speech data to be analyzed, the impact of ambient sound data on the method for speech assistant control can be effectively avoided, and the use requirements of the user are met.
  • a user who sends speech data to a terminal may be determined based on a method for speaker orientation by a microphone array in a voiceprint recognition technology.
  • a user who is actually sending speech to the terminal may be determined by blind source separation.
  • the orientation method and the blind source separation technology are conventional art and will not be described in detail herein.
  • displaying an interface corresponding to the target control instruction in response to the target control instruction being included in the second speech data may include the following embodiments.
  • an application corresponding to the target control instruction may first be determined. If multiple applications corresponding to the target control instruction exist in the terminal, for example multiple music players exist when the target control instruction is to play the song ABC, then the default application for playing music in the terminal may be determined to be the application corresponding to the target control instruction, or the application most frequently used by the user for playing music may be determined as the application corresponding to the target control instruction. After the application corresponding to the target control instruction is determined, the interface corresponding to the target control instruction can be determined from the interfaces of that application.
  • when the interface corresponding to the target control instruction is determined, if the determined interface and the target interface belong to the same application, the interface corresponding to the target control instruction may be directly displayed. If the determined interface and the target interface do not belong to the same application, the determined interface may be displayed after jumping from the current application to the application to which the determined interface belongs.
  • an exemplary implementation manner of displaying the interface corresponding to the target control instruction is as follows.
  • the operation may include: under the condition that there is a window interface corresponding to the target control instruction, the window interface is displayed in the target interface.
  • window interfaces may be set in advance for multiple instructions, such as a calculator and weather, and a window interface corresponding relationship may be stored.
  • an instruction corresponding to a window interface may be set in advance.
  • the window interface may be displayed, i.e., the window interface is displayed in the currently displayed target interface.
  • the window interface is located on the upper layer of the target interface. In an example, the size of the window interface is smaller than the size of the target interface.
  • in response to the target control instruction being determined to be included in the second speech data, it may first be queried, according to the window interface corresponding relationship, whether there is a window interface corresponding to the target control instruction.
  • a speech "query the weather condition of city C" may be sent out.
  • the speech assistant determines a target control instruction for querying the weather based on the speech data, it may be queried whether a window interface corresponding to the target control instruction exists according to the window interface corresponding relationship.
  • a window interface corresponding to the weather query result may then be displayed in the target interface for the airline ticket query, as shown at Q in FIG. 4 .
  • a window interface corresponding to the target control instruction can be displayed in a currently displayed target interface.
  • a result may be prompted to a user without switching between applications.
  • the use requirement of a user can be met, a response delay caused by switching between applications can be effectively avoided, and the use experience can be further improved.
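  • Below is a minimal sketch of the window-interface lookup described above: a pre-stored mapping from instructions to window interfaces is consulted, the window is overlaid on the current target interface when a mapping exists, and otherwise the terminal switches to the target application. The mapping contents and names are illustrative assumptions.

```python
# Hypothetical instruction -> window-interface correspondence relationship.
WINDOW_INTERFACES = {"query weather": "weather_window", "calculate": "calculator_window"}

def display_result(target_instruction: str, current_interface: str) -> str:
    """Show a window on top of the current interface if one is configured, else switch apps."""
    if target_instruction in WINDOW_INTERFACES:
        # Window interface exists: display it on the upper layer of the target interface.
        return f"{current_interface} + overlay:{WINDOW_INTERFACES[target_instruction]}"
    # No window interface configured: jump to the application handling the instruction.
    return f"switch_to:{target_instruction}"

print(display_result("query weather", "airline_ticket_query"))
# airline_ticket_query + overlay:weather_window
```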
  • the method may further include: the window interface is closed in response to a display duration of the window interface reaching a target duration.
  • the target duration may be set according to the actual usage scenario, which is not limited by the present disclosure.
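  • As a small illustration of closing the window once its display duration is reached, the sketch below schedules a close callback with a timer; the duration and the callback are assumptions.

```python
import threading

def show_window_with_timeout(close_window, target_duration_s: float) -> threading.Timer:
    """Schedule `close_window` to run once the target display duration is reached."""
    timer = threading.Timer(target_duration_s, close_window)
    timer.start()
    return timer

timer = show_window_with_timeout(lambda: print("window closed"), target_duration_s=0.1)
timer.join()  # block only for demonstration; a real UI loop would not wait here
```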
  • a current application may be switched to the indicated application to display an interface corresponding to the target control instruction, so that an execution result of the target control instruction can be displayed.
  • the method may further include: in response to the target control instruction being included in the second speech data, text information corresponding to the second speech data is displayed at a position corresponding to the speech reception identifier.
  • a user sends a speech "query weather conditions of city C".
  • text information may be displayed at a position corresponding to a speech reception identifier, as shown at M in FIG. 5 .
  • the query result is displayed through a window interface, as shown at Q in FIG. 4 .
  • in this way, on one hand, the speech received by the speech assistant can be shown to the user, so that the user can conveniently determine whether the target control instruction executed by the speech assistant is accurate; on the other hand, the user's speech can be responded to before the interface corresponding to the target control instruction is displayed, so that the real-time performance of human-computer interaction is improved and the user can use the speech assistant conveniently.
  • since text information corresponding to the second speech data may be displayed at a position corresponding to the speech reception identifier in the present disclosure, the user can experience more accurate interaction and the accuracy of speech assistant control is improved.
  • the method may further include: displaying a speech waiting identifier in the target interface and monitoring for a wake-up word or a speech hot word when the speech assistant meets the sleep state.
  • the wake-up word and the speech hot word can be detected in a manner similar to the wake-up word detection described above, and further description thereof will be omitted.
  • the display images corresponding to the speech waiting identifier and the speech reception identifier are different, as shown at N in FIG. 6 .
  • the size and position corresponding to the speech waiting identifier and the speech reception identifier may be the same or different, and the present disclosure is not limited thereto.
  • it may be determined that the speech assistant meets the sleep state based on at least one of the following situations:
  • the duration of the first preset time period and the second preset time period may be set according to an actual use scenario.
  • the duration of the first preset time period may be set to 10 minutes
  • the duration of the second preset time period may be set to 20 minutes.
  • if the target control instruction is not included in the speech data received within the first preset time period, that is, no target control instruction is determined from the speech data received within 10 minutes after execution of the last target control instruction, it indicates that the current user no longer interacts with the speech assistant, and it may be determined that the speech assistant satisfies the sleep state.
  • if no speech data is received within the second preset time period, that is, no speech data is received within 20 minutes after execution of the last target control instruction, it indicates that the current user no longer interacts with the speech assistant, and it may be determined that the speech assistant satisfies the sleep state.
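  • A minimal sketch of this timeout-based sleep check is given below, assuming the assistant tracks the time of the last executed instruction and of the last received speech; the 10- and 20-minute values follow the example durations above.

```python
import time
from typing import Optional

FIRST_PRESET_S = 10 * 60   # no target control instruction within 10 minutes
SECOND_PRESET_S = 20 * 60  # no speech data received within 20 minutes

def should_sleep(last_instruction_ts: float, last_speech_ts: float,
                 now: Optional[float] = None) -> bool:
    """True if either preset time period has elapsed, i.e. the assistant should sleep."""
    now = now if now is not None else time.time()
    no_instruction = (now - last_instruction_ts) > FIRST_PRESET_S
    no_speech = (now - last_speech_ts) > SECOND_PRESET_S
    return no_instruction or no_speech

# Example: last instruction 11 minutes ago, last speech 5 minutes ago -> enter sleep state.
now = time.time()
print(should_sleep(now - 11 * 60, now - 5 * 60, now))  # True
```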
  • the speech assistant may be controlled to enter the sleep state, so that resources and energy consumption occupied by the speech assistant can be effectively saved.
  • the speech reception identifier is displayed in the target interface. That is, the speech assistant is woken up when the wake-up word is detected, and the speech reception identifier is displayed in the target interface for continuously receiving speech data.
  • a control instruction corresponding to the speech hot word is executed under the condition that the speech hot word is detected.
  • the speech hot word may be used for waking up the speech assistant.
  • in addition to waking up the speech assistant, the speech hot word also corresponds to a control instruction.
  • accordingly, when the speech hot word is detected, the speech assistant may be directly woken up, the control instruction corresponding to the speech hot word may be executed, and the speech reception identifier may be displayed in the target interface for continuously receiving speech data. A sketch of this hot-word handling is given below.
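  • The sketch below illustrates the distinction drawn above: a wake-up word only wakes the assistant, while a speech hot word wakes it and immediately triggers its associated control instruction. The word lists and the execute() callback are illustrative assumptions.

```python
WAKE_WORDS = {"hello assistant"}
HOT_WORDS = {"pause playback": "pause", "next song": "next"}  # hot word -> instruction

def handle_while_sleeping(transcript: str, execute) -> str:
    """Decide how a sleeping assistant reacts to an utterance."""
    text = transcript.lower().strip()
    if text in HOT_WORDS:
        execute(HOT_WORDS[text])  # wake up and execute the hot word's control instruction
        return "awake: show speech reception identifier"
    if any(word in text for word in WAKE_WORDS):
        return "awake: show speech reception identifier"
    return "asleep: keep showing speech waiting identifier"

print(handle_while_sleeping("next song", execute=lambda cmd: print("executing", cmd)))
# executing next
# awake: show speech reception identifier
```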
  • alternatively, when it is determined that the speech assistant satisfies the sleep state, the speech assistant may directly quit, and may then be woken up again by detecting a wake-up word.
  • a speech assistant can be controlled to sleep in response to determining that a user does not interact with the speech assistant any longer, so that resources and energy consumption occupied by the speech assistant can be effectively saved.
  • a speech waiting identifier is displayed in the target interface to prompt that the speech assistant is in a sleep state, so that a user can activate the speech assistant through speech hot words later and can use the speech assistant conveniently. The use process is simplified, thereby further improving user experience.
  • the apparatus 10 includes: a first display module 100, a control module 200, a first determination module 300, and a second display module 400.
  • the first display module 100 is configured to, according to a control instruction corresponding to received speech data, display a target interface corresponding to the control instruction after a speech assistant is woken up.
  • the control module 200 is configured to, in response to the target interface being different from an interface of the speech assistant, display a speech reception identifier in the target interface and control to continuously receive speech data.
  • the first determination module 300 is configured to, based on received second speech data in a displaying process of the target interface, determine whether a target control instruction to be executed is included in the second speech data.
  • the second display module 400 is configured to display an interface corresponding to the target control instruction in response to the target control instruction being included in the second speech data.
  • the second display module is configured to: in response to there being a window interface corresponding to the target control instruction, display the window interface in the target interface.
  • the apparatus may further include: a closing module, configured to close the window interface in response to a display duration of the window interface reaching a target duration.
  • the first determination module may further include:
  • the instruction execution condition may include at least one of the following conditions:
  • the apparatus may further include: a third display module, configured to, in response to the target control instruction being included in the second speech data, display text information corresponding to the second speech data at a position corresponding to the speech reception identifier.
  • the apparatus may further include:
  • determining that the speech assistant meets the sleep state may be based on at least one of the following situations:
  • the apparatus may further include:
  • the first determination module is configured to determine, based on the received second speech data, whether the target control instruction to be executed is included in the second speech data, in response to determining that the second speech data is speech data sent by the user to the terminal.
  • the second determination module may include:
  • the present disclosure also provides a computer-readable storage medium.
  • the computer-readable storage medium stores computer program instructions that, when executed by a processor, implement the operations of the method for speech assistant control provided by the present disclosure.
  • FIG. 8 is a block diagram of an apparatus 800 for speech assistant control according to an exemplary embodiment.
  • the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a gaming console, a tablet, a medical device, exercise equipment, a personal digital assistant, and the like.
  • the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
  • the processing component 802 typically controls overall operations of the apparatus 800, such as the operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the operations in the above described method for speech assistant control.
  • the processing component 802 may include one or more modules which facilitate the interaction between the processing component 802 and other components.
  • the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
  • the memory 804 is configured to store various types of data to support the operation of the apparatus 800. Examples of such data include instructions for any applications or methods operated on the apparatus 800, contact data, phonebook data, messages, pictures, video, etc.
  • the memory 804 may be implemented using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic or optical disk.
  • the power component 806 provides power to various components of the apparatus 800.
  • the power component 806 may include a power management system, one or more power sources, and any other components associated with the generation, management, and distribution of power in the apparatus 800.
  • the multimedia component 808 includes a screen providing an output interface between the apparatus 800 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes the TP, the screen may be implemented as a touch screen to receive an input signal from the user.
  • the TP includes one or more touch sensors to sense touches, swipes and gestures on the TP. The touch sensors may not only sense a boundary of a touch or swipe action but also detect a duration and pressure associated with the touch or swipe action.
  • the multimedia component 808 includes a front camera and/or a rear camera. The front camera and the rear camera may receive an external multimedia datum while the apparatus 800 is in an operation mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capability.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (MIC) configured to receive an external audio signal when the apparatus 800 is in an operation mode, such as a call mode, a recording mode, and a speech recognition mode.
  • the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
  • the audio component 810 may further include a speaker to output audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, such as a keyboard, a click wheel, buttons, and the like.
  • the buttons may include, but are not limited to, a home button, a volume button, a starting button, and a locking button.
  • the sensor component 814 includes one or more sensors to provide status assessments of various aspects of the apparatus 800.
  • the sensor component 814 may detect an open/closed status of the apparatus 800, relative positioning of components, e.g., the display and the keypad, of the apparatus 800, a change in position of the apparatus 800 or a component of the apparatus 800, a presence or absence of user contact with the apparatus 800, an orientation or an acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800.
  • the sensor component 814 may include a proximity sensor configured to detect presence of an object nearby without any physical contact.
  • the sensor component 814 may also include a light sensor, such as a complementary metal oxide semiconductor (CMOS) or charge coupled device (CCD) image sensor, configured for use in an imaging application.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 is configured to facilitate communication, wired or wirelessly, between the apparatus 800 and other devices.
  • the apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
  • the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel.
  • the communication component 816 may further include a near field communication (NFC) module to facilitate short-range communications.
  • the NFC module may be implemented based on a radio frequency identification (RFID) technology, an infrared data association (IrDA) technology, an ultra-wideband (UWB) technology, a Bluetooth (BT) technology, and other technologies.
  • the apparatus 800 may be implemented with one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components, for performing the above described methods.
  • a non-transitory computer-readable storage medium including instructions, such as instructions included in the memory 804 and executable by the processor 820 in the apparatus 800, is also provided for performing the above-described method for speech assistant control.
  • the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device and the like.
  • a computer program product is further provided.
  • the computer program product includes a computer program that can be executed by a programmable apparatus, and the computer program has a code part for performing the above method for speech assistant control when executed by the programmable apparatus.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • User Interface Of Digital Computer (AREA)
EP21158588.0A 2020-06-30 2021-02-23 Procédé et appareil pour commander un assistant vocal, et support de stockage lisible sur ordinateur Pending EP3933570A1 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010621486.6A CN111833868A (zh) 2020-06-30 2020-06-30 语音助手控制方法、装置及计算机可读存储介质

Publications (1)

Publication Number Publication Date
EP3933570A1 true EP3933570A1 (fr) 2022-01-05

Family

ID=72899946

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21158588.0A Pending EP3933570A1 (fr) 2020-06-30 2021-02-23 Procédé et appareil pour commander un assistant vocal, et support de stockage lisible sur ordinateur

Country Status (3)

Country Link
US (1) US20210407521A1 (fr)
EP (1) EP3933570A1 (fr)
CN (1) CN111833868A (fr)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114915916B (zh) * 2021-02-08 2023-08-22 华为技术有限公司 定向控制电子设备的方法及电子设备、可读介质
CN112786048A (zh) * 2021-03-05 2021-05-11 百度在线网络技术(北京)有限公司 一种语音交互方法、装置、电子设备和介质
CN115810354A (zh) * 2021-09-14 2023-03-17 北京车和家信息技术有限公司 语音控制方法、装置、设备及介质
CN114327349B (zh) * 2021-12-13 2024-03-22 青岛海尔科技有限公司 智能卡片的确定方法及装置、存储介质、电子装置
CN115200168B (zh) * 2022-07-13 2023-10-03 深圳中集天达空港设备有限公司 通道空调的控制方法、装置、电子设备和存储介质
CN115933501A (zh) * 2023-01-05 2023-04-07 东方空间技术(山东)有限公司 一种火箭控制软件的操作控制方法、装置及设备

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140040748A1 (en) * 2011-09-30 2014-02-06 Apple Inc. Interface for a Virtual Digital Assistant
US10586535B2 (en) * 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7805310B2 (en) * 2001-02-26 2010-09-28 Rohwer Elizabeth A Apparatus and methods for implementing voice enabling applications in a converged voice and data network environment
TWI521936B (zh) * 2008-04-25 2016-02-11 台達電子工業股份有限公司 對話系統與語音對話處理方法
US9152376B2 (en) * 2011-12-01 2015-10-06 At&T Intellectual Property I, L.P. System and method for continuous multimodal speech and gesture interaction
US9343068B2 (en) * 2013-09-16 2016-05-17 Qualcomm Incorporated Method and apparatus for controlling access to applications having different security levels
JP6739907B2 (ja) * 2015-06-18 2020-08-12 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカPanasonic Intellectual Property Corporation of America 機器特定方法、機器特定装置及びプログラム
KR102636638B1 (ko) * 2016-12-21 2024-02-15 삼성전자주식회사 컨텐츠 운용 방법 및 이를 구현한 전자 장치
US11016729B2 (en) * 2017-11-08 2021-05-25 International Business Machines Corporation Sensor fusion service to enhance human computer interactions
WO2020108740A1 (fr) * 2018-11-27 2020-06-04 Unify Patente Gmbh & Co. Kg Procédé de commande d'une conversation en temps réel et plateforme de communication et de collaboration en temps réel
CN109830233A (zh) * 2019-01-22 2019-05-31 Oppo广东移动通信有限公司 语音助手的交互方法、装置、存储介质及终端
CN111724775B (zh) * 2019-03-22 2023-07-28 华为技术有限公司 一种语音交互方法及电子设备
EP3977257A1 (fr) * 2019-05-31 2022-04-06 Google LLC Attribution dynamique de données circonstancielles à modalités multiples à des demandes d'actions d'assistant pour la corrélation avec des demandes ultérieures
CN113760427B (zh) * 2019-08-09 2022-12-16 荣耀终端有限公司 显示页面元素的方法和电子设备
CN110825469A (zh) * 2019-09-18 2020-02-21 华为技术有限公司 语音助手显示方法及装置

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140040748A1 (en) * 2011-09-30 2014-02-06 Apple Inc. Interface for a Virtual Digital Assistant
US10586535B2 (en) * 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment

Also Published As

Publication number Publication date
US20210407521A1 (en) 2021-12-30
CN111833868A (zh) 2020-10-27

Similar Documents

Publication Publication Date Title
EP3933570A1 (fr) Procédé et appareil pour commander un assistant vocal, et support de stockage lisible sur ordinateur
CN107919123B (zh) 多语音助手控制方法、装置及计算机可读存储介质
CN108804010B (zh) 终端控制方法、装置及计算机可读存储介质
US20170060599A1 (en) Method and apparatus for awakening electronic device
EP4184506A1 (fr) Traitement audio
CN106791893A (zh) 视频直播方法及装置
EP3046016A1 (fr) Procédé et appareil de commutation de mode affichage
CN110554815A (zh) 图标唤醒方法、电子设备和存储介质
CN111063354B (zh) 人机交互方法及装置
US11222223B2 (en) Collecting fingerprints
EP3299946B1 (fr) Procédé et dispositif de commutation d'une image environnementale
CN106791921A (zh) 视频直播的处理方法及装置
CN110730360A (zh) 视频上传、播放的方法、装置、客户端设备及存储介质
CN108133708B (zh) 一种语音助手的控制方法、装置及移动终端
CN111540350B (zh) 一种智能语音控制设备的控制方法、装置及存储介质
EP3249575A1 (fr) Procédé et appareil de détection de pression
CN108874450B (zh) 唤醒语音助手的方法及装置
CN111580773A (zh) 信息处理方法、装置及存储介质
CN108766427B (zh) 语音控制方法及装置
CN115733918A (zh) 飞行模式的切换方法、装置、电子设备及存储介质
CN110062276A (zh) 音视频数据的处理方法、装置及电子设备和存储介质
CN111968680A (zh) 一种语音处理方法、装置及存储介质
CN105786561B (zh) 进程调用的方法及装置
CN112509596A (zh) 唤醒控制方法、装置、存储介质及终端
CN113936697A (zh) 语音处理方法、装置以及用于语音处理的装置

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN PUBLISHED

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

B565 Issuance of search results under rule 164(2) epc

Effective date: 20211006

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20220627

RBV Designated contracting states (corrected)

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR