CN118197305A - Voice control method and device and electronic equipment - Google Patents


Info

Publication number: CN118197305A
Application number: CN202410295204.6A
Authority: CN (China)
Prior art keywords: information, control, voice, text content, target control
Legal status: Pending (assumed status; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 熊新雷, 周华, 庞敏辉
Current assignee: Apollo Zhilian Beijing Technology Co Ltd
Original assignee: Apollo Zhilian Beijing Technology Co Ltd
Events: application filed by Apollo Zhilian Beijing Technology Co Ltd; priority to CN202410295204.6A; publication of CN118197305A; legal status: Pending

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a voice control method, a voice control device and an electronic device, relates to artificial intelligence, and in particular to the technical fields of speech recognition, large language models, automatic driving and the like. The specific implementation scheme is as follows: receiving voice information and converting the voice information into text content; acquiring control information in a current interface, wherein the control information comprises one or more controls in the current interface; sending the text content and the control information to a language model, and receiving information of a target control output by the language model, wherein the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and triggering an operation for the target control according to the information of the target control. The development efficiency of the application program is improved, the development cost is reduced, and the user experience is improved.

Description

Voice control method and device and electronic equipment
Technical Field
The disclosure relates to the technical fields of speech recognition, large language models, automatic driving and the like in the field of artificial intelligence, and in particular relates to a speech control method, a speech control device and electronic equipment.
Background
In general, an operable control may be displayed on a user interface of an application, and a user may interact with the application by operating the operable control on the application interface using a mouse or a keyboard, or by performing operations such as touch control on the operable control displayed on a touch screen.
In some application scenarios, a user wishes to interact with an application by operating controls on the application interface through voice. At present, schemes for operating controls on an application interface through voice to interact with the application have problems of low development efficiency and high cost, as well as poor user experience caused by the inability to support different spoken expressions.
Disclosure of Invention
The disclosure provides a voice control method, a voice control device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a voice control method, the method comprising: receiving voice information and converting the voice information into text content; acquiring control information in a current interface, wherein the control information comprises one or more controls in the current interface; the text content and the control information are sent to a language model, and information of a target control output by the language model is received; the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and triggering the operation aiming at the target control according to the information of the target control.
According to a second aspect of the present disclosure, there is provided a voice control apparatus including: the receiving unit is used for receiving voice information and converting the voice information into text content; the system comprises an acquisition unit, a control information processing unit and a control processing unit, wherein the acquisition unit is used for acquiring control information in a current interface, and the control information comprises one or more controls in the current interface; the processing unit is used for sending the text content and the control information to a language model and receiving information of a target control output by the language model; the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and the triggering unit is used for triggering the operation aiming at the target control according to the information of the target control.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method provided by the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising: a computer program stored in a readable storage medium, from which it can be read by at least one processor of an electronic device, the at least one processor executing the computer program causing the electronic device to perform the method of the first aspect.
According to the scheme of the disclosure, the received voice information can be converted into text content; the control information of the current interface is acquired, the text content and the control information are sent to a language model, and the language model processes the text content and the control information to determine a target control; the information of the target control output by the language model is received, and the operation on the target control is then triggered. After the voice information is received, the control information of the current interface is obtained and the target control to be controlled is determined according to the text content corresponding to the voice information and the control information, so that the application program of the current interface can have its interface controls operated through voice information without registering the controls and their corresponding voice instructions with a voice application. On the one hand, the development efficiency of application programs supporting voice control can be improved and the development cost reduced; in addition, since the voice instruction for a control is not limited to registered voice instructions, the user can be supported in controlling the control with different spoken expressions, which improves user experience.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the structure of a hint according to the present disclosure;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development of operating systems, the interactive operation between a user and an application program has gradually evolved from the earliest command line operation, to mouse-based visual operation, and then to touch screen operation. In some scenarios, however, the user wishes to interact with the application by voice (spoken language), for example, by operating controls on the application interface through voice.
When the user operates controls on an application interface through voice, the user can directly issue a voice instruction to the electronic device for any control that can be seen on the interface. The electronic device derives the user's intention from the received voice instruction based on speech recognition and natural language processing technologies, finds the best-matching interface control, and operates it automatically, thereby realizing human-machine interaction by voice.
As an implementation scheme for performing man-machine interaction in a voice manner, a control supporting interaction in voice (spoken language) in a third party application program can be registered to a voice application (such as various voice assistants), and corresponding voice instructions can be registered to the voice application. The user may use the registered voice instructions to effect operation of a control registered with the voice application. For example, for an on-screen swipe control, the swipe control may be registered and a voice command "swipe left" registered. The user may speak "slide left" to the electronic device to enable the function of sliding left performed by the sliding control.
The above scheme relies on the third party application program registering its event functions: to operate application controls through voice, the controls supported by the application interface and the corresponding voice instructions need to be registered with the voice application. When the user inputs voice information, the voice application analyzes the user's voice information, matches it with the registered voice instructions, determines the real intention of the user, and then determines the registered control that matches this intention.
Each registered interface control needs to be associated with a specific function implementation interface, that is, with the execution behavior corresponding to the control, button or icon, such as clicking or sliding. For example, the user speaks, through the microphone, specific text content displayed on the current screen; after receiving the voice information, the electronic device performs text and intention analysis through speech recognition, semantic analysis and the like, and then associates the analyzed text content with the function implementation interface of the corresponding control, such as clicking the button.
In the above implementation, registered voice instructions can be used to perform simulated click operations on controls registered in the voice engine. However, it has the following problems. First, when the text content on a control is complex, the voice instruction cannot be executed, and the generalization capability is weak. Second, the user can only speak a registered voice instruction to complete the simulated operation of a control; voice instructions that have not been registered in advance cannot be executed. For example, if the voice instruction "slide left" is registered for a slide control, the user can only say "slide left"; if the user wants to slide to the previous page or the next page, this cannot be achieved through a voice instruction.
Therefore, the above implementation has the following drawbacks. It is highly dependent on the third party application developer registering the controls and the corresponding voice instructions, development efficiency is low, and registration omissions easily occur. In addition, because the interface design of an application often changes, once the interface changes the developer needs to register the controls and voice instructions again; redevelopment is required, a large amount of development resources are consumed, and the maintenance cost is high. Moreover, the user may use various spoken expressions for the same intention, while this scheme can only support registered voice instructions and cannot adapt to different spoken expressions.
The disclosure provides a voice control method, a voice control device and an electronic device, applied to the technical fields of voice technology, automatic driving and the like in the field of artificial intelligence, so as to improve the development efficiency of application programs supporting voice control, reduce the maintenance cost of the application program, adapt to different spoken expressions and improve user experience.
Referring to fig. 1, fig. 1 is a schematic diagram of a first embodiment of the disclosure, and as shown in fig. 1, a voice control method provided in this embodiment includes the following steps:
S101, receiving voice information and converting the voice information into text content.
The execution body of this embodiment may be a terminal device, and specifically may be a voice application (e.g., a voice assistant) running in the terminal device. The execution body may also be a server, for example, a server that provides services to a voice assistant. The terminal device may be a mobile terminal, or a fixed terminal device such as a desktop computer.
A plurality of Applications (APPs) may be installed in the terminal device.
The user may input voice information to the terminal device. In some application scenarios, a user may wake up a voice application in a terminal device before inputting voice information to the terminal device.
After receiving the voice information input by the user, the execution body can perform voice recognition through various voice recognition methods to obtain the text content corresponding to the voice information. The voice recognition methods include, but are not limited to: model matching methods (including vector quantization and dynamic time warping), probability statistical methods (Gaussian mixture models, hidden Markov models), and discriminative classification methods (including support vector machines, artificial neural networks, and deep neural networks).
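As an editorial illustration only, the following is a minimal sketch of step S101 on an Android-style terminal device, assuming the platform SpeechRecognizer is used as the speech-to-text engine; the disclosure does not prescribe a particular recognizer, and a server-side recognizer or any of the methods listed above could be used instead. The onTextContent callback name is illustrative.

```kotlin
import android.content.Context
import android.content.Intent
import android.os.Bundle
import android.speech.RecognitionListener
import android.speech.RecognizerIntent
import android.speech.SpeechRecognizer

// Receives voice information and hands the best recognition hypothesis (the text content)
// to the caller, corresponding to step S101. Requires the RECORD_AUDIO permission and
// should be called on the main thread.
fun startRecognition(context: Context, onTextContent: (String) -> Unit) {
    val recognizer = SpeechRecognizer.createSpeechRecognizer(context)
    recognizer.setRecognitionListener(object : RecognitionListener {
        override fun onResults(results: Bundle?) {
            results?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
                ?.firstOrNull()
                ?.let(onTextContent)
        }
        override fun onError(error: Int) { /* the user may be reminded to speak again */ }
        // The remaining callbacks are not needed for this sketch.
        override fun onReadyForSpeech(params: Bundle?) {}
        override fun onBeginningOfSpeech() {}
        override fun onRmsChanged(rmsdB: Float) {}
        override fun onBufferReceived(buffer: ByteArray?) {}
        override fun onEndOfSpeech() {}
        override fun onPartialResults(partialResults: Bundle?) {}
        override fun onEvent(eventType: Int, params: Bundle?) {}
    })
    recognizer.startListening(Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
        putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL, RecognizerIntent.LANGUAGE_MODEL_FREE_FORM)
    })
}
```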
S102, acquiring control information in the current interface, wherein the control information comprises one or more controls in the current interface.
The current interface refers to an interface being displayed in the terminal device or a current active window of the terminal device when voice information is received.
The current interface may be a system interface of an operating system running in the terminal device, or may be an interface of a third party application running currently.
The third party application currently running may be a desktop application, a mobile application, or other type of application.
After receiving the voice information, the executing body can acquire the control information in the current interface in various modes. The controls indicated by the control information may include one or more controls in the current interface.
Controls are user interface elements, such as buttons, tabs, or check boxes, for displaying content and supporting interactions. The control has control properties and scripting logic that operates and interacts with the control properties. The control attributes described above determine the appearance of the control, and typically include the name, shape, display style, font color, etc. of the control.
The script logic of a control may include an event (response function). An event of a control refers to a response to an operation on it. A control can have its own set of events; once a certain event of the control occurs, the corresponding event procedure is executed. Each event object has its own specific name, and the event procedure code can be written in advance as required.
In some embodiments, control information on the application interface may be identified through UI Automation (UIA) technology that is native to the operating system.
As an implementation manner, the step S102 includes: and acquiring control information in the current interface based on an optical character recognition algorithm.
Specifically, an image corresponding to the current interface may be acquired, and the text or identifiers in the image are then extracted using an optical character recognition (Optical Character Recognition, OCR) algorithm. The controls in the current interface may then be identified based on the recognized text content and/or identifiers, their locations, and other image information. In some application scenarios, the position of a control in an application interface, or the text content or identifier on the control, is relatively fixed; in these scenarios, control elements on the interface can be identified according to location, text content or identifiers.
In some application scenarios, the OCR-recognized interface element information may be input into a pre-trained control recognition model, which identifies a plurality of controls in the interface from the input interface element information.
In the implementation manner, the control in the application program interface can be acquired in different application scenes.
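As an illustrative sketch of this OCR-based implementation (assuming Kotlin on an Android-style device; runOcr stands in for any OCR engine or control recognition model, which the disclosure does not specify):

```kotlin
import android.graphics.Bitmap
import android.graphics.Rect

// Text recognized by OCR together with its bounding box on the screenshot.
data class OcrBox(val text: String, val bounds: Rect)

// A control inferred from the OCR result: identifier, displayed text and screen region.
data class OcrControl(val id: Int, val text: String, val bounds: Rect)

// Builds control information for the current interface from a screenshot.
// runOcr is a hypothetical adapter around whichever OCR engine is used.
fun controlsFromScreenshot(
    screenshot: Bitmap,
    runOcr: (Bitmap) -> List<OcrBox>
): List<OcrControl> =
    runOcr(screenshot)
        .filter { it.text.isNotBlank() }
        // In many interfaces the position and text of a control are relatively fixed,
        // so text boxes can be mapped to controls by content and location.
        .mapIndexed { index, box -> OcrControl(id = index, text = box.text, bounds = box.bounds) }
```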
After the control in the current interface is obtained based on the optical character recognition algorithm, control information may be generated.
The control information may include information of one or more controls in the current interface. The information of a control comprises a control identification, attribute information of the control, the function interface of the event corresponding to the control, and the like.
The information of each control includes an identification of the control (e.g., an identity (ID) number of the control), the text content or identifier displayed on the control, a function interface, parameter information, and so forth.
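One possible in-memory shape for this control information is sketched below; the field names are illustrative and not mandated by the disclosure.

```kotlin
// Information of a single control in the current interface (step S102).
data class ControlInfo(
    val id: Int,                                      // identification (ID) of the control
    val text: String?,                                // text content or identifier displayed on the control
    val functionInterface: String?,                   // function interface of the event, e.g. "click" or "scroll"
    val parameters: Map<String, String> = emptyMap()  // parameter information of the function interface
)

// The control information of the current interface: the collection of all control entries.
data class InterfaceControlInfo(val controls: List<ControlInfo>)
```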
S103, sending the text content and the control information to a language model, and receiving information of a target control output by the language model; the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content.
The language model herein may be various types of models capable of processing natural language, such as a large language model (Large Language Model, LLM), etc. As an implementation manner, the language model may match the input information of the plurality of controls with the text content, and determine the target control according to the matching result.
The execution body may input the control information and the text content into the language model. After receiving them, the language model can process the information of the plurality of controls in the control information together with the text content to determine the target control matched with the text content. The language model may then output the information of the target control to the execution body.
The information of the target control comprises an identifier, a position and the like corresponding to the target control. In some embodiments, the voice information further includes parameter information, and the information of the target control may further include the parameter information extracted from the voice information. Illustratively, if the voice information includes "slide left by 50%", the information of the target control output by the language model comprises the identification of the target control and the parameter of sliding 50% to the left.
S104, triggering the operation for the target control according to the information of the target control.
Specifically, an application program corresponding to the current interface provides an application program interface (Application Programming Interface, API) for performing voice control on a control of the application program in advance to an operating system.
As an implementation manner, after determining the target control, the execution body may send an instruction for operating the target control to the operating system, and the operating system transmits the instruction to the interface to implement the related operation on the target control in the application program, so that the same response effect as that of the manually operated interface control is obtained through voice control.
Alternatively, the execution body may apply to the operating system for the use authority of the API interface, and then write the information of the target control to the API interface. After receiving the information of the target control, the application program can execute the operation for the target control.
The voice-triggered operations for the target control include one or more of the following: click, slide, scroll, change pages, open new pages, etc.
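A minimal sketch of step S104 is given below, assuming the target control was located through the accessibility tree so that an AccessibilityNodeInfo node is at hand; an application may equally expose its own API for receiving the target control information, as described above.

```kotlin
import android.view.accessibility.AccessibilityNodeInfo

// Triggers a simple operation on the target control; sliding by a given distance is
// handled separately, e.g. with a dispatched gesture (a sketch appears later in this description).
fun triggerOperation(targetNode: AccessibilityNodeInfo, operation: String): Boolean =
    when (operation) {
        "click"  -> targetNode.performAction(AccessibilityNodeInfo.ACTION_CLICK)
        "scroll" -> targetNode.performAction(AccessibilityNodeInfo.ACTION_SCROLL_FORWARD)
        else     -> false
    }
```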
In the embodiment, voice information is received, and the voice information is converted into text content; acquiring control information in a current interface, wherein the control information comprises one or more controls in the current interface; the text content and the control information are sent to the language model, and information of a target control output by the language model is received; the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and triggering the operation aiming at the target control according to the information of the target control. After the voice information is received, the control information of the current interface is obtained, the target control to be controlled is determined according to the text content and the control information corresponding to the voice information, and the operation of the target control is triggered, so that an application program of the current interface can realize the operation of the control in the application program interface through the voice information without registering the control and the voice instruction corresponding to the control to the voice application. The development efficiency of the application program supporting voice control can be improved, and the development and maintenance cost of the application program can be reduced; in addition, the voice instruction of the control is not limited to the registered voice instruction, so that the user can be supported to control the same control by using different spoken expressions, and the user experience is improved.
Referring to fig. 2, fig. 2 is a schematic diagram of a second embodiment of the disclosure, and as shown in fig. 2, the voice control method provided in the present embodiment includes the following steps:
S201, receiving voice information and converting the voice information into text content.
In this embodiment, the execution subject of the voice control method may be a terminal device, and specifically may be a voice assistant running in the terminal device.
The implementation of step S201 may refer to the relevant description of step S101 in the embodiment shown in fig. 1, which is not repeated here.
S202, acquiring control information in a current interface based on an auxiliary function provided by the system.
The auxiliary function is also known as the barrier-free or accessibility service (Accessibility Service). The accessibility service includes a set of system-level APIs that can simulate operations. After the user authorizes the execution body to use the accessibility service, operations can be simulated and the control information of the current interface can be obtained.
Specifically, for example, after a voice assistant application running in the terminal device is started, the voice assistant may apply to the operating system for the right to open the accessibility service for it. After the accessibility service authority is obtained, a VoiceAccessibilityService inherited from AccessibilityService is started, and the voice assistant then has the authority to obtain the User Interface (UI) elements and the corresponding node information in the interface.
In some embodiments, the step S202 includes the following steps:
Firstly, acquiring the root node of the current interface view based on the auxiliary function;
Secondly, traversing all the child nodes under the root node of the current interface view, and determining the controls in the current interface according to the description information of the child nodes.
After the rights are acquired, the execution subject may acquire root node information of the current interface.
After the root node of the current interface view is obtained, as the interface element nodes adopt a tree structure, the information of all child nodes contained in the root node can be obtained in a recursion traversal mode.
Text description information of all the nodes and area information of the child node views are acquired, and the controls of the current interface are determined according to the text description information.
The root node of the current interface may be obtained using the getRootInActiveWindow() method of the AccessibilityService class. From the AccessibilityNodeInfo object returned by this method, the childCount attribute (indicating the number of child nodes of the AccessibilityNodeInfo object in the hierarchy) can be obtained. All child nodes can then be visited by iterative traversal, and the information of each child node obtained. Control data such as the class name (className), text (Text), resource name of the source view (viewIdResourceName), location information, description information (description), and the clickable (isClickable), scrollable (isScrollable) and editable (isEditable) attributes can be extracted from the information of a child node.
The class name of the control represents the type of control. The text may be text content displayed on the control. The description information of the control includes information such as a function that the control can perform, and illustratively, the description information of a control may be "this is a close button" or the like.
Whether the control can be clicked (isClickable), scrolled (isScrollable) or edited (isEditable) is information about how the control can be used.
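The traversal described above can be sketched as follows, assuming an Android-style accessibility service; ControlNode is an illustrative holder for the extracted control data and is not a name used by the disclosure.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.graphics.Rect
import android.view.accessibility.AccessibilityEvent
import android.view.accessibility.AccessibilityNodeInfo

// Control data extracted from one node of the interface view tree.
data class ControlNode(
    val className: String?,
    val text: String?,
    val viewIdResourceName: String?,
    val description: String?,
    val bounds: Rect,
    val clickable: Boolean,
    val scrollable: Boolean,
    val editable: Boolean
)

class VoiceAccessibilityService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent?) { /* not needed for this sketch */ }
    override fun onInterrupt() {}

    // Acquires the root node of the current interface view and recursively traverses all child nodes.
    fun collectControls(): List<ControlNode> {
        val root = rootInActiveWindow ?: return emptyList()
        val controls = mutableListOf<ControlNode>()
        traverse(root, controls)
        return controls
    }

    private fun traverse(node: AccessibilityNodeInfo, out: MutableList<ControlNode>) {
        val bounds = Rect().also { node.getBoundsInScreen(it) }
        out += ControlNode(
            className = node.className?.toString(),
            text = node.text?.toString(),
            viewIdResourceName = node.viewIdResourceName,
            description = node.contentDescription?.toString(),
            bounds = bounds,
            clickable = node.isClickable,
            scrollable = node.isScrollable,
            editable = node.isEditable
        )
        for (i in 0 until node.childCount) {
            node.getChild(i)?.let { traverse(it, out) }
        }
    }
}
```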
In these embodiments, by acquiring the root node of the current interface view and then traversing all the child nodes of the root node, all the controls in the current interface can be acquired, so that the controls in the current interface cannot be missed, and the accuracy of determining the target controls according to the text content corresponding to the voice information can be improved.
In the step S202, the control information of the current interface can be obtained quickly through the auxiliary function. The control information includes one or more controls.
And S203, generating prompt information comprising text content, control information and task information, wherein the task information is used for indicating to determine a target control matched with the text content.
S204, inputting the prompt information into the language model, and receiving information of the target control which is output after the prompt information is processed by the language model.
The text content, the control information and the task information can be integrated according to a preset format, so that Prompt information (Prompt) is obtained.
In some embodiments, a prompt template may be preset, in which a text content fill area and a control information fill area may be set. Text content obtained by conversion of voice information can be filled into a text content filling area, and control information can be filled into a control information filling area, so that prompt information can be obtained.
It will be appreciated that the prompt information template described above may also include a task information filling area.
The task information may be a task or command that the language model is to accomplish. In the present disclosure, the task information may, for example, indicate "determine the control corresponding to the text content". The task information is included in the prompt information so that the language model can understand the intention of the user.
That is, the prompt information may include task information and data required for completing a task indicated by the task information.
In the prompt information, the text content and the control information may be arranged after the task information. The text content is generated according to the voice information input by the user, and the control information is the control information obtained from the current interface when the voice information is received. The text content and the control information are the actual data processed by the language model.
The text content, the control information and the task information are filled into the prompt information template, so that the language model can conveniently identify and extract the task information, the text content and the control information from the prompt information. This is beneficial to improving the efficiency of determining the target control from the text content and the control information.
As a schematic illustration, the control information in the prompt includes one or more of the following:
control identification, control text, function interfaces and parameter information.
The control information may include an identification of each control in the current interface (e.g., an identity (ID) number of the control), the text content displayed in the control, the function interface of the event corresponding to the control, the corresponding parameter information of the control, and so on.
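A sketch of assembling such prompt information is shown below; the plain-text rendering of the filling areas and the line format for each control are illustrative design choices, not a format fixed by the disclosure.

```kotlin
// A control entry as written into the control information filling area of the prompt.
data class PromptControl(val id: Int, val text: String?, val functionInterface: String?, val parameters: String?)

// Fills the prompt information template with task information, control information,
// an optional example and the user's text content (steps S203/S204).
fun buildPrompt(
    taskInfo: String,
    controls: List<PromptControl>,
    example: String?,
    userText: String
): String {
    val controlLines = controls.joinToString("\n") { c ->
        "id=${c.id}; text=${c.text ?: ""}; interface=${c.functionInterface ?: ""}; params=${c.parameters ?: ""}"
    }
    return buildString {
        appendLine(taskInfo)                                     // task information filling area
        appendLine("Controls on the current screen:")            // control information filling area
        appendLine(controlLines)
        example?.let { appendLine("Example:"); appendLine(it) }  // optional reference example
        appendLine("User question: $userText")                   // text content filling area
        appendLine("Output only the matched control id and the function interface parameter values.")
    }
}
```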
The execution main body can input the prompt information into a language model, and the language model processes the text content and the control information according to the task information in the prompt information, so that a target control is determined from a plurality of controls included in the control information.
In some embodiments, the prompt information further includes processing logic for processing the text content and the control information to obtain the target control, and/or an example, the example including input information (comprising reference text content and reference control information) and output information of a reference target control corresponding to the reference text content.
In some application scenarios, the prompt information includes processing logic for processing the text content and the control information to obtain the target control. The processing logic includes one or more of the following: (1) one or more steps required for processing the text content and the control information to obtain the target control, and the execution order of the steps; (2) rules for matching the text content with the control information; (3) a criterion for determining the target control from the plurality of controls.
In these application scenarios, the processing logic prompts the language model to process the text content and the control information according to that logic, thereby obtaining the target control. Wrong answers caused by hallucinations of the large model can be avoided.
In other application scenarios, the prompt information may further include an example, the example including input information (comprising reference text content and reference control information) and output information of a reference target control corresponding to the reference text content.
In these application scenarios, the above examples may include a complete input-output pair. For example, the input information includes information referencing text content, one or more reference controls. The output information includes a reference target control corresponding to the reference text content. The reference target control belongs to the one or more reference controls. The input/output pairs may include correct input/output pairs or incorrect input/output pairs.
The above examples are given in the prompt message, so that the language model learns the required knowledge through the above examples so as to better understand the task, thereby improving the accuracy of output.
After the language model receives the prompt information, the text content and the control information in the prompt information can be processed according to the task information to obtain the target control matched with the text content. The language model may output the information of the target control. The information of the target control comprises: the identification of the target control, the area of the target control in the current interface, the usage information of the target control, the parameter value of the function interface corresponding to the control, and the like.
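Receiving the output of the language model can be sketched as follows, assuming the output format requirement constrains the model to a simple key=value line format; the concrete format is an illustrative assumption rather than something fixed by the disclosure.

```kotlin
// Information of the target control as parsed from the language model output.
data class TargetControlInfo(val id: Int, val functionInterface: String?, val parameterValue: String?)

// Parses lines such as "id=3", "interface=click", "param=50%"; returns null when the
// model indicates that no target control is matched.
fun parseModelOutput(output: String): TargetControlInfo? {
    val fields = output.lines().mapNotNull { line ->
        val parts = line.split("=", limit = 2)
        if (parts.size == 2) parts[0].trim() to parts[1].trim() else null
    }.toMap()
    val id = fields["id"]?.toIntOrNull() ?: return null
    return TargetControlInfo(id, fields["interface"], fields["param"])
}
```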
S205, triggering the operation for the target control according to the information of the target control.
The execution body can trigger the application program to initiate the operation for the target control through an API provided by the application program where the current interface is located.
For example, for a click type control, a click operation may be triggered to be performed on the target control, and the application program may display a response interface according to the click operation on the target control.
In this embodiment, the control information on the current interface is obtained through the auxiliary function provided by the system, prompt information including the control information, the text content and the task information is generated, and the prompt information is processed by the language model to obtain the target control. The control information of the current interface can be acquired rapidly through the auxiliary function; in addition, by generating prompt information comprising the task information, the control information and the text content, the language model can process the control information and the text content according to the task information in the prompt information to obtain the target control, and the process of training a model can be saved, so that the development cost of controlling controls through voice is further reduced.
In some embodiments of the voice control method shown in fig. 1 and fig. 2, the voice information further includes control parameter information, and the information of the target control output by the language model further includes: control parameter information; and the step S104 or the step S205 further includes:
And triggering the operation aiming at the target control according to the control parameter information.
In these embodiments, the voice information may further include control parameter information for operating the control. The language model may extract the control parameter information from the voice information when processing text content and control information corresponding to the voice information. After determining the target control according to the text content and the control information, the language model can output the identification of the target control, the corresponding function interface and the control parameters.
After receiving the identification of the target control, the corresponding function interface and the control parameter, the execution body can transmit the identification of the target control, the corresponding function interface and the control parameter to the application program through an API interface provided by the application program. The application program can respond according to the identification of the target control, the corresponding function interface and the control parameter.
For example, the control corresponding to the voice information is a sliding control, and the voice information includes control parameters such as a sliding direction and a sliding distance; illustratively, the control parameters in the voice information indicate sliding the screen to the right by half.
The language model may extract the control parameters from the voice information; when the language model outputs the information of the target control to the execution body, it may output the identification of the sliding control and the control parameter (slide the screen 50% to the right).
The execution body can transmit the identification of the sliding control and the control parameter of sliding the screen 50% to the right to the application program through the API interface, and the application program completes the sliding operation of the sliding control according to the identification and the control parameter. In these embodiments, by triggering the operation of the target control according to the control parameters in the voice information, the accuracy of triggering the operation of the target control according to the user's voice information can be improved.
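As one hedged illustration of such a parameterized operation, the sliding may also be carried out by the accessibility service itself with a dispatched gesture; the disclosure equally allows the application program to perform the slide through its own API, and the helper below is only a sketch under that assumption.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.accessibilityservice.GestureDescription
import android.graphics.Path
import android.graphics.Rect

// Slides rightward across the target control by the given fraction of its width,
// e.g. fraction = 0.5f for "slide the screen 50% to the right".
// Requires a service with gesture dispatch capability (API 24+).
fun AccessibilityService.slideRight(controlBounds: Rect, fraction: Float) {
    val path = Path().apply {
        moveTo(controlBounds.exactCenterX(), controlBounds.exactCenterY())
        lineTo(controlBounds.exactCenterX() + controlBounds.width() * fraction, controlBounds.exactCenterY())
    }
    val gesture = GestureDescription.Builder()
        .addStroke(GestureDescription.StrokeDescription(path, 0L, 300L))  // a 300 ms swipe
        .build()
    dispatchGesture(gesture, null, null)
}
```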
In some embodiments of the voice control method shown in fig. 1 and 2, the voice control method further includes the following steps:
Playing preset reminding information in response to receiving information, output by the language model, indicating that no target control is matched; the preset reminding information is used for reminding the user to input the voice information again or reminding the user that the voice information is not supported.
In these embodiments, if the language model outputs information indicating that no target control is matched, the reminding information may be played. The reminding information may include voice information for reminding the user to make a voice input again, for example, "I did not catch that, please say it again"; information indicating that the input voice information is not supported may also be played, for example, "the input voice information cannot be executed".
By playing the reminding information when no target control can be found, the user can learn the processing progress of the voice information in time and adjust the input voice information, so that the controls can still be operated through voice information, which improves the user experience.
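A small sketch of playing the preset reminding information, assuming the platform TextToSpeech engine is used for playback (any other audio prompt mechanism would serve equally well; the reminder text passed in is illustrative):

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech

// Plays a spoken reminder when the language model reports that no target control is matched.
fun playReminder(context: Context, message: String) {
    var tts: TextToSpeech? = null
    tts = TextToSpeech(context) { status ->
        if (status == TextToSpeech.SUCCESS) {
            tts?.speak(message, TextToSpeech.QUEUE_FLUSH, null, "voice-control-reminder")
        }
    }
}
```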
Referring to fig. 3, fig. 3 is a schematic diagram of a third embodiment of the disclosure, and as shown in fig. 3, the embodiment provides a schematic flow chart of a voice control method.
The terminal device can run application programs, which may be of various kinds. A voice application, such as a voice assistant, may also run in the terminal device. In some application scenarios, the voice application may be started by a manual operation of the user, or may be awakened by voice. After the voice application is started, it can apply to the system for the auxiliary function authority. After the user grants the authorization, the voice application may start the auxiliary service 301, for example a VoiceAccessibilityService inherited from the AccessibilityService implementation.
The user may input the voice information 302, and after the user inputs the voice information, the voice application may perform voice recognition on the input voice information 308, and obtain text content 309 of the voice information according to the voice recognition.
While the user inputs the voice information, the voice application is triggered to acquire the root node 303 of the current interface view based on the auxiliary function. After the root node of the current interface view is obtained, the voice application may recursively traverse all child nodes 304 under the root node of the current interface view. The voice application may obtain information 305 such as the description information and area information of all child nodes. The voice application may also obtain the function 306 corresponding to each control. The voice application may generate the control information 307 of the current interface according to the description information and area information of all child nodes in the current view interface and the function information corresponding to each control.
After the control information is obtained, the voice application may generate prompt 310 that includes the control information, text content, and task information. The speech application may input the prompt information to a language model. And processing the text content and the control information in the prompt information by the language model to obtain the target control. The language model described above may output the processing result 311.
The voice application may determine whether the processing result indicates that a target control on the current interface is hit 312. If it is determined that the processing result indicates that no target control is matched, the voice application may play the reminding information 313. The reminding information is used for reminding the user to re-input the voice information or reminding the user that the current voice information cannot be executed.
If it is determined that the processing result indicates that a target control is matched, the processing result further includes the information of the target control. The information of the target control comprises: the identification of the target control, its region information in the current interface, the function interface and parameters of the target control, and the like. Upon receiving the above processing result, the voice application may trigger execution of the preset operation 314 on the target control.
According to the scheme provided in this embodiment, the application program does not need to register the controls on its interface and the corresponding voice instructions with the voice application, yet the controls on the application interface can still be operated through voice. When the interface content of the application program changes, the controls on the interface do not need to be re-registered with the voice application, nor do the corresponding voice instructions, which improves the update efficiency of the application program and the user experience.
As shown in fig. 4, fig. 4 is a schematic structural diagram of the prompt information provided in the present disclosure. The prompt information (prompt) 40 generated in step 310 is composed of five parts: the first part is the task information 401, the second part is the information 402 of the controls currently on the screen, the third part is the given example information 403, the fourth part is the user's question 404 (the text content corresponding to the voice information), and the fifth part is the output format requirement 405. The output format is constrained by the output format requirement, which may include the identification of the control.
As an exemplary illustration, the task information described above is as follows:
"given the description information of page control elements on some screens, the description information comprises the ID of the identity number, text content, function interface and parameter information of each control element, and simultaneously gives a user question, predicts the ID of the identity number most likely to match the user question on the screen, and gives the value of the corresponding function interface parameter".
FIG. 4 is a schematic diagram of a prompt message for prompting a language model to resolve a real intent in a voice message.
Fig. 5 is a schematic diagram of a fourth embodiment of the present disclosure, and as shown in fig. 5, a voice control apparatus 500 provided in this embodiment includes: a receiving unit 501, an acquisition unit 502, a processing unit 503 and a triggering unit 504, wherein,
A receiving unit 501 for receiving voice information and converting the voice information into text content;
an obtaining unit 502, configured to obtain control information in a current interface, where the control information includes one or more controls in the current interface;
the processing unit 503 is configured to send the text content and the control information to the language model, and receive the information of the target control output by the language model; the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content;
And the triggering unit 504 is configured to trigger an operation for the target control according to the information of the target control.
In some embodiments, the processing unit 503 includes a generation module 5031 and an input module 5032; wherein,
The generating module 5031 is configured to: generating prompt information comprising text content, control information and task information, wherein the task information is used for indicating to determine a target control matched with the text content;
The input module 5032 is configured to: and inputting the prompt information into the language model, and receiving the information of the target control which is output after the language model processes the prompt information.
In some embodiments, the control information in the hint information includes one or more of:
control identification, control text, function interfaces and parameter information.
In some embodiments, the hint information further includes:
Processing the text content and the control information to obtain processing logic of the target control, and/or,
Examples of input information including reference text content, reference control information, and output information of a reference target control corresponding to the reference text content are included.
In some embodiments, the acquisition unit 502 includes a first acquisition module 5021 or a second acquisition module 5022; wherein,
The first obtaining module 5021 is configured to: acquiring control information in a current interface based on an auxiliary function provided by a system;
The second obtaining module 5022 is configured to: and acquiring control information in the current interface based on an optical character recognition algorithm.
In some embodiments, the first acquisition module 5021 is further to:
acquiring a root node of a current interface view based on an auxiliary function;
traversing all the child nodes under the root node of the current interface view, and determining the control in the current interface according to the description information of the child nodes.
In some embodiments, the voice information further includes control parameter information, and the information of the target control output by the language model further includes: control parameter information; and the triggering unit 504 includes a triggering module 5041, the triggering module 5041 being configured to:
And triggering the operation aiming at the target control according to the control parameter information.
In some embodiments, the apparatus further comprises a playback unit 505;
the playing unit 505 is configured to: playing preset reminding information in response to receiving information which is output by the language model and does not match with the target control; the preset reminding information is used for reminding the user to input the voice information again or reminding the user that the voice information is not supported.
Fig. 6 is a schematic diagram of a fifth embodiment of the present disclosure, and as shown in fig. 6, an electronic device 600 in the present embodiment may include: a processor 601 and a memory 602.
A memory 602 for storing a program; the memory 602 may include a volatile memory, for example a random-access memory (RAM) such as a static random-access memory (SRAM) or a double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include a non-volatile memory, such as a flash memory. The memory 602 is used to store computer programs (e.g., application programs, functional modules, etc. that implement the above method), computer instructions, etc., which may be stored in one or more memories 602 in a partitioned manner, and the above computer programs, computer instructions, data, etc. may be invoked by the processor 601.
A processor 601 for executing a computer program stored in a memory 602 to implement the steps of the method according to the above embodiment.
Reference may be made in particular to the description of the embodiments of the method described above.
The processor 601 and the memory 602 may be separate structures or may be integrated structures integrated together. When the processor 601 and the memory 602 are separate structures, the memory 602 and the processor 601 may be coupled by a bus 603.
The electronic device in this embodiment may execute the technical scheme in the above method, and the specific implementation process and the technical principle are the same, which are not described herein again.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the solution provided by any one of the above embodiments.
According to an embodiment of the present disclosure, the present disclosure also provides a computer program product comprising: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
FIG. 7 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. The electronic device 700 may be a terminal device or a server. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the apparatus 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
Various components in device 700 are connected to I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, etc.; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, an optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the respective methods and processes described above, for example, the voice control method. For example, in some embodiments, the voice control method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the voice control method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the voice control method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that the various forms of flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (19)

1. A voice control method, comprising:
receiving voice information and converting the voice information into text content;
acquiring control information in a current interface, wherein the control information comprises one or more controls in the current interface;
sending the text content and the control information to a language model, and receiving information of a target control output by the language model, wherein the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and
triggering an operation for the target control according to the information of the target control.
2. The method of claim 1, wherein sending the text content and the control information to the language model and receiving the information of the target control output by the language model comprises:
generating prompt information comprising the text content, the control information, and task information, wherein the task information is used for indicating that a target control matched with the text content is to be determined; and
inputting the prompt information into the language model, and receiving the information of the target control output after the language model processes the prompt information.
3. The method of claim 2, wherein the control information in the prompt information comprises one or more of:
control identification, control text, function interfaces and parameter information.
4. The method of claim 2, wherein the prompt information further comprises:
processing logic for obtaining the target control by processing the text content and the control information; and/or
an example comprising input information and output information, wherein the input information comprises reference text content and reference control information, and the output information comprises information of a reference target control corresponding to the reference text content.
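As an illustrative sketch of how such prompt information might be assembled (the JSON layout and wording below are assumptions made for illustration, not a prompt format defined in this disclosure):

```python
import json

def build_prompt(text_content, controls, example=None):
    # Task information: instructs the model to determine the target control
    # that matches the user's text content.
    task = ("From the control list below, determine the target control that "
            "matches the user's instruction and return its identifier as JSON.")
    parts = [task, "Controls: " + json.dumps(controls, ensure_ascii=False)]
    if example is not None:
        # Optional reference example: reference text content, reference control
        # information, and the expected target-control output.
        parts.append("Example: " + json.dumps(example, ensure_ascii=False))
    parts.append("Instruction: " + text_content)
    return "\n".join(parts)

prompt = build_prompt(
    "turn the volume up a little",
    [{"id": "volume_up", "text": "Volume +"}, {"id": "volume_down", "text": "Volume -"}],
    example={"instruction": "open the map",
             "controls": [{"id": "nav", "text": "Navigation"}],
             "target": {"id": "nav"}},
)
```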
5. The method of claim 1, wherein acquiring the control information in the current interface comprises:
acquiring the control information in the current interface based on an auxiliary function provided by a system; or
acquiring the control information in the current interface based on an optical character recognition algorithm.
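For the optical character recognition route, one possible sketch uses the pytesseract library (an assumption for illustration; the disclosure does not name a specific OCR engine) to turn recognized text regions of a screenshot into candidate controls:

```python
import pytesseract
from PIL import Image

def controls_from_screenshot(path):
    # Run OCR over a screenshot of the current interface and treat each
    # recognized text region as a candidate control with its bounding box.
    data = pytesseract.image_to_data(Image.open(path),
                                     output_type=pytesseract.Output.DICT)
    controls = []
    for i, text in enumerate(data["text"]):
        if text.strip():
            controls.append({
                "text": text,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return controls
```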
6. The method of claim 5, wherein acquiring the control information in the current interface based on the auxiliary function provided by the system comprises:
acquiring a root node of the current interface view based on an auxiliary function;
traversing all the child nodes under the root node of the current interface view, and determining the controls in the current interface according to the description information of the child nodes.
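A minimal sketch of such a traversal; the node attributes used here (children, resource_id, text, class_name, is_clickable) are assumptions standing in for a platform accessibility API such as Android's AccessibilityNodeInfo:

```python
def collect_controls(root):
    # Depth-first traversal starting from the root node of the current interface view.
    controls = []
    stack = [root]
    while stack:
        node = stack.pop()
        # Keep child nodes whose description information marks them as operable controls.
        if getattr(node, "is_clickable", False):
            controls.append({
                "id": getattr(node, "resource_id", None),
                "text": getattr(node, "text", ""),
                "class": getattr(node, "class_name", ""),
            })
        stack.extend(getattr(node, "children", []))
    return controls
```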
7. The method of claim 1, wherein the voice information further comprises control parameter information, and the information of the target control output by the language model further comprises the control parameter information; and triggering the operation for the target control according to the information of the target control comprises:
triggering the operation for the target control according to the control parameter information.
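For a parameterized instruction such as "set the volume to 50", the model output could carry the parameter alongside the target control; the field names below are illustrative assumptions:

```python
# Hypothetical model output for the instruction "set the volume to 50".
model_output = {"control_id": "volume_slider", "parameters": {"value": 50}}

def trigger_with_parameters(interface, output):
    # Trigger the operation for the target control using the control parameter information.
    interface.trigger_control(output["control_id"], output.get("parameters", {}))
```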
8. The method according to any one of claims 1-7, further comprising:
playing preset reminding information in response to receiving, from the language model, information indicating that no target control matches the text content, wherein the preset reminding information is used for reminding the user to input the voice information again or for reminding the user that the voice information is not supported.
9. A voice control apparatus comprising:
a receiving unit, used for receiving voice information and converting the voice information into text content;
an acquisition unit, used for acquiring control information in a current interface, wherein the control information comprises one or more controls in the current interface;
a processing unit, used for sending the text content and the control information to a language model and receiving information of a target control output by the language model, wherein the language model is used for outputting information of a target control matched with the text content according to the input control information and the text content; and
a triggering unit, used for triggering an operation for the target control according to the information of the target control.
10. The apparatus of claim 9, wherein the processing unit comprises a generation module and an input module; wherein,
the generation module is used for generating prompt information comprising the text content, the control information, and task information, wherein the task information is used for indicating that a target control matched with the text content is to be determined; and
the input module is used for inputting the prompt information into the language model and receiving the information of the target control output after the language model processes the prompt information.
11. The apparatus of claim 10, wherein the control information in the prompt information comprises one or more of:
control identification, control text, function interfaces and parameter information.
12. The apparatus of claim 10, wherein the prompt information further comprises:
processing logic for obtaining the target control by processing the text content and the control information; and/or
an example comprising input information and output information, wherein the input information comprises reference text content and reference control information, and the output information comprises information of a reference target control corresponding to the reference text content.
13. The apparatus of claim 9, wherein the acquisition unit comprises a first acquisition module or a second acquisition module; wherein,
the first acquisition module is used for acquiring control information in a current interface based on an auxiliary function provided by a system;
the second acquisition module is used for acquiring control information in the current interface based on an optical character recognition algorithm.
14. The apparatus of claim 13, wherein the first acquisition module is further configured to:
acquiring a root node of the current interface view based on an auxiliary function;
traversing all the child nodes under the root node of the current interface view, and determining the controls in the current interface according to the description information of the child nodes.
15. The apparatus of claim 9, wherein the voice information further comprises control parameter information, and the information of the target control output by the language model further comprises the control parameter information; and the triggering unit comprises a triggering module, wherein the triggering module is used for:
triggering the operation for the target control according to the control parameter information.
16. The apparatus according to any one of claims 9-15, further comprising a playing unit;
the playing unit is used for playing preset reminding information in response to receiving, from the language model, information indicating that no target control matches the text content, wherein the preset reminding information is used for reminding the user to input the voice information again or for reminding the user that the voice information is not supported.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of claims 1-8.
CN202410295204.6A 2024-03-14 2024-03-14 Voice control method and device and electronic equipment Pending CN118197305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410295204.6A CN118197305A (en) 2024-03-14 2024-03-14 Voice control method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410295204.6A CN118197305A (en) 2024-03-14 2024-03-14 Voice control method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN118197305A true CN118197305A (en) 2024-06-14

Family

ID=91404549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410295204.6A Pending CN118197305A (en) 2024-03-14 2024-03-14 Voice control method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN118197305A (en)

Similar Documents

Publication Publication Date Title
US11281862B2 (en) Significant correlation framework for command translation
JP7264957B2 (en) Voice interaction method, device, electronic device, computer readable storage medium and computer program
CN116303962B (en) Dialogue generation method, training method, device and equipment for deep learning model
WO2023142451A1 (en) Workflow generation methods and apparatuses, and electronic device
JP7395553B2 (en) Text translation methods, devices, electronic devices and storage media
JP2021114284A (en) Method and apparatus for predicting punctuation mark
KR20210127613A (en) Method and apparatus for generating conversation, electronic device and storage medium
WO2020052060A1 (en) Method and apparatus for generating correction statement
JP2022031854A (en) Generation method of reply content, device, apparatus and storage medium
CN117539975A (en) Method, device, equipment and medium for generating prompt word information of large language model
CN111158648B (en) Interactive help system development method based on live-action semantic understanding and platform thereof
CN117312140A (en) Method and device for generating test case, electronic equipment and storage medium
CN115631251B (en) Method, device, electronic equipment and medium for generating image based on text
US9501264B2 (en) User corrections in translation
JP7349523B2 (en) Speech recognition method, speech recognition device, electronic device, storage medium computer program product and computer program
CN118197305A (en) Voice control method and device and electronic equipment
CN112799658B (en) Model training method, model training platform, electronic device, and storage medium
CN114065783A (en) Text translation method, device, electronic equipment and medium
CN113851106A (en) Audio playing method and device, electronic equipment and readable storage medium
CN108153574B (en) Application processing method and device and electronic equipment
CN111124387A (en) Modeling system, method, computer device and storage medium for machine learning platform
CN113553863B (en) Text generation method, device, electronic equipment and storage medium
Shen et al. Teach Once and Use Everywhere--Building AI Assistant Eco-Skills via User Instruction and Demonstration (poster)
CN112560462B (en) Event extraction service generation method, device, server and medium
CN113066498B (en) Information processing method, apparatus and medium

Legal Events

Date Code Title Description
PB01 Publication