CN117636874A - Robot dialogue method, system, robot and storage medium


Info

Publication number
CN117636874A
CN117636874A (application CN202311477336.2A)
Authority
CN
China
Prior art keywords
text
prompt
robot
dialogue
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311477336.2A
Other languages
Chinese (zh)
Inventor
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cloudminds Shanghai Robotics Co Ltd
Original Assignee
Cloudminds Shanghai Robotics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cloudminds Shanghai Robotics Co Ltd filed Critical Cloudminds Shanghai Robotics Co Ltd
Priority to CN202311477336.2A priority Critical patent/CN117636874A/en
Publication of CN117636874A publication Critical patent/CN117636874A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3343 Query execution using phonetics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/35 Categorising the entire scene, e.g. birthday party or wedding scene
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a robot dialogue method, system, robot and storage medium. The system comprises a robot and a cloud server. The robot collects a speaker's voice signal and a scene image of the speaker's location, determines a first text corresponding to the voice signal and a second text corresponding to the scene image, and sends both texts to the cloud server. The cloud server obtains a prompt text corresponding to the first and second texts based on a preset prompt template, inputs the prompt text into a large language model to obtain the corresponding dialogue content, and sends the dialogue content to the robot so that the robot can interact with the speaker based on it. The preset prompt template constrains the content format used to describe the first and second texts, and the prompt text guides the large language model's reasoning. This reduces end-to-end latency while enabling the robot to converse with the speaker more intelligently.

Description

Robot dialogue method, system, robot and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a robot dialogue method, a robot dialogue system, a robot and a storage medium.
Background
In recent years, with the rapid development of artificial intelligence technology, robots have become an industry hotspot, and new kinds of robots emerge one after another. At present, users mainly interact with robots through voice dialogue, and the dialogue pipeline is completed mainly in the cloud. However, operations such as feature extraction and model inference are time-consuming, so a noticeable delay exists between the robot end and the cloud during a voice dialogue; the real-time requirement of voice interaction cannot be met, and user experience suffers. How to reduce the robot-to-cloud latency during a voice dialogue between the user and the robot has therefore become a technical problem to be solved.
Disclosure of Invention
Embodiments of the invention provide a robot dialogue method, system, robot and storage medium, which improve the performance of the robot dialogue system, reduce the end-to-end latency during a voice dialogue between the user and the robot, and enable the robot to interact with the user more intelligently.
In a first aspect, an embodiment of the present invention provides a robot dialog system, the system including:
the robot and the cloud server;
the robot is configured to collect a voice signal of a speaker and a scene image of the speaker's location, determine a first text corresponding to the voice signal and a second text corresponding to the scene image, and send the first text and the second text to the cloud server, wherein the second text describes the image content of the scene image;
the cloud server is configured to obtain a prompt text corresponding to the first text and the second text based on a preset prompt template, input the prompt text into a large language model to obtain dialogue content corresponding to the prompt text, and send the dialogue content to the robot so that the robot performs dialogue interaction with the speaker based on the dialogue content; the preset prompt template constrains a content format describing the first text and the second text, and the prompt text guides the large language model's reasoning.
In a second aspect, an embodiment of the present invention provides a robot dialogue method, applied to a robot in a robot dialogue system, where the method includes:
collecting a voice signal of a speaker and a scene image of the speaker's location, wherein the voice signal comprises the speaker's voice;
determining a first text corresponding to the voice signal;
determining a second text corresponding to the scene image, wherein the second text describes the image content of the scene image;
sending the first text and the second text to the cloud server, so that the cloud server obtains a prompt text corresponding to the first text and the second text based on a preset prompt template and inputs the prompt text into a large language model to obtain dialogue content corresponding to the prompt text, wherein the preset prompt template constrains a content format describing the first text and the second text, and the prompt text guides the large language model's reasoning; and
receiving the dialogue content sent by the cloud server, and performing dialogue interaction with the speaker based on the dialogue content.
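The robot-side steps of the second aspect can be sketched in a few lines. This is a minimal, hedged sketch: `recognize_speech` and `describe_scene` are hypothetical stand-ins for the on-device speech recognition and scene-description models, which the patent does not name, and `send_to_cloud` stands in for the network call to the cloud server.

```python
def recognize_speech(voice_signal: bytes) -> str:
    """Stand-in for on-device speech recognition producing the first text."""
    return "what should I wear today"

def describe_scene(scene_image: bytes) -> str:
    """Stand-in for on-device scene understanding producing the second text."""
    return "an adult standing indoors near a window"

def robot_dialogue_step(voice_signal: bytes, scene_image: bytes, send_to_cloud):
    first_text = recognize_speech(voice_signal)   # transcribe the voice signal
    second_text = describe_scene(scene_image)     # describe the scene image
    # Only the two short texts go to the cloud, not raw audio/images --
    # this is what reduces transmission volume and end-to-end latency.
    dialogue_content = send_to_cloud(first_text, second_text)
    return dialogue_content                       # used to interact with the speaker

# Usage with a fake cloud round trip:
reply = robot_dialogue_step(b"...", b"...", lambda f, s: f"[reply to: {f} | scene: {s}]")
```

The design point is that the heavy preprocessing happens on the robot, so the payload crossing the network is two compact strings rather than raw sensor data.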
In a third aspect, an embodiment of the present invention provides a robot dialogue apparatus, located in a robot in a robot dialogue system, the apparatus comprising:
an acquisition module, configured to collect a voice signal of a speaker and a scene image of the speaker's location, wherein the voice signal comprises the speaker's voice;
The first determining module is used for determining a first text corresponding to the voice signal;
the second determining module is used for determining a second text corresponding to the scene image, wherein the second text is used for describing the image content of the scene image;
the sending module is used for sending the first text and the second text to the cloud server, so that the cloud server obtains prompt texts corresponding to the first text and the second text based on a preset prompt template, and inputs the prompt texts into a large language model to obtain dialogue contents corresponding to the prompt texts; the preset prompt template is used for limiting a content format for describing the first text and the second text, and the prompt text is used for guiding the large language model to make reasoning;
and the interaction module is used for receiving the dialogue content sent by the cloud server and carrying out dialogue interaction with the speaker based on the dialogue content.
In a fourth aspect, an embodiment of the present invention provides a robot, including: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon which, when executed by the processor, causes the processor to perform the robot conversation method of the second aspect.
In a fifth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having executable code stored thereon, which when executed by a processor of an electronic device, causes the processor to at least implement a robot conversation method as described in the second aspect.
In a sixth aspect, an embodiment of the present invention provides a robot dialogue method, which is applied to a cloud server in a robot dialogue system, where the method includes:
receiving a first text and a second text sent by a robot, wherein the robot collects a voice signal of a speaker and a scene image of the position of the speaker, determines the first text corresponding to the voice signal and the second text corresponding to the scene image, and sends the first text and the second text to the cloud server; the second text is used for describing the image content of the scene image;
obtaining a prompt text corresponding to the first text and the second text based on a preset prompt template, wherein the preset prompt template constrains a content format describing the first text and the second text, and the prompt text guides a large language model's reasoning; and
inputting the prompt text into the large language model, obtaining dialogue content corresponding to the prompt text, and sending the dialogue content to the robot so that the robot performs dialogue interaction with the speaker based on the dialogue content.
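The cloud-side steps of the sixth aspect can likewise be sketched. The template wording below and the `llm` callable are assumptions for illustration only; the patent only requires a fixed-format template that constrains how the two texts are described.

```python
# Hypothetical fixed-format prompt template (wording is an assumption).
PROMPT_TEMPLATE = (
    'The user said: "{first}". '
    "The scene around the user shows: {second}. "
    "Reply to the user, taking the scene into account."
)

def build_prompt(first_text: str, second_text: str) -> str:
    # The fixed format means different first/second texts are always
    # described uniformly, which guides the model's reasoning.
    return PROMPT_TEMPLATE.format(first=first_text, second=second_text)

def cloud_dialogue_step(first_text: str, second_text: str, llm):
    prompt = build_prompt(first_text, second_text)
    return llm(prompt)  # dialogue content, to be sent back to the robot
```

In practice `llm` would be a call into the deployed large language model; here any callable taking a prompt string works.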
In a seventh aspect, an embodiment of the present invention provides a robot dialogue device, which is located in a cloud server in a robot dialogue system, and the device includes:
a receiving module, configured to receive a first text and a second text sent by the robot, wherein the robot collects a voice signal of a speaker and a scene image of the speaker's location, determines the first text corresponding to the voice signal and the second text corresponding to the scene image, and sends the first text and the second text to the cloud server, the second text describing the image content of the scene image;
an acquisition module, configured to obtain a prompt text corresponding to the first text and the second text based on a preset prompt template, wherein the preset prompt template constrains a content format describing the first text and the second text, and the prompt text guides a large language model's reasoning; and
a processing module, configured to input the prompt text into the large language model, obtain dialogue content corresponding to the prompt text, and send the dialogue content to the robot so that the robot performs dialogue interaction with the speaker based on the dialogue content.
In an eighth aspect, an embodiment of the present invention provides a cloud server, including: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon, which when executed by the processor, causes the processor to at least implement the robot conversation method of the sixth aspect.
In a ninth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of a user equipment, causes the processor to at least implement a robot conversation method as described in the sixth aspect.
The robot dialogue scheme provided by the embodiments of the invention mainly involves a robot and a cloud server. During a voice dialogue with a user, the robot first collects the speaker's voice signal and a scene image of the speaker's location, then determines a first text corresponding to the voice signal and a second text corresponding to the scene image, where the first text describes the speech content and the second text describes the image content of the scene image. Finally, the robot sends the resulting first and second texts to the cloud server. After receiving them, the cloud server obtains a prompt text corresponding to the first and second texts based on a preset prompt template and inputs the prompt text into a large language model to obtain the corresponding dialogue content. Because the preset prompt template constrains the format in which the first and second texts are described, the generated fixed-format prompt text can guide the large language model's reasoning toward more accurate dialogue content.
In this scheme, dialogue processing is split into two parts: the robot preprocesses the collected voice signal and scene image of the speaker to obtain the corresponding first and second texts, and then transmits these texts to the cloud server, which can process them directly. This improves the cloud server's processing efficiency and reduces end-to-end latency. In addition, during dialogue processing, personalized dialogue content is generated by incorporating scene information about the speaker's environment, which improves user experience. Obtaining a prompt text for the voice signal and scene image through the preset prompt template lets the prompt text guide the large language model to better understand the dialogue scene and reason with it, yielding more accurate dialogue content so that the robot can interact with the speaker more intelligently.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application schematic diagram of a robot dialogue system according to an embodiment of the present invention;
fig. 2 is an application schematic diagram of another robot dialogue system according to an embodiment of the present invention;
FIG. 3 is a flowchart of a robot dialogue method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another robot dialogue method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a robot dialogue device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a robot according to the present embodiment;
fig. 7 is a schematic structural diagram of another robot dialogue device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a cloud server according to the present embodiment.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
As artificial intelligence technology evolves, intelligent robots are being applied in more and more scenarios. At present, users can interact with an intelligent robot through voice dialogue, and the dialogue pipeline is completed mainly in the cloud. However, feature extraction, model inference and similar steps are time-consuming, so a noticeable delay exists between the robot end and the cloud; the real-time requirement of voice interaction cannot be met, and user experience suffers.
In addition, when current intelligent robots interact with a user, they cannot perceive the user's current environment, posture or personal attributes; they can only answer questions mechanically, and it is difficult to meet the needs of users across different ages, genders, emotions, weather conditions, geographic locations and postures.
Therefore, how to reduce the robot-to-cloud latency during voice dialogue and move beyond the simple interaction mode of existing service robots, so that they become more intelligent, interact with users in a targeted manner, and improve the user experience, has become an urgent technical problem for the robot industry.
To solve the above problems, embodiments of the invention provide a new robot dialogue scheme in which dialogue processing is completed cooperatively by the robot and the cloud server. The collection and preprocessing of the speaker's voice signal and scene image are completed at the robot end, and the first text corresponding to the preprocessed voice signal and the second text corresponding to the scene image are transmitted directly to the cloud server. The cloud server can then perform subsequent processing directly on the first and second texts, which reduces its processing load, improves processing efficiency, and lowers the latency between the robot end and the cloud server during a voice dialogue. In addition, when the cloud server generates dialogue content for the speaker with the large language model, the first and second texts are first converted, based on a preset prompt template, into a prompt text whose content description has a fixed format, so that the large language model can better understand the first and second texts and reason out dialogue content more closely tied to the speaker, further improving user experience.
Some embodiments of the present invention are described in detail below with reference to the accompanying drawings. In the case where there is no conflict between the embodiments, the following embodiments and features in the embodiments may be combined with each other.
In order to better understand the solution, before describing a specific embodiment of the robot dialogue method, an exemplary description of the robot dialogue system is given in connection with an application scenario.
Fig. 1 is an application schematic diagram of a robot dialogue system according to an embodiment of the present invention. As shown in Fig. 1, the robot dialogue system includes a robot and a cloud server. The robot may be an operation robot, a chat robot, a sweeping robot, a transfer robot, or any other robot that directly provides dialogue services to users. In practical applications, a user can complete intelligent question answering through a voice dialogue with the robot, or control the robot to complete certain operations. The cloud server may be any device with sufficient computing capability, such as a single conventional server, a server cluster, a cloud host, or a virtual center.
A user can directly initiate an intelligent question-answering request to the robot through voice dialogue, or intelligently control the robot to complete corresponding operations, such as moving or delivering a meal. After receiving a voice dialogue request initiated by the user, the robot can determine the dialogue content for interacting with the speaker based on the speaker's voice signal and information about the speaker's current environment. If the dialogue initiated by the speaker is meant to control the robot to complete an operation, then besides generating dialogue content for the speaker, the robot can also call its control system to execute that operation; it can further suggest adjustments to the task to be executed according to the task and the speaker's current environment, generating corresponding dialogue content to interact with the speaker.
When the robot receives a voice dialogue request initiated by the user, it can first collect the speaker's voice signal and a scene image of the speaker's location. The voice signal contains the speaker's speech, and the scene image may include the speaker's environment, face, state and other information; the acquisition range, targets and so on can be determined according to actual requirements.
In the embodiment of the invention, in order to better perceive the speaker's current environment, posture and personal attributes during dialogue interaction, the robot collects the scene image of the speaker's location at the same time as the voice signal, obtaining information such as the current environment and the speaker's age, expression and actions. Corresponding dialogue content can then be generated from the current environment information and the speaker's basic personal information, providing better service to the user.
The embodiment of the present invention does not limit how the robot collects the speaker's voice signal and the scene image of the speaker's location; those skilled in the art may choose according to specific application requirements. For example, the robot can collect the scene image and the voice signal through its onboard camera and microphone.
After collecting the voice signal and the scene image, the robot can preprocess them; specifically, it determines a first text corresponding to the voice signal and a second text corresponding to the scene image. The first text describes the speech content spoken by the speaker, and the second text describes the image content of the scene image.
In the embodiment of the invention, the robot can determine the first text by performing text conversion on the voice signal. The embodiment does not limit the specific implementation; those skilled in the art may choose according to application and design requirements. In one implementation, the first text is determined by a preset machine learning model: a speech recognition model performs text conversion on the voice signal to obtain the first text corresponding to it.
The speech recognition model may be trained in advance to determine the text corresponding to a speech signal; for example, it may be an automatic speech recognition (ASR) model. A speech recognition model can also be generated by training a deep learning network, i.e., training the network with preset reference speech signals and the standard texts corresponding to them. Once the speech recognition model is built, the speaker's voice signal can be analyzed with it to obtain the corresponding first text.
In the embodiment of the invention, the second text corresponding to the scene image can be determined by performing semantic recognition on the scene image. The embodiment does not limit the specific implementation; those skilled in the art may choose according to application and design requirements. In one implementation, the second text is determined by a preset machine learning model: an image semantic segmentation model performs semantic segmentation on the scene image to obtain the semantic information and position information corresponding to each object in the image, and the second text is determined based on that semantic and position information.
Each object can be a person, background, building, plant, etc. in the speaker's scene. The semantic information may include at least one of visual-layer semantics and object-layer semantics. Visual-layer semantics may include color, texture, shape and the like; object-layer semantics may include attribute features of objects in the scene image, such as the state of an object at a certain moment or the structure an object contains. The position information may indicate where each object is located within the scene image.
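The per-object semantic and position information described above has to be flattened into the descriptive second text before it can be sent to the cloud. A minimal sketch follows; the object fields (`label`, `position`) and the sentence template are illustrative assumptions, not the patent's schema.

```python
def objects_to_second_text(objects):
    """Combine per-object semantics and positions into one descriptive string."""
    parts = [f"{o['label']} at the {o['position']}" for o in objects]
    return "The scene contains " + ", ".join(parts) + "."

# Example output of a segmentation model: semantics plus position per object.
objects = [
    {"label": "a smiling adult", "position": "center"},
    {"label": "a dining table", "position": "left"},
]
second_text = objects_to_second_text(objects)
# "The scene contains a smiling adult at the center, a dining table at the left."
```

Because the result is plain text, it is small to transmit and directly consumable by the prompt template on the cloud side.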
In practical applications, to improve processing efficiency on the robot side, the collected scene image may be analyzed with MobileSAM (a lightweight, mobile-oriented variant of the Segment Anything Model) or another image semantic segmentation model, to obtain the semantic information and position information corresponding to each object in the scene image.
After obtaining the first text corresponding to the speaker's voice signal and the second text corresponding to the scene image, the robot sends them to the cloud server, which processes them to obtain the corresponding dialogue content. Deploying the voice-signal and scene-image preprocessing at the robot end reduces the data-processing load on the cloud server and lowers end-to-end latency. Moreover, the data volume of the first and second texts is markedly smaller than that of the raw voice signal and scene image, which improves transmission efficiency and further reduces end-to-end latency during the voice dialogue, so the robot can interact with the user more smoothly and meet the user's real-time voice interaction needs.
To help the cloud server process the first text and the second text and obtain more accurate dialogue content, the robot may insert a separator between the first text and the second text when sending them to the cloud server, so that the two text categories can be distinguished and accurately analyzed later. In addition, corresponding labels may be added to the first text and the second text in advance so that the cloud server can better recognize each text, for example: first text <text corresponding to the speech signal> <separator> second text <text corresponding to the image>, and so on.
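A minimal sketch of this payload format follows; the tag names and the separator token are illustrative assumptions rather than a fixed protocol.

```python
# Sketch of the labeled, separator-delimited payload described above.
# "<speech>", "<image>", and "<sep>" are hypothetical markers chosen for
# this example; any unambiguous tag and separator scheme would work.

SEPARATOR = "<sep>"

def build_payload(first_text, second_text):
    """Join the speech text and scene text with labels and a separator."""
    return (
        f"<speech>{first_text}</speech>"
        f"{SEPARATOR}"
        f"<image>{second_text}</image>"
    )

payload = build_payload("What is this plant?", "a green plant on the left")
```

On the cloud side, splitting on the separator and stripping the labels recovers the two texts and their categories unambiguously.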
After receiving the first text and the second text sent by the robot, the cloud server acquires the prompt text corresponding to the first text and the second text based on a preset prompt template. The preset prompt template defines a content format for describing the first text and the second text, and the prompt text is used to guide the large language model's reasoning. Because the preset prompt template defines this content format, the first text and the second text can be converted into a prompt text with a fixed format, so that different first and second texts are described in a uniform format.
Because the collected voice signal and scene image are data of different modalities, the second text obtained by processing the scene image represents visual features, whereas the large language model is better at processing textual features. If the second text were input directly into the large language model, the model could not truly understand the scene information, and thus could not combine it to generate corresponding dialogue content. Therefore, in the embodiment of the invention, before the large language model is used, a preset prompt template is adopted to convert the first text and the second text into a prompt text, giving the large language model certain prompt information to guide it to better understand the first text and the second text and output the dialogue content the user wants.
Specifically, the preset prompt template may be a preset prompt-learning template, a preset chain-of-thought prompt template, etc.; the type and format of the preset prompt template are not limited and may be set according to actual requirements. The preset prompt template may include input information, a textual description of the task to be processed, output information, prompt information, a reasoning process for the task to be processed, and the like.
The preset prompt-learning template may include multiple elements, such as an element defining task input information, an element defining task description information, and an element defining task output information. In implementation, the representation elements of the prompt-learning template are determined first, and then the task input information and task description information corresponding to the task to be processed are acquired based on those elements, so that different dialogue tasks are described in a unified format. Any dialogue task may be described uniformly by an <input, task description> pair, by an <input, task description, output> triplet, and so on. In this way, the large language model can clearly determine the specific task to be processed, the input content, the output content, and other information from the prompt text, and can better complete the task of generating dialogue content.
To better exploit the language understanding and reasoning capabilities of the large language model, so as to obtain more accurate dialogue content and meet users' personalized requirements, a preset chain-of-thought prompt template may be adopted to convert the first text and the second text into a prompt text in a fixed format. The preset chain-of-thought prompt template includes a reasoning process for generating the corresponding dialogue content based on the first text and the second text; compared with a traditional prompt template, adding the chain of thought enhances the reasoning capability of the large language model. A chain of thought is a series of short sentences that imitates the thinking and possible reasoning process of a human answering a question, so that the large language model does not simply predict the final answer to a question but predicts the "thought process" corresponding to it, significantly improving the model's capability in complex reasoning.
The preset chain-of-thought template may include elements that define the task processing procedure. In a specific implementation, the first text and the second text are filled into the corresponding positions according to the representation elements included in the chain-of-thought prompt template to obtain the corresponding prompt text, and any dialogue task may thus be converted into a description in chain-of-thought form.
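The filling step described above can be sketched as follows; the template wording and the particular reasoning steps are assumptions made for illustration, not the patent's actual template.

```python
# Illustrative chain-of-thought prompt template. The template text and the
# enumerated reasoning steps are hypothetical; a real template would be
# designed for the specific large language model in use.

COT_TEMPLATE = (
    "Input: the speaker said: {first_text}\n"
    "Scene: {second_text}\n"
    "Let's reason step by step:\n"
    "1. Identify what the speaker is asking about.\n"
    "2. Find the relevant objects in the scene description.\n"
    "3. Combine both to produce a reply.\n"
    "Answer:"
)

def fill_cot_template(first_text, second_text):
    """Fill the first and second text into the chain-of-thought template."""
    return COT_TEMPLATE.format(first_text=first_text, second_text=second_text)

prompt = fill_cot_template(
    "What is this plant?", "a green plant on the left of the frame"
)
```

Because the reasoning steps are part of the prompt itself, the model is nudged to produce the intermediate "thought process" before the final answer.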
In addition, multiple preset prompt templates may be created and stored in the cloud server in advance. After the cloud server receives the first text and the second text sent by the robot, it may select a target prompt template from the multiple preset prompt templates according to a preset template-selection strategy, and acquire the prompt text corresponding to the first text and the second text based on the target prompt template. The preset prompt templates may include various types, such as preset prompt-learning templates and preset chain-of-thought prompt templates. In an alternative embodiment, the dialogue type may also be determined from the first text, and the corresponding target prompt template determined according to the dialogue type, the first text, and the second text.
After the prompt text is acquired, it can be input into the large language model to obtain the corresponding dialogue content, and the generated dialogue content is sent to the robot so that the robot can interact with the speaker based on it. Because the prompt text includes not only the first text and the second text as input information but also the task description and corresponding prompt information, it can better stimulate the understanding and reasoning capabilities of the large language model, guiding it to understand the speaker's dialogue content and dialogue scene and to combine them to generate more targeted dialogue content. This prevents the robot from conversing with the speaker mechanically and improves the dialogue capability of the robot dialogue system.
In the embodiment of the invention, the dialogue processing is divided into two parts: the robot preprocesses the collected voice signal and scene image of the speaker to obtain the corresponding first text and second text, and then transmits them to the cloud server, so that the cloud server can process the first text and the second text directly. This improves the cloud server's processing efficiency and reduces the latency between the two ends. In addition, during dialogue processing, personalized dialogue content is generated by combining the scene information of the speaker's environment, which improves user experience. The prompt text corresponding to the voice signal and scene image is obtained through the preset prompt template, and the prompt text guides the large language model to better understand the dialogue scene and reason in combination with it, yielding more accurate dialogue content and enabling the robot to interact with the speaker more intelligently.
The above embodiment describes the dialogue processing procedure of the robot dialogue system. The robot is mainly used to collect the speaker's voice signal and a scene image of the speaker's position, and to preprocess them into a first text corresponding to the voice signal and a second text corresponding to the scene image. The cloud server is mainly used to convert the first text and the second text into a prompt text in a fixed format, guiding the large language model to better understand the speaker's voice content and dialogue scene and to reason according to the prompt text so as to generate the dialogue content for the speaker. In this way, different speakers, or the same speaker in different scenes, receive different dialogue content for the same voice content, meeting users' personalized requirements and enabling more intelligent dialogue interaction.
The second text mainly describes visual feature information; however, the space of visual features differs to some extent from the space of textual features. If the corresponding prompt text were generated directly from the first text and the second text and input into the large language model, the model could not fully understand the dialogue scene information actually reflected by the second text, nor combine it well to generate high-quality dialogue content. Therefore, before the prompt text corresponding to the first text and the second text is acquired based on the preset prompt template, multi-modal information alignment processing may be performed on the second text to obtain a text description that the large language model understands better, and the prompt text is then acquired based on the processed description.
In an alternative embodiment, after the cloud server receives the first text and the second text sent by the robot, label processing may be applied to the second text so that the large language model can better understand it. Specifically, after receiving the first text and the second text, the cloud server may first add labels to the second text to obtain a labeled second text, and then acquire the prompt text corresponding to the first text and the labeled second text based on a preset chain-of-thought prompt template. The labels represent the categories of the objects included in the scene image.
In an optional embodiment, the cloud server may also set in advance the weights corresponding to objects that may appear in a scene; specifically, the weight of each object is determined according to its importance to the dialogue content. When performing label processing on the second text, the category of each object included in the scene image and the corresponding weight are added to the labels, so that the large language model can better process the first text and the second text in combination with the label information, obtaining dialogue content that better fits the current dialogue scene and enabling the robot to interact with the speaker more intelligently.
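A sketch of this label-and-weight scheme follows; the category-weight table and the label syntax are hypothetical assumptions for illustration.

```python
# Hypothetical per-category weights reflecting each object's importance to
# the dialogue content; the values and the <label ...> syntax are assumed
# for this sketch, not prescribed by the system.

CATEGORY_WEIGHTS = {"person": 0.9, "plant": 0.5, "background": 0.1}

def add_labels(objects, default_weight=0.3):
    """Wrap each object description in a label carrying category and weight."""
    labeled = []
    for obj in objects:
        weight = CATEGORY_WEIGHTS.get(obj["category"], default_weight)
        labeled.append(
            f"<label category={obj['category']} weight={weight}>"
            f"{obj['text']}</label>"
        )
    return " ".join(labeled)

labeled_second_text = add_labels([
    {"category": "person", "text": "a person standing nearby"},
    {"category": "plant", "text": "a green plant on the left"},
])
```

Downstream, the large language model (or a pre-filter) can prioritize high-weight objects when composing the reply.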
In addition, multi-modal information alignment processing may be performed on the second text using a multi-modal large model or a multi-modal module. Specifically, the multi-modal model is used to align the image visual information with the voice information in the second text, generating a third text associated with the voice signal; the prompt text corresponding to the first text and the third text is then obtained through the preset chain-of-thought prompt template. The multi-modal model understands the semantics of the second text and establishes the association between the image visual information and the voice information to obtain the third text. The multi-modal model may be, for example, a multi-modal large model such as a Q-Former.
The Q-Former connects the image semantic segmentation model and the large language model: the second text obtained from the image semantic segmentation model is input into the cross-attention mechanism of the Q-Former model, which addresses the difficulty of aligning the space of visual features with the space of textual features. By aligning the visual and textual modalities with the Q-Former, the large language model can better understand the visual feature information contained in the scene image. The cross-attention mechanism is designed for multi-modal information interaction; in a sense it also performs visual information extraction, extracting the corresponding visual information according to the textual information, so it can be used directly for visual information extraction to obtain the third text associated with the voice signal.
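The cross-attention step can be illustrated with a minimal NumPy sketch in which text-side query vectors attend over visual feature vectors. The dimensions and random features here are placeholders; a real Q-Former uses trained projection weights, multiple layers, and learned query tokens.

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: text queries attend to visual tokens."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)           # (n_queries, n_visual)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over visual tokens
    return weights @ values                           # (n_queries, d)

rng = np.random.default_rng(0)
text_queries = rng.normal(size=(4, 8))    # 4 stand-in learned query tokens
visual_tokens = rng.normal(size=(16, 8))  # 16 stand-in visual feature vectors
aligned = cross_attention(text_queries, visual_tokens, visual_tokens)
```

Each output row is a text-conditioned mixture of visual features, which is the sense in which cross-attention "extracts corresponding visual information according to the textual information".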
In practical applications, each user's speaking style and habits differ, and the same dialogue content may have many expressions, for example: "What is the weather today?", "What is the temperature today?", "How is the weather?". For the same dialogue content, the voice signals collected from different speakers differ, so the first text converted from the voice signal may not clearly and unambiguously express what the speaker actually intends; the dialogue content the cloud server generates from the prompt text corresponding to the first text and the second text may therefore not be what the user needs. To ensure that the dialogue content finally obtained by the robot dialogue system meets user requirements, the prompt text can be further optimized, so that the optimized text describes the dialogue content expressed by the speaker more accurately and the cloud server can generate the corresponding dialogue content more accurately.
To understand the dialogue processing procedure of the cloud server more clearly, a specific processing procedure is exemplarily described with reference to fig. 2. Fig. 2 is an application schematic diagram of another robot dialogue system according to an embodiment of the present invention. As shown in fig. 2, the robot dialogue system includes a robot and a cloud server, and the cloud server includes a multi-modal processing module, a preprocessing module, and a large language model processing module.
When the robot detects that a user performs voice conversation with the robot, a voice signal of a speaker and a scene image of the position of the speaker are collected, and a first text corresponding to the voice signal and a second text corresponding to the scene image are determined. Specifically, performing text conversion processing on the voice signal by using a voice recognition model to obtain a first text corresponding to the voice signal; and carrying out semantic segmentation processing on the scene image by using the image semantic segmentation model so as to obtain semantic information corresponding to each object in the scene image and position information corresponding to each object, and determining a second text corresponding to the scene image based on the semantic information and the position information. And finally, the first text and the second text are sent to a cloud server.
After the cloud server receives the first text and the second text, the multi-modal processing module uses the multi-modal model to align the image visual information with the voice information in the second text, generating a third text associated with the voice signal. The multi-modal model understands the semantics of the second text and establishes the association between the image visual information and the voice information to obtain the third text. Converting the visual feature information described in the second text into a third text that the large language model understands better improves the model's understanding of the dialogue scene information, so that customized dialogue content matching the dialogue scene is generated and the user experience is improved.
The first text and the third text are then processed by the preprocessing module and converted into a fixed-format prompt text. Specifically, the preprocessing module acquires the prompt text corresponding to the first text and the third text based on a preset chain-of-thought prompt template. For example, the preset chain-of-thought prompt template may be: "<sys></sys> <user query>text corresponding to the user's voice</user query> <background instruct>text corresponding to the semantic information and the position information of the objects in the speaker's current environment</background instruct>". A prompt text with a fixed format can be generated according to this preset template; the prompt that guides the large language model's work is placed inside "<sys></sys>", and the fixed format ensures the relevance and stability of the large model's understanding and generation.
A chain of thought is a series of short sentences that imitates the human thinking and possible reasoning process in answering a question. In short, the chain of thought can be regarded as a form of discrete prompt learning; compared with traditional in-context learning, it adds steps that analyze and solve the question to the prompt. That is, the chain of thought makes the large language model predict not simply the final answer to a question but the "thought process" corresponding to it, significantly increasing the model's ability to perform complex reasoning.
To improve the quality of the generated dialogue content, after the prompt text corresponding to the first text and the third text is acquired, the preprocessing module may optimize the prompt text, further enriching and refining the user's language information and preventing overly short or overly generalized information from entering the large language model and degrading multi-round dialogue. Specifically, historical dialogue information is obtained, the prompt text is rewritten based on it to obtain a processed prompt text, and the processed prompt text is input into the large language model to obtain the corresponding dialogue content.
In an alternative embodiment, the prompt text may be spliced with the history information in the dialogue system cache, and the spliced text taken as the processed prompt text. This further enriches and refines the language information of human-machine interaction, prevents overly fragmented or generalized information from affecting the overall performance of the dialogue, and lets the processed prompt text carry information such as multi-round dialogue memory.
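The splicing described above can be sketched as follows; the turn format, separator, and history window size are illustrative assumptions.

```python
# Sketch of splicing cached dialogue history into the current prompt.
# The "(speaker, utterance)" turn format and the newline separator are
# assumptions chosen for this example.

def splice_history(history, prompt_text, max_turns=4):
    """Prepend the most recent cached dialogue turns to the current prompt."""
    recent = history[-max_turns:]
    lines = [f"{speaker}: {utterance}" for speaker, utterance in recent]
    return "\n".join(lines + [prompt_text])

history = [("user", "Hello"), ("robot", "Hi, how can I help you?")]
processed_prompt = splice_history(history, "user: What is this plant?")
```

Capping the window at the most recent turns keeps the prompt within the model's context budget while still carrying multi-round memory.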
Then, the processed prompt text is input into the large language model processing module, where the large language model processes it to obtain the corresponding dialogue content. Finally, the dialogue content is sent to the robot so that the robot interacts with the speaker based on it. With the preset chain-of-thought prompt template, the large language model maintains structured understanding and reasoning; meanwhile, the intervention of the chain of thought makes the whole dialogue process more stable and reliable and yields more valuable dialogue content.
According to the embodiment of the invention, the multi-modal processing module converts the visual feature information described in the second text corresponding to the scene image into a third text that the large language model understands better, improving the model's understanding of the dialogue scene information. The preprocessing module rewrites the prompt text based on the historical dialogue information to obtain the processed prompt text, which further enriches and refines the language information of human-machine interaction, prevents overly fragmented or generalized information from affecting overall dialogue performance, and lets the processed prompt text carry information such as multi-round dialogue memory, so that the large language model can better understand the prompt text and reason accordingly to obtain high-quality dialogue content.
Having generally described the robotic dialog system, the following describes exemplary execution of a dialog by the robotic dialog system in connection with fig. 3.
Fig. 3 is a flowchart of a robot dialogue method provided in an embodiment of the present invention, and as shown in fig. 3, the method may be applied to a robot in a robot dialogue system, and the method includes the following steps:
301. a speech signal of a speaker and a scene image of the position of the speaker are collected, wherein the speech signal contains the speech of the speaker.
302. A first text corresponding to the speech signal is determined.
303. And determining a second text corresponding to the scene image, wherein the second text is used for describing the image content of the scene image.
304. The method comprises the steps that a first text and a second text are sent to a cloud server, so that the cloud server obtains prompt texts corresponding to the first text and the second text based on a preset prompt template, the prompt texts are input to a large language model, dialogue contents corresponding to the prompt texts are obtained, the preset prompt template is used for limiting content formats describing the first text and the second text, and the prompt texts are used for guiding the large language model to conduct reasoning.
305. And receiving dialogue content sent by the cloud server, and performing dialogue interaction with a speaker based on the dialogue content.
After the robot receives the user's dialogue request, it collects the speaker's voice signal and a scene image of the speaker's position. The voice signal contains the speaker's speech; the voice signal and the scene image of the current speaker's position can be collected through the camera equipment and sound-pickup equipment carried by the robot.
Next, a first text corresponding to the speech signal is determined. Wherein, the voice signal can be subjected to text conversion processing by utilizing the voice recognition model so as to obtain a first text corresponding to the voice signal. And then, carrying out semantic segmentation processing on the scene image by using the image semantic segmentation model so as to obtain semantic information corresponding to each object in the scene image and position information corresponding to each object, and determining a second text corresponding to the scene image based on the semantic information and the position information. Wherein the second text is used to describe the image content of the scene image.
And then, the robot sends the first text and the second text to the cloud server, so that the cloud server obtains prompt texts corresponding to the first text and the second text based on a preset prompt template, the prompt texts are input into the large language model, dialogue contents corresponding to the prompt texts are obtained, the preset prompt template is used for limiting content formats describing the first text and the second text, and the prompt texts are used for guiding the large language model to conduct reasoning.
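The robot-side flow of steps 301-304 can be sketched end to end as below; the speech recognizer and scene describer are stand-in stubs, since the real models (speech recognition, image semantic segmentation) are outside the scope of this sketch.

```python
# End-to-end sketch of the robot-side pipeline. The two helper functions
# are hypothetical stubs that return canned text; a real robot would run
# a speech-recognition model and an image semantic segmentation model.

def recognize_speech(voice_signal):
    # Stub standing in for the speech-recognition model (step 302).
    return "What is this plant?"

def describe_scene(scene_image):
    # Stub standing in for the image semantic segmentation step (step 303).
    return "a green plant at the left of the frame"

def robot_side_pipeline(voice_signal, scene_image, send_to_cloud):
    first_text = recognize_speech(voice_signal)
    second_text = describe_scene(scene_image)
    send_to_cloud(first_text, second_text)  # step 304: send to cloud server
    return first_text, second_text

sent = []
robot_side_pipeline(b"raw-audio", b"raw-image",
                    lambda first, second: sent.append((first, second)))
```

Injecting the transport as a callable keeps the pipeline testable without a live cloud server connection.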
According to the embodiment of the invention, the robot is used for preprocessing the acquired voice signal and scene image of the speaker to obtain the corresponding first text and second text, and then the processed first text and second text are transmitted to the cloud server, so that the cloud server can directly process the first text and the second text, the processing efficiency of the cloud server is improved, and the problem of time delay between the end and the end is reduced.
The specific implementation process involved in the embodiment of the present invention may refer to the content in each embodiment, which is not described herein.
FIG. 4 is a flowchart of another robot dialogue method according to an embodiment of the present invention; as shown in fig. 4, the method can be applied to a cloud server in a robot dialogue system, and the method comprises the following steps:
401. And receiving a first text and a second text sent by the robot, wherein the robot collects a voice signal of a speaker and a scene image of the position of the speaker, determines the first text corresponding to the voice signal and the second text corresponding to the scene image, and sends the first text and the second text to the cloud server, and the second text is used for describing the image content of the scene image.
402. And acquiring prompt texts corresponding to the first text and the second text based on a preset prompt template, wherein the preset prompt template is used for limiting a content format for describing the first text and the second text, and the prompt texts are used for guiding a large language model to carry out reasoning.
403. And inputting the prompt text into the large language model to obtain dialogue content corresponding to the prompt text, and sending the dialogue content to the robot so that the robot performs dialogue interaction with a speaker based on the dialogue content.
When the robot dialogue system provided by the embodiment of the invention performs dialogue processing, the robot and the cloud server complete the processing cooperatively. Specifically, after receiving a dialogue request initiated by a user, the robot collects the speaker's voice signal and a scene image of the speaker's position, determines the first text corresponding to the voice signal and the second text corresponding to the scene image, and then sends them to the cloud server. The first text mainly describes the speech content of the speaker, and the second text describes the image content of the scene image.
In order to reduce the time delay between the robot end and the cloud server end, after the robot collects the voice signals and the scene images of the speaker, the voice signals and the scene images can be processed first and converted into a first text and a second text, and then the first text and the second text are sent to the cloud server.
After receiving the first text and the second text sent by the robot end, the cloud server first obtains the prompt text corresponding to them based on a preset prompt template. The preset prompt template defines a content format for describing the first text and the second text, and the prompt text guides the large language model's reasoning. That is, the first text and the second text are converted into a prompt text in a fixed format to give the large language model a working prompt, ensuring the relevance and stability of the model's understanding of the dialogue content and its generation of dialogue content.
In order to better understand the dialogue scene information, after the second text is acquired, the second text may be first subjected to label processing, and then the prompt text is acquired based on the processed second text and the processed first text. Specifically, in an alternative embodiment, a label is added to the second text, a second text after the label is added is obtained, and a prompt text corresponding to the first text and the second text after the label is added is obtained based on a preset thinking chain prompt template. The labels are used for representing categories corresponding to all objects included in the scene image, and the preset prompt templates comprise preset thinking chain prompt templates.
In addition, the second text can be subjected to multi-modal information alignment processing, and the second text can be converted into a text which can be better understood by a large language model. Specifically, in an alternative embodiment, the multimodal model is used to perform alignment processing on the image visual information and the voice information on the second text, generate a third text associated with the voice signal, and obtain the prompt text corresponding to the first text and the third text based on the preset thinking chain prompt template. The multi-mode model is used for understanding the semantics of the second text and establishing the association relation between the image visual information and the voice information so as to obtain a third text associated with the voice information.
Finally, the cloud server inputs the prompt text into the large language model to obtain the corresponding dialogue content. In an alternative embodiment, to further enrich and refine the language information of human-machine interaction and prevent overly fragmented or generalized information from affecting the overall performance of the dialogue, the prompt text may be further optimized so that the prompt text input into the large language model carries multi-round dialogue memory information. Specifically, historical dialogue information is obtained, the prompt text is rewritten based on it to obtain a processed prompt text, and the processed prompt text is input into the large language model to obtain the corresponding dialogue content.
From the above description it is clear that, after a user's dialogue request is received, the dialogue processing is divided into two parts: the robot preprocesses the collected voice signal and scene image of the speaker to obtain the corresponding first text and second text, and then transmits them to the cloud server, so that the cloud server can process the first text and the second text directly, improving its processing efficiency and reducing the latency between the two ends.
In the embodiment of the invention, dialogue processing generates personalized dialogue content by combining the scene information of the speaker's environment, which improves the user experience. The prompt text corresponding to the voice signal and the scene image is obtained through a preset prompt template; the prompt text guides the large language model to better understand the dialogue scene, and the model reasons over the chain of thought embedded in the prompt text to produce more accurate dialogue content, so that the robot can interact with the speaker more intelligently.
For the specific implementation processes involved in this embodiment of the invention, reference may be made to the foregoing embodiments; they are not repeated here.
The robot dialogue device of one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that these devices can be configured from commercially available hardware components by following the steps taught in the present solution.
Fig. 5 is a schematic structural diagram of a robot dialogue device according to an embodiment of the present invention. As shown in fig. 5, the device is located in a robot dialogue system and includes: an acquisition module 11, a first determination module 12, a second determination module 13, a sending module 14 and an interaction module 15.
The acquisition module 11 is configured to collect a voice signal of a speaker and an image of the scene where the speaker is located, where the voice signal includes the speech of the speaker.
The first determining module 12 is configured to determine a first text corresponding to the voice signal.
The second determining module 13 is configured to determine a second text corresponding to the scene image, where the second text is used to describe the image content of the scene image.
The sending module 14 is configured to send the first text and the second text to the cloud server, so that the cloud server obtains a prompt text corresponding to the first text and the second text based on a preset prompt template, and inputs the prompt text to a large language model to obtain dialogue content corresponding to the prompt text. The preset prompt template is used to define the content format for describing the first text and the second text, and the prompt text is used to guide the large language model to perform reasoning.
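A minimal sketch of the preset prompt template the sending module relies on is shown below. The field names and wording are assumptions; the patent requires only that the template fix the format in which the first and second texts are described and that the resulting prompt guide the large language model's reasoning.

```python
# Hypothetical preset prompt template: a fixed content format that slots in
# the scene description (second text) and the user utterance (first text).
PROMPT_TEMPLATE = (
    "You are a robot assistant.\n"
    "Scene description: {scene}\n"
    "User utterance: {utterance}\n"
    "Reason step by step about the scene before answering."
)

def build_prompt(first_text: str, second_text: str) -> str:
    """Fill the preset template to obtain the prompt text for the LLM."""
    return PROMPT_TEMPLATE.format(scene=second_text, utterance=first_text)

print(build_prompt("where is the cup", "a cup on the table"))
```

Fixing the format this way lets the cloud server combine the two texts deterministically, regardless of what the robot observed.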
The interaction module 15 is configured to receive the dialogue content sent by the cloud server and perform dialogue interaction with the speaker based on the dialogue content.
In an alternative embodiment, the first determining module 12 may specifically be configured to: perform text conversion processing on the voice signal by using a voice recognition model, so as to obtain the first text corresponding to the voice signal.
In an alternative embodiment, the second determining module 13 may specifically be configured to: perform semantic segmentation processing on the scene image by using an image semantic segmentation model to obtain the semantic information and the position information corresponding to each object in the scene image, and determine the second text corresponding to the scene image based on the semantic information and the position information.
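The conversion from segmentation output to second text can be sketched as follows. The detection format (a label plus a position phrase per object) is a hypothetical post-processing of a segmentation model's output; the patent requires only that semantic and position information be obtained per object.

```python
# Hypothetical sketch: turn per-object semantic labels and position
# information from an image semantic segmentation model into the
# natural-language "second text" describing the scene image.
def scene_to_text(detections: list[dict]) -> str:
    """detections: [{'label': ..., 'position': ...}, ...] -> scene description."""
    clauses = [f"a {d['label']} {d['position']}" for d in detections]
    return "; ".join(clauses)

detections = [
    {"label": "cup", "position": "on the table"},
    {"label": "chair", "position": "near the window"},
]
print(scene_to_text(detections))
# prints: a cup on the table; a chair near the window
```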
The apparatus shown in fig. 5 may perform the steps performed by the robot in the robot dialogue method of the foregoing embodiments; for the detailed execution process and technical effects, reference is made to the descriptions in the foregoing embodiments, which are not repeated here.
The embodiment of the invention also provides a robot which, as shown in fig. 6, may include: a first processor 21, a first memory 22, and a first communication interface 23. The first memory 22 stores executable code which, when executed by the first processor 21, causes the first processor 21 to implement the robot dialogue method of the previous embodiments.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to at least implement the robot dialogue method provided in the previous embodiments.
Fig. 7 is a schematic structural diagram of another robot dialogue device according to an embodiment of the present invention, where, as shown in fig. 7, the device is located in a cloud server in a robot dialogue system, and the device includes: a receiving module 31, an acquiring module 32 and a processing module 33.
The receiving module 31 is configured to receive a first text and a second text sent by a robot, where the robot collects a voice signal of a speaker and a scene image of a location where the speaker is located, determines a first text corresponding to the voice signal and a second text corresponding to the scene image, and sends the first text and the second text to the cloud server; the second text is used to describe the image content of the scene image.
The obtaining module 32 is configured to obtain a prompt text corresponding to the first text and the second text based on a preset prompt template, where the preset prompt template is used to define the content format for describing the first text and the second text, and the prompt text is used to guide the large language model to perform reasoning.
And the processing module 33 is configured to input the prompt text into a large language model, obtain dialogue content corresponding to the prompt text, and send the dialogue content to the robot, so that the robot performs dialogue interaction with the speaker based on the dialogue content.
In an alternative embodiment, the obtaining module 32 may specifically be configured to: add labels to the second text to obtain a labeled second text, and obtain the prompt text corresponding to the first text and the labeled second text based on a preset chain-of-thought prompt template; the labels are used for representing the categories corresponding to the objects included in the scene image, and the preset prompt template includes the preset chain-of-thought prompt template.
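The label-tagging step combined with a chain-of-thought prompt template can be sketched as follows. The tag syntax (`<category>object</category>`) and the template text are illustrative assumptions; the patent specifies only that labels mark each object's category and that the template elicits step-by-step reasoning.

```python
# Hypothetical sketch: tag each scene object with its category, then embed
# the tagged second text in a chain-of-thought prompt template.
def tag_second_text(objects: list[tuple[str, str]]) -> str:
    """objects: (name, category) pairs -> labeled scene description."""
    return "; ".join(f"<{cat}>{name}</{cat}>" for name, cat in objects)

def cot_prompt(first_text: str, tagged_second_text: str) -> str:
    """Build a chain-of-thought prompt from the first and labeled second texts."""
    return (
        f"Scene (objects tagged by category): {tagged_second_text}\n"
        f"User: {first_text}\n"
        "Let's think step by step about which tagged objects matter, "
        "then answer."
    )

tagged = tag_second_text([("cup", "tableware"), ("chair", "furniture")])
print(cot_prompt("where is the cup", tagged))
```

The category tags give the model an explicit handle for each object, so the "think step by step" instruction can reference them during reasoning.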
In an alternative embodiment, the obtaining module 32 may specifically be configured to: align the image visual information in the second text with the voice information by using a multimodal model, generate a third text associated with the voice signal, and obtain the prompt text corresponding to the first text and the third text based on a preset chain-of-thought prompt template; the multimodal model is used to understand the semantics of the second text and to establish an association between the image visual information and the voice information, so as to obtain the third text associated with the voice information.
In an alternative embodiment, the obtaining module 32 may specifically be configured to: obtain historical dialogue information, rewrite the prompt text based on the historical dialogue information to obtain a processed prompt text, and input the processed prompt text into the large language model to obtain the dialogue content corresponding to the processed prompt text.
The device shown in fig. 7 may execute the steps performed by the cloud server in the robot dialogue method of the foregoing embodiment; for the detailed execution process and technical effects, reference is made to the descriptions in the foregoing embodiment, which are not repeated here.
The embodiment of the present invention further provides a cloud server which, as shown in fig. 8, may include: a second processor 41, a second memory 42, and a second communication interface 43. The second memory 42 stores executable code which, when executed by the second processor 41, causes the second processor 41 to implement the robot dialogue method of the previous embodiments.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to at least implement the method performed by the robot in the robot dialogue method provided in the previous embodiments.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented with the addition of a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on this understanding, the foregoing technical solutions, in essence or in the portions contributing to the prior art, may be embodied in the form of a computer program product, which may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A robotic dialogue system comprising:
the robot and the cloud server;
the robot is used for collecting a voice signal of a speaker and a scene image of the position of the speaker, determining a first text corresponding to the voice signal and a second text corresponding to the scene image, and sending the first text and the second text to the cloud server; the second text is used for describing the image content of the scene image;
the cloud server is used for acquiring a prompt text corresponding to the first text and the second text based on a preset prompt template, inputting the prompt text into a large language model to obtain dialogue content corresponding to the prompt text, and sending the dialogue content to the robot so that the robot performs dialogue interaction with the speaker based on the dialogue content; the preset prompt template is used for defining the content format for describing the first text and the second text, and the prompt text is used for guiding the large language model to perform reasoning.
2. The system according to claim 1, characterized in that the robot is specifically adapted to: and performing text conversion processing on the voice signal by utilizing a voice recognition model so as to obtain a first text corresponding to the voice signal.
3. The system according to claim 1, characterized in that the robot is specifically adapted to: and carrying out semantic segmentation processing on the scene image by using an image semantic segmentation model to obtain semantic information corresponding to each object in the scene image and position information corresponding to each object, and determining a second text corresponding to the scene image based on the semantic information and the position information.
4. The system of claim 1, wherein the cloud server is configured to: add a label to the second text to obtain a labeled second text, and obtain the prompt text corresponding to the first text and the labeled second text based on a preset chain-of-thought prompt template; the label is used for representing the category corresponding to each object included in the scene image, and the preset prompt template comprises the preset chain-of-thought prompt template.
5. The system of claim 1, wherein the cloud server is configured to: align the image visual information in the second text with the voice information by using a multimodal model, generate a third text associated with the voice signal, and obtain the prompt text corresponding to the first text and the third text based on a preset chain-of-thought prompt template; the multimodal model is used for understanding the semantics of the second text and establishing an association between the image visual information and the voice information, so as to obtain the third text associated with the voice information.
6. The system of claim 1, wherein the cloud server is configured to: and obtaining historical dialogue information, carrying out rewriting processing on the prompt text based on the historical dialogue information to obtain a processed prompt text, and inputting the processed prompt text into a large language model to obtain dialogue content corresponding to the processed prompt text.
7. A robot dialogue method, applied to a robot in a robot dialogue system, the method comprising:
collecting a voice signal of a speaker and a scene image of the position of the speaker, wherein the voice signal comprises the voice of the speaker;
determining a first text corresponding to the voice signal;
determining a second text corresponding to the scene image, wherein the second text is used for describing the image content of the scene image;
sending the first text and the second text to a cloud server, so that the cloud server obtains a prompt text corresponding to the first text and the second text based on a preset prompt template, and inputs the prompt text to a large language model to obtain dialogue content corresponding to the prompt text; the preset prompt template is used for defining the content format for describing the first text and the second text, and the prompt text is used for guiding the large language model to perform reasoning;
And receiving the dialogue content sent by the cloud server, and performing dialogue interaction with the speaker based on the dialogue content.
8. A robot dialogue method, applied to a cloud server in a robot dialogue system, the method comprising:
receiving a first text and a second text sent by a robot, wherein the robot collects a voice signal of a speaker and a scene image of the position of the speaker, determines the first text corresponding to the voice signal and the second text corresponding to the scene image, and sends the first text and the second text to the cloud server; the second text is used for describing the image content of the scene image;
acquiring a prompt text corresponding to the first text and the second text based on a preset prompt template, wherein the preset prompt template is used for defining the content format for describing the first text and the second text, and the prompt text is used for guiding a large language model to perform reasoning;
and inputting the prompt text to the large language model, obtaining dialogue content corresponding to the prompt text, and sending the dialogue content to the robot so that the robot performs dialogue interaction with the speaker based on the dialogue content.
9. A robot, comprising: a memory, a processor, and a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the robot dialogue method as claimed in claim 7.
10. A cloud server, comprising: a memory, a processor, and a communication interface; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the robot dialogue method as claimed in claim 8.
11. A non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the robot dialogue method of claim 7 or 8.
CN202311477336.2A 2023-11-07 2023-11-07 Robot dialogue method, system, robot and storage medium Pending CN117636874A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311477336.2A CN117636874A (en) 2023-11-07 2023-11-07 Robot dialogue method, system, robot and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311477336.2A CN117636874A (en) 2023-11-07 2023-11-07 Robot dialogue method, system, robot and storage medium

Publications (1)

Publication Number Publication Date
CN117636874A true CN117636874A (en) 2024-03-01

Family

ID=90031235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311477336.2A Pending CN117636874A (en) 2023-11-07 2023-11-07 Robot dialogue method, system, robot and storage medium

Country Status (1)

Country Link
CN (1) CN117636874A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118098243A (en) * 2024-04-26 2024-05-28 深译信息科技(珠海)有限公司 Audio conversion method and device and related equipment


Similar Documents

Publication Publication Date Title
CN110390108B (en) Task type interaction method and system based on deep reinforcement learning
CN111930940A (en) Text emotion classification method and device, electronic equipment and storage medium
CN117636874A (en) Robot dialogue method, system, robot and storage medium
CN110610698B (en) Voice labeling method and device
CN113191143B (en) Multi-tone word disambiguation and rhythm control combined method and system and electronic equipment
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
EP3540611A1 (en) Electronic device for performing translation by sharing context of utterance and operation method therefor
KR20220011083A (en) Information processing method, device, electronic equipment and storage medium in user dialogue
CN113326367B (en) Task type dialogue method and system based on end-to-end text generation
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN117275476A (en) Digital person interaction method and device, electronic equipment and storage medium
CN117524202A (en) Voice data retrieval method and system for IP telephone
CN109002498B (en) Man-machine conversation method, device, equipment and storage medium
CN114490922A (en) Natural language understanding model training method and device
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
CN116975336A (en) Image processing method, device, equipment and storage medium based on artificial intelligence
CN116009692A (en) Virtual character interaction strategy determination method and device
CN114822598A (en) Server and speech emotion recognition method
CN115222857A (en) Method, apparatus, electronic device and computer readable medium for generating avatar
CN113674745A (en) Voice recognition method and device
CN117059082B (en) Outbound call conversation method, device, medium and computer equipment based on large model
CN112131878B (en) Text processing method and device and computer equipment
CN117523046A (en) Method and device for generating mouth-shaped animation, electronic equipment and storage medium
CN118312593A (en) Artificial intelligent interaction method and device based on multiple analysis models
CN117831641A (en) Pathogenic microorganism information system supporting intelligent dialogue and inquiry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination