CN118197310A - Voice interaction method, device, equipment and storage medium based on large language model - Google Patents

Voice interaction method, device, equipment and storage medium based on large language model

Info

Publication number
CN118197310A
CN118197310A (application CN202410484169.2A)
Authority
CN
China
Prior art keywords
reply
task
content
language model
voice interaction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410484169.2A
Other languages
Chinese (zh)
Inventor
单良
柴春雷
李欣语
葛志超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liangsheng Digital Creative Design Hangzhou Co ltd
Original Assignee
Liangsheng Digital Creative Design Hangzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liangsheng Digital Creative Design Hangzhou Co ltd filed Critical Liangsheng Digital Creative Design Hangzhou Co ltd
Priority to CN202410484169.2A priority Critical patent/CN118197310A/en
Publication of CN118197310A publication Critical patent/CN118197310A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a voice interaction method based on a large language model, which comprises the following steps: generating at least one reply task in response to at least one pending question entered by a user; adding the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the reply task generated for it; and executing the reply task in the reply queue through preemptive scheduling and generating reply content. Embodiments of the present invention combine a large language model to optimize voice interaction management and voice generation strategies. The invention also relates to a device, an apparatus and a storage medium.

Description

Voice interaction method, device, equipment and storage medium based on large language model
Technical Field
The invention relates to the technical fields of artificial intelligence, large language models and automatic speech recognition, and in particular to a voice interaction method, device and equipment based on a large language model, and a storage medium.
Background
For various application scenarios, the use of large language models is receiving more and more attention, for example in the digital tour guide scenario. In this scenario, a dialogue robot based on a large language model cannot give proactive explanations in the way a real tour guide does, and does not cope well with situations such as an interrupted dialogue. In addition, existing digital tour guide software only stores the explanation information of each scenic spot and requires the user to tap the software interface to view or listen to it manually; unlike a real tour guide, who occupies the user's hearing while leaving the eyes free, this imposes a visual cognitive load on the user and harms the sense of immersion in the scenic spot.
Disclosure of Invention
To address the above shortcomings, according to a first aspect of the present invention, there is provided a voice interaction method based on a large language model, comprising: generating at least one reply task in response to at least one pending question entered by a user; adding the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the reply task generated for it; and executing the reply task in the reply queue through preemptive scheduling and generating reply content.
Optionally, the executing the reply task in the reply queue through preemptive scheduling and generating reply content further includes: generating the reply content using a plurality of generation threads.
Optionally, the adding the reply task to the reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task, further includes: adding a reply task to the reply queue when it is initially generated.
Optionally, the generating at least one reply task in response to at least one pending question entered by the user further includes: generating a later reply task in response to a later pending question entered by the user while, from a first time point, continuing to execute the earlier reply task and generating the earlier reply content up to a first suspension point, and then generating the later reply content, ending with fixed prompt content.
Optionally, in response to the user's latest pending question regarding the fixed prompt content, the earlier reply task is backtracked from the first suspension point to a first recovery point to continue generating the unfinished earlier reply content, or a latest reply task is generated.
Optionally, the first suspension point is set as the first pause point found in the text after the first time point, the pause point being characterized by one of a comma, a semicolon, a period, a question mark or an exclamation mark in a sentence; the first recovery point is set as the first end point found in the text before the first time point, the end point being characterized by one of the beginning of the content, a period, a question mark or an exclamation mark in a sentence.
Optionally, the adding the reply task to the reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task, further includes: setting the reply task with the lowest priority as a first reply task; acquiring the address information of each target and the fixed reply content corresponding to each target, querying the user's address at a fixed frequency and comparing it with the address information of the targets, building a new prompt according to the comparison result, querying a vector knowledge base, and generating the first reply task with the large language model.
Optionally, the length of the reply content is set to be smaller than n, wherein: if n is characterized by the number of sentences, n is 5, 6, 7 or 10; if n is characterized by the word count of the sentence, then n is 100, 120, 140, or 200.
Optionally, the querying the user's address at a fixed frequency and comparing it with the address information of the targets, after acquiring the address information of each target and the fixed reply content corresponding to each target, further includes: requesting with the key and ip parameters using the GET method, obtaining the returned rectangular coordinates, and taking the center point of the rectangle as the user's address.
Optionally, the new prompt includes at least: a system prompt, a history interaction record, and the fixed reply content corresponding to the target.
According to a second aspect of the present invention, there is provided a voice interaction device based on a large language model, comprising: a generation module, configured to generate at least one reply task in response to at least one pending question entered by a user; a sequencing module, configured to add the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task; and an execution module, configured to execute the reply task in the reply queue through preemptive scheduling and generate reply content.
According to a third aspect of the present invention, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to a fourth aspect of the present invention there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to a fifth aspect of the present invention there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the above method.
Embodiments of the present invention combine a large language model to optimize voice interaction management and voice generation strategies. The embodiments of the present invention provide a digital human tour guide that differs from the information-system mode and establish a reasonable dialogue management mechanism and a generation strategy for reply content, so that the digital human tour guide combines the advantages of a real tour guide, brings an immersive tour experience to the user, and improves the user's experience when using the digital human tour guide.
Drawings
In the following, by way of example, the drawings of exemplary embodiments of the invention are shown, the same or similar reference numbers being used in the various drawings to designate the same or similar elements. In the accompanying drawings:
FIG. 1 illustrates a flowchart of a large language model based voice interaction method according to an exemplary embodiment of the present invention.
FIG. 2 shows a flow chart of a large language model based voice interaction method in accordance with another embodiment of the present invention.
Fig. 3 shows a schematic structural diagram of a large language model-based voice interaction device according to an exemplary embodiment of the present invention.
Fig. 4 shows a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
In the present disclosure, the term "and/or" is intended to cover all possible combinations and subcombinations of the listed elements, including any single listed element, any subcombination, or all of the elements, without necessarily excluding other elements. Unless otherwise indicated, the terms "first," "second," and the like are used to describe various elements and are not intended to limit the positional, timing, or importance relationship of these elements, but are merely used to distinguish one element from another. Unless otherwise indicated, the terms "front, rear, upper, lower, left, right" and the like are generally based on the orientation or positional relationship shown in the drawings, are merely for convenience and simplicity of description, and are not to be construed as limiting the scope of the invention.
Large language model (LLM, Large Language Model): in some embodiments of the present invention, large language models are deep learning based models that are capable of understanding, generating and summarizing a variety of natural language text. These models typically learn the statistical rules of the language by pre-training on a large text dataset, and can then be fine-tuned on specific tasks to improve performance. The GPT (Generative Pre-trained Transformer) family of models is a typical representative of large language models. In some embodiments of the present invention, an LLM is typically used in conjunction with a vector knowledge base (a method for storing and retrieving knowledge represented as vectors in a high-dimensional space), which provides accurate knowledge retrieval and reasoning capabilities while the LLM generates rich text content.
Automatic speech recognition (ASR, Automatic Speech Recognition): in some embodiments of the present invention, automatic speech recognition is a technology that converts human speech into text and is widely used in a variety of scenarios such as voice assistants and voice-to-text transcription services. ASR technology enables a machine to understand and process human voice instructions, greatly improving the naturalness and convenience of human-computer interaction. Current ASR techniques are mainly based on deep neural networks and offer high accuracy and low latency.
Text To Speech (TTS): in some embodiments of the invention, the TTS technology is an artificial intelligence technology that converts written text into spoken language. TTS systems enable computers to read text in human voice, and are widely used in the fields of navigation systems, voice assistants, and the like. With the development of deep learning technology, modern TTS systems are capable of generating increasingly natural and realistic human voices.
Global positioning system (GPS, Global Positioning System): in some embodiments of the invention, GPS is a global navigation satellite system capable of providing geographic location and time information to users worldwide. The distance to each satellite is calculated by measuring the propagation time of the signal from the satellite to the receiver; these distances and the position information of the satellites are then used to calculate the three-dimensional position (latitude, longitude and altitude) of the receiver by triangulation. In some embodiments of the present invention, the referenced Amap (Gaode) navigation performs positioning using GPS technology.
FIG. 1 illustrates a flowchart of a large language model based voice interaction method according to an exemplary embodiment of the present invention.
S102: at least one reply task is generated in response to at least one pending issue entered by the user.
S104: and adding the reply task into a reply queue according to the priority according to the principle that the generated reply task has lower priority as the time for responding to the to-be-processed problem is earlier.
S106: and executing the reply task in the reply queue through preemptive scheduling and generating reply content.
For a plurality of pending questions that are ordered in time, two adjacent pending questions are respectively called the earlier pending question and the later pending question. By comparing two adjacent pending questions, the method can be extended to a plurality of pending questions, which include the latest pending question. Similarly, in time order there are also an earlier reply task and earlier reply content, a later reply task and later reply content, and a latest reply task and latest reply content.
FIG. 2 illustrates a flowchart of a large language model based voice interaction method according to another exemplary embodiment of the present invention.
S202: at least one reply task is generated in response to at least one pending issue entered by the user.
S204: and adding the reply task into a reply queue according to the priority according to the principle that the generated reply task has lower priority as the time for responding to the to-be-processed problem is earlier.
S206: and generating a later reply task in response to the later pending problem input by the user, simultaneously continuing to execute the earlier reply task at a first time point and generating the earlier reply content to a first middle stop, and then generating the later reply content and ending with the fixed prompt content.
S208: and executing the reply task in the reply queue through preemptive scheduling and generating reply content.
Further, in response to the user's latest pending question regarding the fixed prompt content, either a latest reply task is generated, or the earlier reply task is backtracked from the first suspension point to the first recovery point to continue generating the unfinished earlier reply content.
In some embodiments, the first suspension point is the first pause point found in the text after the first time point, the pause point being characterized by one of a comma, a semicolon, a period, a question mark, or an exclamation mark in the sentence. It should be understood that if the earlier reply content happens to stop exactly at the first time point, that stop is itself the first suspension point; in other words, if, when the later pending question is detected and responded to, the earlier reply content has just reached a pause (corresponding to a comma, semicolon, etc.) or has just completed an expression (corresponding to a period, etc.), generation of the earlier reply content stops at the first time point and the later reply content is generated directly.
In some embodiments, the fixed prompt content may be a prompt that requires only a simple yes-or-no reply, or a self-asked-and-answered prompt statement (requiring no user interaction). The latest pending question regarding the fixed prompt content may be: "yes", "no", no answer, or a newly raised pending question. The response to the latest pending question may be: backtracking the earlier reply task from the first suspension point to the first recovery point to continue generating the unfinished earlier reply content, or generating a latest reply task and latest reply content, for example a reply driven by a different fixed prompt and produced by the large language model as the latest reply content.
In some embodiments, the first recovery point is the first end point found in the text before the first time point, the end point being characterized by one of the beginning of the content, a period, a question mark, an exclamation mark, or the like in the sentence. It should be understood that if the beginning of the earlier reply content is reached before a first end point (a period, etc.) is found, or if the first end point is the beginning itself, the earlier reply content is regenerated from its beginning.
Based on a combination of the two specific embodiments above, the following situation exists: the earlier reply content is generated from its beginning, the later reply content is then generated directly (the remaining part of the earlier reply content is not generated for the moment), and afterwards the earlier reply content is again generated from its beginning.
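As an illustration of the point search described above, here is a minimal Python sketch (the function name, punctuation sets and index convention are assumptions made for this sketch, not part of the claimed method): given the text generated so far and the character index reached at the first time point, the suspension point is the first pause mark at or after that index, and the recovery point is the position just after the first sentence-ending mark before that index (or the beginning of the content).

```python
# Illustrative sketch: locate the first suspension point and first recovery point
# around the character index reached at the first time point (t1_index).
PAUSE_MARKS = set(",;.?!，；。？！")   # characterize the suspension (pause) point
END_MARKS = set(".?!。？！")           # characterize the recovery (end) point

def find_points(text: str, t1_index: int) -> tuple[int, int]:
    # Suspension point: first pause mark at or after t1_index; if none is
    # found, reading simply stops at the end of the generated text.
    suspension = next(
        (i for i in range(t1_index, len(text)) if text[i] in PAUSE_MARKS),
        len(text) - 1,
    )
    # Recovery point: just after the first end mark strictly before t1_index;
    # if no end mark exists, resume from the beginning of the content.
    recovery = 0
    for i in range(t1_index - 1, -1, -1):
        if text[i] in END_MARKS:
            recovery = i + 1
            break
    return suspension, recovery
```

On resumption, reading restarts at the recovery index, so the interrupted sentence is replayed from its beginning.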
It should be appreciated that, in some embodiments, depending on the content (for example, other kinds of language features, such as English), there may also be, according to the understanding of the large language model, a second suspension point with a corresponding second pause point, and a second recovery point with a corresponding second end point.
In some embodiments, a reply task is added to the reply queue as soon as it starts to be generated. In some embodiments, since generating reply content takes time, multiple generation threads may be used so that generating a reply task and its reply content in response to a later pending question does not distort the reply content; at the same time, the generation threads for active interaction (mentioned later) and for passive reply content are kept separate, so that the time point for active interaction is not missed.
In some embodiments, the reply task with the lowest priority is set as a first reply task: the address information of each target and the fixed reply content corresponding to each target are acquired, the user's address is queried at a fixed frequency and compared with the address information of the targets, a new prompt is built according to the comparison result, a vector knowledge base is queried, and the large language model is used to generate the first reply task. Further, the length of the reply content is set to be smaller than n; according to the large language model, if n is characterized by the number of sentences, n is 5, 6, 7 or 10; if n is characterized by the word count of a sentence, n is 100, 120, 140 or 200. Further, the GET method is used to request the relevant parameters, the returned rectangular coordinates are obtained, and the center point of the rectangle is taken as the user's address.
According to the above method, the following exemplary embodiments are specified.
In the digital tour guide application scenario, the problem of the digital tour guide being interrupted is addressed. For example, when the digital tour guide is reading out reply content and detects at a certain time point that the user interrupts by voice (or gesture), the digital tour guide stops reading at the first pause (comma/period) after that time point, then replies to the interrupting question, and appends the query "Do you need me to continue (the reply that was not finished just now)?"; the following three cases are then distinguished (a small routing sketch follows this list):
If the user replies "yes" (indicating agreement), the digital tour guide resumes the earlier reply from the first period (or the beginning of the content) found before the suspension point;
If the user replies "no" (indicating disagreement), the digital tour guide abandons the earlier reply;
If the content of the user's reply contains a new pending question (a question requiring specific reply content), the digital tour guide generates the latest reply content with the large language model.
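The branching above can be summarized in a minimal routing sketch (illustrative only; contains_new_question is a hypothetical intent check, for example backed by the large language model, and is not defined by this text):

```python
# Illustrative routing of the user's answer to "Do you need me to continue?".
# contains_new_question() is a hypothetical helper; the accepted affirmative
# words are assumptions made for this sketch.
def route_follow_up(user_reply: str, contains_new_question) -> str:
    if contains_new_question(user_reply):
        return "generate_latest_reply"    # answer the new pending question
    if user_reply.strip().lower() in {"yes", "ok", "sure"}:
        return "resume_earlier_reply"     # resume from the first recovery point
    return "abandon_earlier_reply"        # "no" or no answer: drop the old reply
```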
It should be noted that to implement the above application, a processing method of "priority preemption scheduling" may be employed.
First, two interaction modes are defined for the digital tour guide scenario:
1. Active interaction (first reply task): according to the user's current position, the digital tour guide triggers interactive content that matches the vector knowledge base and proactively initiates a conversation (first reply content).
2. Passive reply: the digital tour guide replies passively according to the user's pending question.
Both interaction modes are set as reply tasks. Whenever a reply task is generated, it is added to the reply queue, and the task is executed by scheduling (that is, TTS is performed and the generated content is read out). The priority of active interaction is always 1 (the lowest priority); passive replies are assigned priorities in time order, i.e. the earlier a reply task is generated, the lower its priority. Therefore, when a reply task with a higher priority joins the reply queue, preemptive scheduling is used and that reply task is executed immediately. In some example implementations, a stable sort may be performed when a reply task joins the queue; if the reply task is first after sorting (i.e. has the highest priority), it is executed preemptively; otherwise, it waits at its sorted position.
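A minimal sketch of this queueing rule, assuming Python and a simple in-memory queue (the class and method names are illustrative and not mandated by the text): active interaction always carries priority 1, passive replies receive monotonically increasing priorities, and a newly enqueued task preempts only if it sorts to the head of the queue.

```python
# Illustrative sketch of the reply queue with priority preemption.
# A larger number means a higher priority; active interaction is always 1.
from dataclasses import dataclass

@dataclass
class ReplyTask:
    priority: int
    content: str

class ReplyQueue:
    def __init__(self):
        self._tasks: list[ReplyTask] = []
        self._next_passive_priority = 2   # passive replies always outrank active ones

    def add_active(self, content: str) -> ReplyTask:
        return self._enqueue(ReplyTask(1, content))

    def add_passive(self, content: str) -> ReplyTask:
        task = ReplyTask(self._next_passive_priority, content)
        self._next_passive_priority += 1  # later question -> higher priority
        return self._enqueue(task)

    def _enqueue(self, task: ReplyTask) -> ReplyTask:
        self._tasks.append(task)
        # Stable sort, highest priority first; if the new task now sits at the
        # head, it preempts the task currently being read out.
        self._tasks.sort(key=lambda t: t.priority, reverse=True)
        if self._tasks[0] is task:
            print("preempt: execute", task.content[:20], "...")
        return task
```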
Specifically, suppose that the old reply content (the earlier reply content) has not yet been completely read out (generated) and a new input from the user is detected at time T1. The old reply content continues to be read, and the first comma/period found after time T1 (the first suspension point) is taken as the "reading suspension point". When the digital human reads up to this point, reading of the old reply content stops. At the same time, the old reply thread continues to generate the remaining content and takes the first period found before time T1 as the first recovery point (an end point rather than a pause point; if the beginning of the content is reached before a period is found, the beginning of the content is used).
When the user finishes speaking, the new reply thread generates the new reply content; because it has a higher priority, the new reply task is added to the reply queue as soon as generation starts, and is then preemptively scheduled so that reading begins. After the new reply content has been read out (generated), the head of the queue is the unfinished old reply task, and reading of the old reply content continues from the first recovery point.
Furthermore, how is it determined whether a reply task is generated on the new reply thread or the old reply thread? When neither the old thread nor the new thread is generating a reply task, the reply task is assigned to the old reply thread; if the old reply thread already has a task, it is assigned to the new reply thread. In practice, a thread pool is created before the tasks are needed, two always-running threads are defined in the pool, and one of them is automatically allocated whenever a new reply task arrives.
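A sketch of this two-thread arrangement, assuming Python's concurrent.futures (the slot bookkeeping is an illustration, not the only possible implementation):

```python
# Illustrative two-slot thread pool for old/new reply generation.
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)   # two always-available worker slots
old_future = None
new_future = None

def submit_reply_task(generate_fn, *args):
    """Assign the task to the 'old' slot if it is free, otherwise to the 'new' slot."""
    global old_future, new_future
    if old_future is None or old_future.done():
        old_future = executor.submit(generate_fn, *args)
        return old_future
    new_future = executor.submit(generate_fn, *args)
    return new_future
```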
It should be understood that, in the embodiments of the present invention, the delay of streaming speech synthesis is not considered, so reply content generation and speech synthesis are treated as synchronous.
Before the reply generation of the digital tour guide is described, the following settings are made:
1. A vector knowledge base of the scenic spots is built, storing the introduction reply content of each scenic spot. At the same time, the longitude and latitude information of each scenic spot is stored in a database.
2. Based on the features of the scenic area, the character of the digital tour guide is given a basic setting, i.e. a character description is added to the system prompt when the large language model is used.
3. Because an actual conversation should not be overly lengthy, an instruction is added to the system prompt limiting the length of each reply (see above).
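For illustration, the character setting of item 2 and the length limit of item 3 might be combined into a single system prompt (a sketch; the wording and the value of n are assumptions):

```python
# Illustrative system prompt combining the guide persona and the reply-length
# limit (n sentences, as described above); the wording is an assumption.
N_SENTENCES = 5

SYSTEM_PROMPT = (
    "You are a digital tour guide for this scenic area. "
    "Speak warmly and concisely, like an experienced human guide. "
    f"Keep every single reply shorter than {N_SENTENCES} sentences."
)
```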
Further, in addition to the above settings, the digital tour guide needs to work with, for example, the Amap (Gaode) navigation API to determine the user's position in real time and assist in generating active interaction content. This is set up as follows:
First, the user's longitude and latitude position is queried at a fixed time interval t. The method is as follows:
The GET method is used to request the URL https://restapi.amap.com/v3/ip with the following request parameters: key: the Web service API key applied for on the Amap (Gaode) open platform; ip: the user's current IP address. The returned rectangle (for example, "116.0119343, 39.661127144; 116.7829335, 40.2164962") is obtained, and the center point of the rectangle is taken as the current longitude and latitude position.
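A hedged sketch of this location query (assuming the JSON response carries the rectangle in a field named rectangle, formatted as "lng1,lat1;lng2,lat2"; the key value is a placeholder):

```python
# Illustrative query of the user's approximate position via the Amap IP location API.
# AMAP_KEY is a placeholder; the "rectangle" response field is an assumption
# based on the returned value shown above.
import requests

AMAP_KEY = "your-web-service-api-key"

def locate_user(ip: str) -> tuple[float, float]:
    resp = requests.get(
        "https://restapi.amap.com/v3/ip",
        params={"key": AMAP_KEY, "ip": ip},
        timeout=5,
    )
    corner1, corner2 = resp.json()["rectangle"].split(";")
    lng1, lat1 = map(float, corner1.split(","))
    lng2, lat2 = map(float, corner2.split(","))
    # Take the centre of the rectangle as the user's current position.
    return (lng1 + lng2) / 2.0, (lat1 + lat2) / 2.0
```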
Second, the longitude and latitude ranges of all scenic spots are traversed to check whether the user is located inside a scenic spot: if so, and the flag of that scenic spot is 0, the flag is set to 1 (ensuring that the interaction is not initiated repeatedly).
Finally, if a scenic spot's flag was set to 1 in the previous step, the system prompt, the history chat record and a trigger phrase (for example, one stating which scenic spot the user is currently located at and asking the guide to introduce or recommend it) are combined into a new prompt, the vector knowledge base is queried, and the large language model generates the first reply content, i.e. the active interaction is initiated.
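A sketch of this geofence check and prompt assembly (the scenic-spot record fields, the chat-message format and the trigger wording are assumptions made for this sketch):

```python
# Illustrative geofence check and prompt assembly for active interaction.
def build_active_prompt(lng, lat, spots, system_prompt, history):
    for spot in spots:
        inside = (spot["lng_min"] <= lng <= spot["lng_max"]
                  and spot["lat_min"] <= lat <= spot["lat_max"])
        if inside and spot["flag"] == 0:
            spot["flag"] = 1   # never initiate the same active interaction twice
            trigger = (f"The user has just arrived at {spot['name']}. "
                       "Please introduce it proactively.")
            # New prompt = system prompt + history + the spot's fixed reply content + trigger
            return ([{"role": "system", "content": system_prompt}]
                    + list(history)
                    + [{"role": "system", "content": spot["fixed_reply"]},
                       {"role": "user", "content": trigger}])
    return None   # the user is not inside any not-yet-visited scenic spot
```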
The digital tour guide system stores the user's latest longitude and latitude and whether the user is located at a certain scenic spot. In addition, for passive replies, each question from the user is also used to query the vector knowledge base.
It should be appreciated that in the exemplary embodiments of the present invention the speech synthesis delay is not considered; that is, content is read out as soon as it is generated.
Fig. 3 shows a schematic diagram of the framework of a voice interaction device based on a large language model according to an exemplary embodiment of the present invention. As shown in Fig. 3, an embodiment of the present invention further provides a voice interaction device 300 based on a large language model, comprising:
a generation module 302, configured to generate at least one reply task in response to at least one pending question entered by a user;
a sequencing module 304, configured to add the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task; and
an execution module 306, configured to execute the reply task in the reply queue through preemptive scheduling and generate reply content.
For the device embodiment, reference may be made to the above method embodiment; details are not repeated here.
According to embodiments of the present invention, the present invention also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present invention, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method according to any one of the above.
According to an embodiment of the present invention, there is provided a computer program product comprising: a computer program stored in a readable storage medium, from which at least one processor of an electronic device can read, the at least one processor executing the computer program causing the electronic device to perform the solution provided by any one of the embodiments described above.
The present invention also provides an electronic device according to an embodiment of the present invention, and fig. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in the ROM 402 or a computer program loaded from a storage unit into the RAM 403. In the RAM 403, various programs and data required for the operation of the device 400 may also be stored. The computing unit 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An I/O interface 405 is also connected to the bus 404.
Various components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the respective methods and processes described above, such as the voice interaction method. For example, in some embodiments, the voice interaction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the voice interaction method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the voice interaction method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD monitor) for displaying information to the user; and a keyboard and pointing device through which the user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user may be any form of sensory feedback, and input from the user may be received in any form.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component, or that includes a front-end component, or that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility of traditional physical hosts and VPS services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
The embodiments of the present invention combine a large language model to optimize voice interaction management and voice generation strategies. The embodiments provide a digital human tour guide that differs from the information-system mode and establish a reasonable dialogue management mechanism and a generation strategy for reply content, so that the digital human tour guide combines the advantages of a real tour guide, brings an immersive tour experience to the user, and improves the user's experience when using the digital human tour guide.
It will be understood that the application has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the application. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the application without departing from the essential scope thereof. Therefore, it is intended that the application not be limited to the particular embodiment disclosed, but that the application will include all embodiments falling within the scope of the appended claims.

Claims (14)

1. A large language model based voice interaction method, comprising:
generating at least one reply task in response to at least one pending question entered by a user;
adding the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the reply task generated for it; and
executing the reply task in the reply queue through preemptive scheduling and generating reply content.
2. The large language model based voice interaction method of claim 1, wherein the executing the reply task in the reply queue through preemptive scheduling and generating reply content further comprises: generating the reply content using a plurality of generation threads.
3. The large language model based voice interaction method of claim 2, wherein the adding the reply task to the reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task, further comprises: adding a reply task to the reply queue when it is initially generated.
4. The large language model based voice interaction method of claim 3, wherein the generating at least one reply task in response to at least one pending question entered by a user further comprises: generating a later reply task in response to a later pending question entered by the user while, from a first time point, continuing to execute the earlier reply task and generating the earlier reply content up to a first suspension point, and then generating the later reply content, ending with fixed prompt content.
5. The large language model based voice interaction method of claim 4, wherein, in response to the user's latest pending question regarding the fixed prompt content, the earlier reply task is backtracked from the first suspension point to a first recovery point to continue generating the unfinished earlier reply content, or a latest reply task is generated.
6. The large language model based voice interaction method of claim 5, wherein the first suspension point is set as the first pause point found in the text after the first time point, the pause point being characterized by one of a comma, a semicolon, a period, a question mark or an exclamation mark in a sentence; and the first recovery point is set as the first end point found in the text before the first time point, the end point being characterized by one of the beginning of the content, a period, a question mark or an exclamation mark in a sentence.
7. The large language model based voice interaction method of claim 1, wherein the adding the reply task to the reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task, further comprises:
setting the reply task with the lowest priority as a first reply task; acquiring the address information of each target and the fixed reply content corresponding to each target, querying the user's address at a fixed frequency and comparing it with the address information of the targets, building a new prompt according to the comparison result, querying a vector knowledge base, and generating the first reply task with the large language model.
8. The large language model based voice interaction method of claim 1, wherein the length of the reply content is set to be less than n, wherein:
If n is characterized by the number of sentences, n is 5, 6, 7 or 10;
if n is characterized by the word count of the sentence, then n is 100, 120, 140, or 200.
9. The large language model based voice interaction method of claim 8, wherein the querying the user's address at a fixed frequency and comparing it with the address information of the targets, after acquiring the address information of each target and the corresponding fixed reply content, further comprises:
requesting with the key and ip parameters using the GET method, obtaining the returned rectangular coordinates, and taking the center point of the rectangle as the user's address.
10. The large language model based voice interaction method of claim 9, wherein the new prompt comprises at least: a system prompt, a history interaction record, and the fixed reply content corresponding to the target.
11. A large language model based voice interaction device, comprising:
a generation module, configured to generate at least one reply task in response to at least one pending question entered by a user;
a sequencing module, configured to add the reply task to a reply queue according to its priority, following the principle that the earlier a pending question is responded to, the lower the priority of the generated reply task; and
an execution module, configured to execute the reply task in the reply queue through preemptive scheduling and generate reply content.
12. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-10.
13. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-10.
14. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-10.
CN202410484169.2A 2024-04-22 2024-04-22 Voice interaction method, device, equipment and storage medium based on large language model Pending CN118197310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410484169.2A CN118197310A (en) 2024-04-22 2024-04-22 Voice interaction method, device, equipment and storage medium based on large language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410484169.2A CN118197310A (en) 2024-04-22 2024-04-22 Voice interaction method, device, equipment and storage medium based on large language model

Publications (1)

Publication Number Publication Date
CN118197310A true CN118197310A (en) 2024-06-14

Family

ID=91409096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410484169.2A Pending CN118197310A (en) 2024-04-22 2024-04-22 Voice interaction method, device, equipment and storage medium based on large language model

Country Status (1)

Country Link
CN (1) CN118197310A (en)

Similar Documents

Publication Publication Date Title
JP7362827B2 (en) Automated assistant call for appropriate agent
KR102418511B1 (en) Creating and sending call requests to use third-party agents
US10755702B2 (en) Multiple parallel dialogs in smart phone applications
US10140977B1 (en) Generating additional training data for a natural language understanding engine
US20210134278A1 (en) Information processing device and information processing method
WO2021164244A1 (en) Voice interaction method and apparatus, device and computer storage medium
US11830482B2 (en) Method and apparatus for speech interaction, and computer storage medium
CN108877792B (en) Method, apparatus, electronic device and computer readable storage medium for processing voice conversations
CN112466302B (en) Voice interaction method and device, electronic equipment and storage medium
CN112487173A (en) Man-machine conversation method, device and storage medium
US11763813B2 (en) Methods and systems for reducing latency in automated assistant interactions
CN112307188B (en) Dialog generation method, system, electronic device and readable storage medium
US20230195998A1 (en) Sample generation method, model training method, trajectory recognition method, device, and medium
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN117112065B (en) Large model plug-in calling method, device, equipment and medium
CN118197310A (en) Voice interaction method, device, equipment and storage medium based on large language model
CN114299955B (en) Voice interaction method and device, electronic equipment and storage medium
CN114722171B (en) Multi-round dialogue processing method and device, electronic equipment and storage medium
CN113743127B (en) Task type dialogue method, device, electronic equipment and storage medium
US11948580B2 (en) Collaborative ranking of interpretations of spoken utterances
CN113114851B (en) Incoming call intelligent voice reply method and device, electronic equipment and storage medium
CN114078478B (en) Voice interaction method and device, electronic equipment and storage medium
US20240203423A1 (en) Collaborative ranking of interpretations of spoken utterances
CN113643696A (en) Voice processing method, device, equipment, storage medium and program
CN117093691A (en) System help method, device, equipment and storage medium based on large language model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination