CN118052222A - Method and device for generating multi-round dialogue data - Google Patents


Info

Publication number
CN118052222A
CN118052222A (application CN202410444653.2A)
Authority
CN
China
Prior art keywords
dialogue
text
data
voice
round
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410444653.2A
Other languages
Chinese (zh)
Inventor
林一侃
张晴晴
马光谦
罗磊
刘杰辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qingshu Intelligent Technology Co ltd
Original Assignee
Beijing Qingshu Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qingshu Intelligent Technology Co ltd
Priority to CN202410444653.2A
Publication of CN118052222A
Legal status: Pending (current)


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a method and a device for generating multi-round dialogue data, wherein the method comprises the following steps: performing structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text, the voice dialogue text being a transcription of natural conversational speech; performing anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks; and performing inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data. The embodiments of the present application combine a large language model with existing voice dialogue text to generate multi-round dialogue data, which improves the production efficiency of multi-round dialogue data and the authenticity and diversity of the dialogue content; the generated multi-round dialogue data has high language naturalness, and its coverage of domains, topics and discussion depth exceeds that of datasets produced by conventional methods.

Description

Method and device for generating multi-round dialogue data
Technical Field
The present application belongs to the field of computer technology, and in particular relates to a method and a device for generating multi-round dialogue data.
Background
In the field of natural language processing, building datasets for different tasks is an important undertaking. In ordinary natural language data, each sentence is often independent. In contrast, conversational data can be considered semi-structured, since its sentences are organized according to a certain structure (e.g., turns and speakers). Such data is generally used in fields related to human-machine dialogue, such as chatbots and voice assistants. When constructing a suitable conversational dataset, the required number of turns, role relationships, domains, styles, and so on all affect the difficulty of dataset construction.
In terms of turns, conversational data can be divided into two forms: single-round dialogue and multi-round dialogue. Single-round dialogue is the most basic form of human-machine dialogue data. Traditionally, a single round consists of the user posing a question and the machine answering it, one round at a time; the answer is generally organized into natural sentences based on entities drawn from a database or knowledge graph, and such datasets are domain-specific, for example dining recommendations, travel information consultation, or traffic information consultation. In recent years, single-round question-answer data with a more open domain has begun to appear, in which the content can be a task instruction described in natural language, such as judging the sentiment of a sentence, generating advertising copy, or explaining a concept, together with the target reply corresponding to the instruction. Creating multi-round dialogue is often more difficult because context consistency must be maintained. Multi-round dialogue datasets can be classified and described according to different criteria. From a data-production perspective, some dialogues involve a relatively closed knowledge domain, with relatively fixed content and language style, such as dialogues between a user and the customer service of a financial product. Other dialogues are more divergent, such as casual chat between two people; even given a topic such as healthcare, education, or music, the specific content and style of the dialogue will vary from person to person and be highly diverse.
In recent years, with the breakthroughs of large language models in natural language understanding and generation, it has become clear that far richer human-computer interaction applications are possible, not merely simple information question answering. Consequently, the demand for conversational data in academia and industry has become more varied, the requirements on naturalness, diversity and data volume have risen, and common conversational data production methods often cannot fully meet these demands.
A corpus production method is a production pipeline designed by developers based on existing resources, the current state of the technology, and actual requirements. At present, common dialogue corpus production methods include template/outline-based production, in which a certain dialogue scope is generated automatically from a determined domain and body of knowledge, in the form of a dialogue outline, sentence templates, or knowledge points. On this basis, the material is further expanded into dialogue corpus, by rule-based template filling and replacement, by neural-network model inference, or by having human writers compose dialogue sentences according to the outline, templates or knowledge points. In addition, crawler-based production methods crawl web data of a communicative nature, such as comments on social media, forum replies, and question-answer websites, and organize it into free-dialogue or question-answer data. Such data contains a large amount of invalid and defective content and therefore requires extensive cleaning and screening.
However, if free dialogue that is rich in domains, spoken in style and highly natural needs to be generated, the existing methods have obvious disadvantages. Methods based on templates, outlines and knowledge points draw on limited knowledge; the generated corpus tends toward a written style, the content is monotonous and stiff, and diversity is lacking. Such rule-based approaches also make it difficult to maintain context consistency in very long multi-round dialogues. Adding manual polishing can improve naturalness and creativity, but it significantly increases time and economic cost and severely limits production capacity. Moreover, maintaining such a production system requires continuously expanding the supported domains and keeping the domain knowledge up to date, which demands long-term investment from domain experts. Data crawled from forums and social media, on the other hand, contains a large amount of invalid and harmful content, is difficult to clean, and the resulting data often suffers from incoherent and inconsistent context. Furthermore, understanding a web conversation often requires contextual information, such as specific events, cultural background or personal experiences, which is usually taken for granted by the participants and not reflected in the web text; without such context, it is difficult to understand and make use of these conversations. Meanwhile, the language style and wording of such data differ markedly from everyday spoken dialogue.
Disclosure of Invention
The embodiments of the present application aim to provide a method and a device for generating multi-round dialogue data, so as to overcome the poor authenticity and naturalness of existing dialogue corpus production methods.
In order to solve the above technical problems, the present application is implemented as follows:
In a first aspect, a method for generating multi-round dialogue data is provided, comprising the following steps:
performing structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech;
performing anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks;
performing inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data;
wherein the large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
In a second aspect, a device for generating multi-round dialogue data is provided, comprising:
a parsing module, configured to perform structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech;
a processing module, configured to perform anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks;
an inference module, configured to perform inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data;
wherein the large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
The embodiments of the present application combine a large language model with existing voice dialogue text to generate multi-round dialogue data, which improves the production efficiency of multi-round dialogue data and the authenticity and diversity of the dialogue content; the generated multi-round dialogue data has high language naturalness, and its coverage of domains, topics and discussion depth exceeds that of datasets produced by conventional methods.
Drawings
FIG. 1 is a flowchart of a method for generating multi-round dialogue data according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a device for generating multi-round dialogue data according to an embodiment of the present application.
Detailed Description
The following describes the technical solutions in the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
Generally, conventional approaches to constructing natural language text corpora can be divided into the following modes: 1. collecting and cleaning highly digitized existing corpora; most text datasets built with crawlers fall into this category; 2. writing corpora manually according to requirements and a written specification; 3. having developers design templates or pipelines according to requirements and combine them with word lists or knowledge bases to achieve mechanized production. Some datasets use one of the above methods; others further apply deep learning to train a model that can optimize and expand machine-produced or crawled corpora. In recent years, with the breakthroughs of large language models, the industry's requirements on corpus scale and diversity have grown further, and people have begun to explore how to apply the understanding and generation capabilities of large language models to dataset construction.
The dialogue data in the embodiments of the present application can be regarded as a special kind of natural language text data. After it was recognized that large language models have excellent natural language understanding and generation capabilities, data production methods based on large language models began to appear. A common approach is to have the language model, guided by prompt engineering, generate single-round task instructions and the corresponding answers by itself. A large language model (LLM) is a language model consisting of an artificial neural network with a very large number of parameters (typically billions of weights or more), trained on large amounts of unlabeled text using self-supervised or semi-supervised learning. Production methods based on large language models are superior to template/outline-based and crawler-based methods in terms of productivity, and can produce a large number of dialogue rounds in a short time. However, if corpus generation is guided only by prompt engineering, some significant problems remain. First, the language reads like written text and feels heavily machine-generated. Second, the multi-round dialogues generated by the model are limited in information: they rely mainly on the model's prior knowledge, often only cover general and broad topics, lack diversity and creativity, and the reliability and accuracy of the information in the answers are hard to control. Under this paradigm, a common improvement is to provide a passage of text and ask the language model to write a dialogue around the given text. But this approach is still essentially autonomous dialogue generation, and the language style tends to be equally stiff. Moreover, if the given text material is structurally complex or contains highly specialized content, the language model may fail to adequately understand it and to generate accurate, coherent dialogue.
Given this state of the art, constructing large-scale, high-quality free-dialogue datasets remains one of the major challenges facing the industry. To solve this problem, the method of the embodiments of the present application uses existing stock data to convert the dialogue generation problem into a problem similar to text smoothing and text rewriting, and experiments show that this idea can effectively improve the quality of dialogue data.
Specifically, the data industry currently holds a relatively large inventory of automatic speech recognition (ASR) dialogue transcription text. Such corpora are transcriptions of natural conversational speech. Although they are also conversational data, compared with the fluent, complete dialogue data required by natural language processing tasks they contain a significant amount of disfluency. The disfluency includes errors caused by the recognition performance of the system, as well as language phenomena present in real speech, such as filler words, self-repetition and self-correction. In general, the speech recognition field performs text smoothing on the recognized text, and related conventional work focuses on training small, domain-specific models to detect and remove disfluency. Therefore, in order to improve the production of dialogue data and to further mine the value of speech recognition dialogue corpora, the embodiments of the present application design a dialogue data generation method that combines a large language model with the industry's existing stock data for large-scale dialogue data production. Compared with the methods above, it offers higher efficiency, higher language naturalness and more diverse corpus topics.
The method for generating multi-round dialogue data provided by the embodiment of the application is described in detail below through specific embodiments and application scenarios thereof with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of a method for generating multi-round dialogue data according to an embodiment of the present application is provided, where the method includes the following steps:
Step 101: perform structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech.
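As an illustration only (the embodiments do not prescribe a concrete transcript format), the structural analysis of step 101 can be sketched in Python for a transcript in which each line carries a speaker label; the tab-separated line format and the Turn record below are assumptions introduced here, not part of the disclosed method.

```python
# Illustrative sketch of step 101: parse a speaker-labelled ASR transcript into
# structured dialogue turns. The "speaker<TAB>utterance" line format is assumed.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    index: int                      # turn number within the dialogue (0-based)
    speaker: str                    # speaker label, e.g. "A" or "B"
    utterances: List[str] = field(default_factory=list)  # utterances in this turn


def parse_dialogue_structure(raw_text: str) -> List[Turn]:
    """Group consecutive utterances of the same speaker into dialogue turns,
    recording speakers, utterances and the starting point of each turn."""
    turns: List[Turn] = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or "\t" not in line:
            continue                                     # skip empty/malformed lines
        speaker, utterance = line.split("\t", 1)
        if turns and turns[-1].speaker == speaker:
            turns[-1].utterances.append(utterance)       # same speaker: same turn
        else:
            turns.append(Turn(index=len(turns), speaker=speaker,
                              utterances=[utterance]))   # a new turn starts here
    return turns
```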
Step 102: perform anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks.
Specifically, the voice dialogue text can be filtered to remove abnormal corpus, where abnormal corpus includes data with problems such as overly long single-round content and too few or too many speakers; the voice dialogue text can then be segmented based on the context length supported by the large language model and the starting points of utterances and dialogue turns in the voice dialogue text to obtain a plurality of dialogue text blocks, where the segmentation points are starting or ending points of utterances and dialogue turns in the voice dialogue text, and a preset number of overlapping turns is kept between adjacent dialogue text blocks.
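A minimal sketch of this filtering and overlap-preserving segmentation follows; every threshold in it (maximum single-round length, expected speaker count, turns per block, overlap size) is an assumed value for illustration, since the text only names the criteria, and a turns-per-block budget stands in for the model's context length. It reuses the hypothetical Turn records from the previous sketch.

```python
# Illustrative sketch of step 102: drop abnormal dialogues, then cut the turn
# sequence into blocks that fit the model context, keeping overlapping turns.
# All thresholds below are illustrative assumptions, not values from the patent.
from typing import List

MAX_SINGLE_TURN_CHARS = 400        # "overly long single-round content"
MIN_SPEAKERS, MAX_SPEAKERS = 2, 2  # "too few or too many speakers"


def is_abnormal(turns: List["Turn"]) -> bool:
    """Return True if the dialogue should be rejected as abnormal corpus."""
    speakers = {t.speaker for t in turns}
    if not (MIN_SPEAKERS <= len(speakers) <= MAX_SPEAKERS):
        return True
    return any(sum(len(u) for u in t.utterances) > MAX_SINGLE_TURN_CHARS
               for t in turns)


def split_into_blocks(turns: List["Turn"], turns_per_block: int = 30,
                      overlap_turns: int = 2) -> List[List["Turn"]]:
    """Cut only at turn boundaries and keep `overlap_turns` turns of overlap
    between adjacent blocks so that the context stays coherent."""
    blocks, start = [], 0
    while start < len(turns):
        end = min(start + turns_per_block, len(turns))
        blocks.append(turns[start:end])
        if end == len(turns):
            break
        start = end - overlap_turns   # the next block repeats the last few turns
    return blocks
```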
Step 103: perform inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data.
The large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
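The exact wording of the corpus smoothing prompt is the product of repeated tuning and is not disclosed; the constant below is only a hypothetical example of the kind of content such a prompt might contain (a task definition plus concrete requirements).

```python
# Hypothetical example of a corpus smoothing prompt; the actual prompt used in
# the embodiments is tuned per model and is not disclosed in the patent.
CORPUS_SMOOTHING_PROMPT = (
    "Below is a block of transcribed two-person spoken dialogue. Rewrite it "
    "into a fluent multi-round dialogue while preserving the speakers, the "
    "turn structure and the original meaning. Requirements:\n"
    "1. Remove filler words, self-repetitions, self-corrections and obvious "
    "recognition errors.\n"
    "2. Keep the wording natural and colloquial; do not turn it into written "
    "prose.\n"
    "3. Output one line per turn in the form 'Speaker: utterance'.\n"
)
```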
In this embodiment, after the multi-round dialogue data is obtained by inference with the large language model based on the dialogue text blocks and the corpus smoothing prompt, the structure and integrity of the multi-round dialogue data may be further analyzed, and data with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences may be screened out.
Further, the dialogue text blocks corresponding to the screened-out data can be rolled back, and inference is performed again with the large language model based on the dialogue text blocks and the corpus smoothing prompt until the generated multi-round dialogue data passes detection.
The embodiments of the present application combine a large language model with existing voice dialogue text to generate multi-round dialogue data, which improves the production efficiency of multi-round dialogue data and the authenticity and diversity of the dialogue content; the generated multi-round dialogue data has high language naturalness, and its coverage of domains, topics and discussion depth exceeds that of datasets produced by conventional methods.
In the embodiments of the present application, the dialogue generation problem is converted into a problem similar to text smoothing and text rewriting, realizing a new production flow for dialogue data that can produce highly natural, multi-domain dialogue data with higher efficiency. Meanwhile, compared with traditional approaches, the embodiments of the present application use data from the speech field to produce data required by the natural language processing field, further mining the value of stock data resources and improving the utilization of data assets.
The following briefly describes the design concept and the overall flow of the dialogue data generation scheme of the embodiments of the present application. The underlying idea is as follows: if the value of existing stock voice dialogue text resources can be mined, preserving the naturalness, spontaneity and spoken-language character of real dialogue while removing redundant and disfluent information, then natural, authentic and diverse dialogue data can be obtained. At the same time, dialogue data production does not require a strict sentence-by-sentence correspondence between the output and the original data; some omission and rephrasing are allowed, which gives the process considerable flexibility. The text generation task of traditional production methods is thus converted into a generation task of a smoothing and rewriting nature, which greatly reduces the difficulty. Realizing this idea requires a system that can handle long contexts, has good semantic understanding and generation capabilities, and can perform open-ended tasks. A large language model is a very suitable technology to support this scheme.
It should be noted that this solution does not change the structure of the large language model; rather, it embeds the large language model as a functional module in the production flow, i.e., it is an application of the model's inference capability.
The following describes the dialogue data production scheme of an embodiment of the present application, which comprises steps 1-4. The text objects handled by the flow include both existing speech recognition dialogue corpora and the dialogue data after smoothing. To distinguish them, the automatic speech recognition (ASR) dialogue text that enters the flow is hereinafter referred to as the voice dialogue text, and the smoothed dialogue corpus output by the flow is referred to as the smoothed dialogue text.
1. Corpus preprocessing. The given voice dialogue text is preprocessed, specifically through the following modules:
(1) Dialogue structure analysis: obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text.
(2) Abnormal corpus detection: filter the voice dialogue text and remove abnormal corpus. For example, data with problems such as overly long single-round content or too few or too many speakers usually counts as abnormal.
(3) Corpus segmentation: based on the result of the structural analysis, the given voice dialogue text is split into blocks, because its length sometimes exceeds the context length supported by the language model. The key constraints are that a segmentation point must not fall inside an unfinished utterance or turn, and that a certain number of overlapping turns is kept between blocks, so that the context does not become incoherent.
2. Corpus smoothing. Using the corpus smoothing prompt, the prompt and the segmented voice dialogue text are fed to a large language model for inference to obtain the smoothed dialogue text; an illustrative sketch follows item (3) below.
(1) The corpus smoothing prompt used is a prompt text obtained by developers through repeated tuning for the particular large language model being called. The prompt is written in natural language and mainly states the task the model is to perform, namely smoothing the long dialogue corpus into multi-round dialogue, and generally includes a definition of the task and its specific requirements.
(2) The large language model is not limited to a particular model; any model that can accept long natural language text as input, output long natural language text, and perform general natural language tasks may be used, after developers have evaluated the quality of its generation.
(3) Inference can be performed offline on a local server or by remotely calling an external model service.
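Because the embodiments fix neither the model nor the inference backend, the sketch below hides the call behind a generic generate callable that maps a prompt string to generated text; any local offline model or remote service could be plugged in. The block rendering and the function names are assumptions.

```python
# Illustrative sketch of step 2 (corpus smoothing): render one dialogue text
# block, prepend the smoothing prompt, and run large-language-model inference.
# `generate` stands for any backend (local offline inference or a remote call);
# it is a placeholder, not a real library API.
from typing import Callable, List


def render_block(block: List["Turn"]) -> str:
    """Serialize a block of Turn records as 'Speaker: utterance' lines."""
    lines = []
    for turn in block:
        for utterance in turn.utterances:
            lines.append(f"{turn.speaker}: {utterance}")
    return "\n".join(lines)


def smooth_block(block: List["Turn"], prompt: str,
                 generate: Callable[[str], str]) -> str:
    """Return the smoothed multi-round dialogue text for one block."""
    model_input = prompt + "\n\nDialogue block:\n" + render_block(block)
    return generate(model_input)
```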
3. Quality detection.
This step performs anomaly detection on the smoothed dialogue text output by the model, mainly analyzing its structure and integrity and screening out, by rule, outputs with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences. Experimental experience shows that when such anomalies appear in the output, the entire output content is usually problematic.
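One possible reading of these rule-based checks is sketched below; the 'Speaker: utterance' line pattern, the strict speaker-alternation check and the sentence-final punctuation heuristic are assumptions, since the text only names the categories of anomaly.

```python
# Illustrative sketch of step 3 (quality detection): rule-based checks on the
# smoothed dialogue text. The regular expression, the alternation rule and the
# truncation heuristic are assumptions; the patent names only the anomaly types.
import re

LINE_PATTERN = re.compile(r"^([^:：]+)[:：]\s*(.+)$")   # "Speaker: utterance"
SENTENCE_END = tuple("。！？.!?")                        # crude truncation check


def passes_quality_check(smoothed_text: str, expected_speakers: int = 2) -> bool:
    lines = [l for l in smoothed_text.splitlines() if l.strip()]
    if not lines:
        return False
    speakers = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            return False                       # dialogue format anomaly
        speakers.append(match.group(1).strip())
        if not match.group(2).strip().endswith(SENTENCE_END):
            return False                       # sentence appears cut off
    if len(set(speakers)) != expected_speakers:
        return False                           # speaker number anomaly
    if any(a == b for a, b in zip(speakers, speakers[1:])):
        return False                           # round structure anomaly
    return True
```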
4. Data rollback.
For tasks whose generation results were found abnormal in the previous step, the corresponding voice dialogue text block is rolled back and steps 2-3 are re-executed until the generated smoothed dialogue text passes detection.
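Putting the hypothetical helpers from the previous sketches together, the rollback of step 4 amounts to a per-block retry loop; the retry cap below is an added assumption, since the text itself simply repeats steps 2-3 until detection passes.

```python
# Illustrative sketch of step 4 (data rollback): re-run smoothing for any block
# whose output fails quality detection. Uses the hypothetical helpers from the
# previous sketches; the retry cap of 3 is an assumption.
def produce_multi_round_dialogue(blocks, prompt, generate, max_retries: int = 3):
    results = []
    for block in blocks:
        for _ in range(max_retries):
            smoothed = smooth_block(block, prompt, generate)
            if passes_quality_check(smoothed):
                results.append(smoothed)       # block accepted
                break
        # Blocks that never pass within the cap are simply skipped here; the
        # patent itself keeps re-executing steps 2-3 until detection passes.
    return results
```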
Compared with existing production methods, the method for generating multi-round dialogue data of the embodiments of the present application improves dialogue data production in the following respects.
1. Improved time efficiency.
Template/outline-based methods require experts to create and continuously maintain a series of templates and outlines according to domain requirements, with manual polishing in later stages; crawler-based methods require collecting target websites, preparing crawler tools, and performing cleaning afterwards. With either method, building a dataset often takes weeks or even months. The method of the embodiments of the present application is a production mode based on a large language model, whose output can reach hundreds of thousands of dialogue rounds per day.
2. High language naturalness.
Dialogue generated from templates and outlines by rule-based template filling and replacement cannot break free of the preset syntactic structures, dialogue flows and rules, so it tends to appear too mechanical and stilted to simulate real human dialogue. Corpora obtained by crawler-based methods consist mainly of social media comments, whose language style, wording and content differ markedly from everyday conversation. Dialogue corpora generated autonomously by a large language model guided by prompt engineering tend to have a highly written style and wording. The method of the embodiments of the present application is a production mode based on real spoken dialogue corpora; it reflects real language use and is fundamentally different from dialogue that is simulated or generated automatically. The style and wording of the generated data are more natural and colloquial, closer to the style of natural dialogue in real life.
3. Authenticity and diversity of content.
In template/outline-based dialogue, diversity of content is usually extended by swapping in domain entity word lists; covering more domains requires experts to design new templates and outlines. The data produced this way is limited in topic diversity overall and, lacking the spontaneity of real dialogue, cannot fully represent the complexity of natural conversation. In methods based on prompt engineering, the domain diversity of the produced data is constrained by the model's prior knowledge, and in dialogues involving entity information the authenticity of the content is often hard to guarantee. The method of the embodiments of the present application is based on real open-domain spoken communication text and has a degree of creativity. It provides very high diversity and realism, adequately reflects the spontaneity and complexity of real conversations, and covers domains, topics and depths far exceeding those of datasets produced by conventional methods.
In summary, after being put into production, the method for generating multi-round dialogue data of the embodiments of the present application achieves highly efficient natural dialogue data production, producing multi-domain dialogue corpora at hundreds of thousands of rounds per day, far more efficiently than any previous single method. Moreover, the dialogue data produced this way covers domains, topics and discussion depths far beyond what manual specifications can prescribe, and its authenticity and naturalness far exceed what traditional methods achieve.
Fig. 2 is a schematic structural diagram of a device for generating multi-round dialogue data according to an embodiment of the present application, including:
The parsing module 210 is configured to perform structural analysis on the voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech.
The processing module 220 is configured to perform anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks.
Specifically, the processing module 220 is configured to filter the voice dialogue text and remove abnormal corpus, where the abnormal corpus includes data with problems such as overly long single-round content and too few or too many speakers.
In addition, the processing module 220 is configured to segment the voice dialogue text into a plurality of dialogue text blocks based on the context length supported by the large language model and the starting points of utterances and dialogue turns in the voice dialogue text; the segmentation points are starting or ending points of utterances and dialogue turns in the voice dialogue text, and a preset number of overlapping turns is kept between adjacent dialogue text blocks.
The inference module 230 is configured to perform inference with a large language model based on the dialogue text blocks and the corpus smoothing prompt to obtain multi-round dialogue data;
wherein the large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
Further, the above device also comprises:
a detection module, configured to analyze the structure and integrity of the multi-round dialogue data and screen out data with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences; and
a rollback module, configured to roll back the dialogue text blocks corresponding to the screened-out data and to perform inference with the large language model again based on the dialogue text blocks and the corpus smoothing prompt, until the generated multi-round dialogue data passes detection.
The embodiments of the present application combine a large language model with existing voice dialogue text to generate multi-round dialogue data, which improves the production efficiency of multi-round dialogue data and the authenticity and diversity of the dialogue content; the generated multi-round dialogue data has high language naturalness, and its coverage of domains, topics and discussion depth exceeds that of datasets produced by conventional methods.
The embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the processes of the above method embodiment for generating multi-round dialogue data and achieves the same technical effects, which are not repeated here. The computer-readable storage medium is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
It should be noted that, in this document, the terms "comprise", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is preferred. Based on such understanding, the technical solution of the present application, or the part of it that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. In light of the present application, those of ordinary skill in the art may devise many further forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims (10)

1. A method for generating multi-round dialogue data, comprising the steps of:
performing structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech;
performing anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks;
performing inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data;
wherein the large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
2. The method according to claim 1, wherein performing anomaly detection on the voice dialogue text according to the result of the structural analysis specifically comprises:
filtering the voice dialogue text and removing abnormal corpus from it, wherein the abnormal corpus includes data with problems such as overly long single-round content and too few or too many speakers.
3. The method according to claim 1, wherein performing corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks specifically comprises:
segmenting the voice dialogue text into a plurality of dialogue text blocks based on the context length supported by the large language model and the starting points of utterances and dialogue turns in the voice dialogue text;
wherein the segmentation points of the dialogue text blocks are starting or ending points of utterances and dialogue turns in the voice dialogue text, and a preset number of overlapping turns is kept between adjacent dialogue text blocks.
4. The method according to claim 1, further comprising, after performing inference with the large language model based on the dialogue text blocks and the corpus smoothing prompt to obtain the multi-round dialogue data:
analyzing the structure and integrity of the multi-round dialogue data, and screening out data with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences.
5. The method according to claim 4, further comprising, after analyzing the structure and integrity of the multi-round dialogue data and screening out data with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences:
rolling back the dialogue text blocks corresponding to the screened-out data, and performing inference with the large language model again based on the dialogue text blocks and the corpus smoothing prompt until the generated multi-round dialogue data passes detection.
6. A device for generating multi-round dialogue data, comprising:
a parsing module, configured to perform structural analysis on a voice dialogue text to obtain the speakers, the speaker utterances, the starting points of dialogue turns and their correspondences contained in the voice dialogue text; the voice dialogue text is a transcription of natural conversational speech;
a processing module, configured to perform anomaly detection and corpus segmentation on the voice dialogue text according to the result of the structural analysis to obtain a plurality of dialogue text blocks;
an inference module, configured to perform inference with a large language model based on the dialogue text blocks and a corpus smoothing prompt to obtain multi-round dialogue data;
wherein the large language model is used for performing general natural language tasks and can accept long natural language text as input and output long natural language text; the corpus smoothing prompt is a prompt text obtained after repeated tuning for the large language model, and is used to specify the task the large language model is to perform, namely smoothing the dialogue text blocks into multi-round dialogue data.
7. The device according to claim 6, wherein
the processing module is specifically configured to filter the voice dialogue text and remove abnormal corpus from it, where the abnormal corpus includes data with problems such as overly long single-round content and too few or too many speakers.
8. The device according to claim 6, wherein
the processing module is specifically configured to segment the voice dialogue text into a plurality of dialogue text blocks based on the context length supported by the large language model and the starting points of utterances and dialogue turns in the voice dialogue text;
wherein the segmentation points of the dialogue text blocks are starting or ending points of utterances and dialogue turns in the voice dialogue text, and a preset number of overlapping turns is kept between adjacent dialogue text blocks.
9. The device according to claim 6, further comprising:
a detection module, configured to analyze the structure and integrity of the multi-round dialogue data and screen out data with dialogue format anomalies, speaker number anomalies, round structure anomalies or cut-off sentences.
10. The device according to claim 9, further comprising:
a rollback module, configured to roll back the dialogue text blocks corresponding to the screened-out data and to perform inference with the large language model again based on the dialogue text blocks and the corpus smoothing prompt, until the generated multi-round dialogue data passes detection.
CN202410444653.2A (priority date 2024-04-15, filing date 2024-04-15) Method and device for generating multi-round dialogue data, Pending, published as CN118052222A

Priority Applications (1)

Application Number: CN202410444653.2A (publication CN118052222A) | Priority Date: 2024-04-15 | Filing Date: 2024-04-15 | Title: Method and device for generating multi-round dialogue data


Publications (1)

Publication Number: CN118052222A | Publication Date: 2024-05-17

Family

ID=91046868

Family Applications (1)

Application Number: CN202410444653.2A | Title: Method and device for generating multi-round dialogue data | Status: Pending | Priority Date: 2024-04-15 | Filing Date: 2024-04-15

Country Status (1)

Country Link
CN (1) CN118052222A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162613A (en) * 2019-05-27 2019-08-23 腾讯科技(深圳)有限公司 A kind of problem generation method, device, equipment and storage medium
US20200257861A1 (en) * 2019-02-13 2020-08-13 Oracle International Corporation Chatbot conducting a virtual social dialogue
CN114579725A (en) * 2022-03-04 2022-06-03 北京百度网讯科技有限公司 Question and answer pair generation method and device, electronic equipment and storage medium
CN115329780A (en) * 2022-08-17 2022-11-11 北京小米移动软件有限公司 Multi-turn dialogue rewriting method, device, electronic equipment and medium
CN117407502A (en) * 2023-10-23 2024-01-16 科大讯飞股份有限公司 Question-answer pair extraction method and device, electronic equipment and storage medium
WO2024069978A1 (en) * 2022-09-30 2024-04-04 日本電信電話株式会社 Generation device, learning device, generation method, training method, and program



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination