CN117573816A - Question-answer data generation method, device, equipment and storage medium

Info

Publication number
CN117573816A
Authority
CN
China
Prior art keywords
data
answer
question
dialogue
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310835504.4A
Other languages
Chinese (zh)
Inventor
杨昌林
汪亲
张望舒
胡森
许腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd
Priority to CN202310835504.4A
Publication of CN117573816A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present specification provide a question-answer data generation method, device, equipment and storage medium. The method comprises: acquiring answer-free question data and target answer data in service dialogue data; confirming, from the service dialogue data, first dialogue data associated with the answer-free question data and second dialogue data associated with the target answer data; confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data; confirming target question data matched with the target answer data based on the second dialogue data and the target answer data; and generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data.

Description

Question-answer data generation method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for generating question-answer data.
Background
In online customer service systems, intelligent customer service robots, intelligent assistants and the like can interact directly with users in natural language to resolve the users' problems.
When resolving a customer's problem, a customer service robot generally searches for the answer in a pre-built knowledge base containing a large number of questions and their answers. However, constructing and maintaining a high-quality question-answer (Frequently Asked Questions, FAQ) knowledge base requires a great deal of manpower. On the one hand, user needs are diverse, and question-answer pairs constructed only from operators' experience cannot cover those needs well, which degrades the dialogue experience. On the other hand, the number of dialogues is very large and their content is complex: compared with a document or knowledge base, the flow of a dialogue is complicated, and the content and style of dialogues differ greatly across customer service agents and users. Therefore, a method capable of efficiently generating high-quality and complete question-answer data is needed.
Disclosure of Invention
The main purpose of the present specification is to provide a method, a device and a storage medium for generating question-answer data, which aim to solve the problem of low efficiency in constructing a question-answer knowledge base. The technical scheme is as follows:
In a first aspect, an embodiment of the present specification provides a method for generating question-answer data, including:
acquiring answer-free question data and target answer data in service dialogue data;
confirming first dialogue data associated with the answer-free question data from the service dialogue data, and confirming second dialogue data associated with the target answer data;
confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
and generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
In a second aspect, embodiments of the present disclosure provide a question-answer data generating device, including:
an acquisition module for acquiring answer-free question data and target answer data in the service dialogue data;
a dialogue confirmation module for confirming first dialogue data associated with the answer-free question data from the service dialogue data and confirming second dialogue data associated with the target answer data;
an answer confirmation module for confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
a question confirmation module for confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
and a generating module for generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
In a third aspect, embodiments of the present disclosure provide an electronic device, the device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method as described above.
In a fourth aspect, embodiments of the present description provide a storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the method as described above.
In a fifth aspect, embodiments of the present description provide a computer program product comprising: a computer program which, when executed by a processor of an electronic device, causes the processor to at least implement a method as described in the first aspect.
In the embodiments of the present specification, answer-free question data and target answer data in service dialogue data are acquired; first dialogue data associated with the answer-free question data and second dialogue data associated with the target answer data are confirmed from the service dialogue data; answer data matching the answer-free question data is confirmed based on the first dialogue data and the answer-free question data; target question data matching the target answer data is confirmed based on the second dialogue data and the target answer data; and question-answer data is generated based on the answer-free question data, the answer data, the target answer data, and the target question data. Answers are produced for the answer-free question data that arises in service dialogues, and high-quality target answer data is mined from the service dialogues so that matching target question data can be generated for it. High-quality question-answer data is thereby obtained, and the efficiency of knowledge base construction is improved.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an exemplary schematic diagram of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 3 is a schematic flow chart of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 4 is an exemplary schematic diagram of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 5 is a schematic flow chart of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 6 is a flowchart of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 7 is an overall flowchart of a question-answer data generating method according to an embodiment of the present disclosure;
Fig. 8 is a schematic structural diagram of a question-answer data generating device according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The technical solutions of the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
In addition, it should be noted that, the user information and data (including, but not limited to, data for analysis, stored data, presented data, etc.) in the embodiments of the present disclosure are all information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant country and region, and is provided with a corresponding operation portal for the user to select authorization or rejection.
The question-answer data generating device provided in the embodiments of the present disclosure may be a terminal device such as a mobile phone, a computer, a tablet computer, a smart watch or a vehicle-mounted device, or may be a module in the terminal device that implements the question-answer data generation method. The question-answer data generating device may obtain answer-free question data and target answer data in service dialogue data, confirm first dialogue data associated with the answer-free question data from the service dialogue data, confirm second dialogue data associated with the target answer data, confirm answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data, confirm target question data matched with the target answer data based on the second dialogue data and the target answer data, and generate question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data.
Referring to fig. 1, an exemplary schematic diagram of a question-answer data generating method is provided for an embodiment of the present disclosure, where a question-answer data generating device obtains answer-free question data and target answer data from service dialogue data, generates answer data for the answer-free question data based on first dialogue data associated with the answer-free question data, and confirms target question data for the target answer data based on second dialogue data associated with the target answer data, and further generates question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data. By automatically mining question and answer data from service dialogue data, the efficiency of building and updating knowledge base can be improved.
The method for generating question-answer data provided in the present specification will be described in detail with reference to specific examples.
Referring to fig. 2, a flowchart of a method for generating question-answer data is provided in an embodiment of the present disclosure. As shown in fig. 2, the method of the embodiments of the present specification may include the following steps S102-S110.
S102, obtaining answer-free question data and target answer data in service dialogue data;
S104, confirming first dialogue data associated with the answer-free question data from the service dialogue data and confirming second dialogue data associated with the target answer data;
S106, confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
S108, confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
S110, generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
The question-answer data generation method in the embodiments of this specification is mainly used to construct and maintain a high-quality FAQ knowledge base. FAQ question answering is a core component of human-machine dialogue in intelligent customer service: it reduces the workload of human customer service by answering common questions that users frequently ask. However, constructing and maintaining a high-quality FAQ knowledge base consumes a great deal of manpower, and because user needs are diverse, QA pairs constructed only from operators' experience cannot cover those needs well, which degrades the dialogue experience. The embodiments of this specification therefore provide a question-answer data generation method that mines high-quality question-answer pairs from service dialogue data containing rich knowledge, which can improve the efficiency of constructing a knowledge base, continuously and iteratively optimize the quality of the FAQ knowledge base, and in turn improve the question-answering performance of intelligent customer service.
The following will explain each step in detail:
S102, obtaining answer-free question data and target answer data in service dialogue data;
In one embodiment of the present description, service dialogue data refers to dialogue data produced by human or robot customer service in a dialogue interaction scenario. In particular, service dialogue data may be obtained from logs of dialogues between human or robot customer service and customers. Answer-free question data refers to a question phrasing that has no answer in the knowledge base, that is, a phrasing that cannot be linked to any existing standard question in the knowledge base. Target answer data refers to answer data of interest identified from the customer service answer data in the service dialogue data; for example, it may be answer data that appears at a high frequency. Customer service answer data refers to feedback given by human or robot customer service in response to customer input; for example, it may be a reply text sent in response to a customer question.
Specifically, a dialogue between human customer service and a customer may be referred to as a human-human dialogue scenario, and a dialogue between robot customer service and a customer may be referred to as a human-machine dialogue scenario. It can be understood that in a human-machine dialogue scenario the robot customer service may be unable to answer a question, whereas in a human-human dialogue scenario high-quality answer data is more easily obtained because the answers are given by human customer service. Therefore, answer-free question data can be obtained from service dialogue data in the human-machine dialogue scenario, and target answer data can be obtained from service dialogue data in the human-human dialogue scenario.
It should be noted that, in addition to the service session data in a specific customer service session scenario, the method may also be applied in a conventional session scenario, such as a question-answer in a chat scenario.
S104, confirming first dialogue data associated with the answer-free question data from the service dialogue data and confirming second dialogue data associated with the target answer data;
In one embodiment of the present specification, after the answer-free question data and the target answer data are confirmed, the first dialogue data associated with the answer-free question data and the second dialogue data associated with the target answer data are confirmed. Dialogue data is data containing a question-answer dialogue, and may contain a single round or multiple rounds of dialogue. A complete round of dialogue runs from the user's input to the customer service's feedback on that input. A multi-round dialogue refers to an entire interaction in which user input is obtained over more than one round in order to finally give a feedback result that meets the user's needs.
It will be appreciated that although an operator could manually produce answers for answer-free question data, doing so is costly. The first dialogue data associated with the answer-free question data may therefore be confirmed in the existing service dialogue data, so that an answer can be produced from first dialogue data that may contain the answer content. Similarly, since the target answer data is answer data generated in the service dialogue data, the associated second dialogue data can be confirmed from the target answer data, and the corresponding question data can then be confirmed. The first dialogue data may be obtained from service dialogue data other than the no-answer dialogue data, where no-answer dialogue data refers to dialogue data in the service dialogue data that includes answer-free question data; the second dialogue data may be confirmed from all of the service dialogue data. Specifically, the first dialogue data need not be acquired from the dialogue round to which the answer-free question data belongs and may be dialogue data acquired from other dialogue rounds, which is not specifically limited. The second dialogue data is preferably acquired from the dialogue round to which the target answer data belongs.
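As a rough illustration of how first dialogue data might be recalled, the sketch below skips dialogues that contain the answer-free question itself (the no-answer dialogue data) and ranks the remaining dialogues by token overlap with the question. The Jaccard similarity, the `(speaker, text)` turn layout and all function names are assumptions made for illustration; the specification does not prescribe a particular recall technique.

```python
def tokens(text):
    """Lower-cased whitespace tokens; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set Jaccard similarity (hypothetical recall score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def recall_first_dialogue_data(question, dialogues, top_k=1):
    """Return up to top_k dialogues most related to the answer-free question."""
    candidates = []
    for dialogue in dialogues:
        user_turns = [text for speaker, text in dialogue if speaker == "user"]
        # Skip no-answer dialogue data: dialogues containing the question itself.
        if question in user_turns:
            continue
        score = max((jaccard(question, turn) for turn in user_turns), default=0.0)
        if score > 0.0:
            candidates.append((score, dialogue))
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [dialogue for _, dialogue in candidates[:top_k]]


question = "how do i change my bound phone number"
no_answer_dialogue = [
    ("user", question),
    ("agent", "sorry, i cannot answer that"),
]
related_dialogue = [
    ("user", "how can i change the phone number bound to my account"),
    ("agent", "open settings, choose security, then update the phone number"),
]
unrelated_dialogue = [
    ("user", "where is my refund"),
    ("agent", "refunds arrive within three days"),
]
first_dialogue_data = recall_first_dialogue_data(
    question, [no_answer_dialogue, related_dialogue, unrelated_dialogue]
)
```

In practice the overlap score would likely be replaced by a retrieval model; the point here is only the exclusion of no-answer dialogue data before ranking.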
S106, confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
In one embodiment of the present specification, after the first dialogue data is recalled, answer data may be extracted based on the first dialogue data and the answer-free question data. For example, answer data matching the answer-free question data may be extracted from the first dialogue data by a reading-comprehension-based question-answering model, which may be built on a model structure such as a CNN (Convolutional Neural Network) or BERT (Bidirectional Encoder Representations from Transformers).
S108, confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
In one embodiment of the present description, the target question data is the question corresponding to each answer in the target answer data. According to the acquired target answer data and the second dialogue data, target question data matched with the target answer data is confirmed from the second dialogue data. In one possible implementation, the target question data may be confirmed based on the degree of matching between the target answer data and the question data in the second dialogue data.
S110, generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
In an embodiment of the present disclosure, after the answer data corresponding to the answer-free question data is obtained, the answer-free question data and its answer data may be associated to form a question answer pair. A question answer pair may likewise be generated from the obtained target answer data and target question data. The resulting question answer pairs may be refined and used as new or supplementary knowledge points in the knowledge base.
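The pairing in S110 can be sketched as follows. This is a minimal illustration that assumes the two mining branches return plain string tuples; the `QuestionAnswerPair` class, the function name and the de-duplication step are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestionAnswerPair:
    question: str
    answer: str


def generate_question_answer_data(answered_questions, mined_answers):
    """
    answered_questions: (answer-free question, answer) tuples from S106.
    mined_answers: (target answer, target question) tuples from S108.
    Returns de-duplicated question answer pairs in input order.
    """
    # Normalise both branches to (question, answer) order.
    candidates = list(answered_questions) + [(q, a) for a, q in mined_answers]
    pairs, seen = [], set()
    for question, answer in candidates:
        pair = QuestionAnswerPair(question.strip(), answer.strip())
        # Drop empty or repeated pairs (an assumed, simple form of refinement).
        if pair.question and pair.answer and pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


qa_data = generate_question_answer_data(
    answered_questions=[
        ("how do i change my bound phone number",
         "open settings and update the phone number"),
    ],
    mined_answers=[
        ("refunds arrive within three days", "how long does a refund take"),
    ],
)
```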
In the embodiments of the present specification, answer-free question data and target answer data in service dialogue data are acquired; first dialogue data associated with the answer-free question data and second dialogue data associated with the target answer data are confirmed from the service dialogue data; answer data matching the answer-free question data is confirmed based on the first dialogue data and the answer-free question data; target question data matching the target answer data is confirmed based on the second dialogue data and the target answer data; and question-answer data is generated based on the answer-free question data, the answer data, the target answer data, and the target question data. By confirming the target answer data and the answer-free question data in the service dialogue data, knowledge can be mined from a large amount of service dialogue data, so that the content of the online FAQ knowledge base can be continuously enriched, a knowledge base can be rapidly constructed in a cold-start scenario, and the efficiency of constructing and updating the knowledge base is improved.
Referring to fig. 3, a flowchart of a method for generating question-answer data is provided in an embodiment of the present disclosure. As shown in fig. 3, the method of the embodiments of the present specification may include the following steps S202 to S214.
S202, acquiring service dialogue data, and clustering customer service answer data in the service dialogue data to obtain a customer service answer data class cluster;
In one embodiment of the present disclosure, when the target answer data is acquired, the customer service answer data in the acquired service dialogue data may be clustered to obtain a clustering result, that is, customer service answer data clusters. Customer service answer data refers to the dialogue content sent by customer service in the service dialogue data. Specifically, the customer service answer text may be converted into vectors, and a clustering algorithm may then be applied to the vectors. The clustering algorithm may be, for example, k-means or BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and is not particularly limited.
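To make the vectorize-then-cluster step concrete, the sketch below runs a small pure-Python k-means over bag-of-words vectors. The bag-of-words representation, the choice of the first k vectors as initial centroids and the fixed iteration count are all assumptions chosen to keep the example short and reproducible; a real system would more likely use sentence embeddings and a library implementation.

```python
import math
from collections import Counter


def vectorize(texts):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocab = sorted({tok for text in texts for tok in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def kmeans(vectors, k, iterations=20):
    """Plain k-means; initial centroids are the first k vectors (assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: mean of the members of each non-empty cluster.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


customer_service_answers = [
    "please reset your password in settings",
    "refunds arrive within three days",
    "you can reset your password in settings",
    "refunds usually arrive within three days",
]
cluster_labels = kmeans(vectorize(customer_service_answers), k=2)
```

The resulting labels group the two password replies together and the two refund replies together, which is the cluster structure that S204 then measures.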
S204, confirming a first occurrence frequency of the customer service answer data based on the cluster size of each customer service answer data cluster, and confirming target answer data from the customer service answer data based on the first occurrence frequency;
In one embodiment of the present disclosure, after the customer service answer data clusters are confirmed, the occurrence frequency of each cluster may be confirmed from the cluster size, that is, the number of customer service answer data items of the same class in the cluster, and the total amount of data; the occurrence frequency of each cluster is then used as the first occurrence frequency of the customer service answer data in that cluster. Target answer data with a higher first occurrence frequency is confirmed from the customer service answer data according to that frequency. Specifically, a frequency threshold may be set and the customer service answer data whose first occurrence frequency is higher than the threshold screened out as target answer data, so that high-frequency customer service answers are mined from the service dialogue data while avoiding mining a large number of similar question answer pairs.
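The frequency computation above amounts to dividing each cluster's size by the total number of answers. The sketch below shows that calculation together with a threshold filter. Keeping only the first answer of each qualifying cluster is an assumed way of avoiding near-duplicate answers, and the function names and threshold value are hypothetical.

```python
from collections import Counter


def first_occurrence_frequencies(cluster_labels):
    """Frequency of each answer = size of its cluster / total number of answers."""
    total = len(cluster_labels)
    sizes = Counter(cluster_labels)
    return [sizes[label] / total for label in cluster_labels]


def select_target_answers(answers, cluster_labels, frequency_threshold):
    """Keep one representative answer per cluster whose frequency exceeds the threshold."""
    frequencies = first_occurrence_frequencies(cluster_labels)
    selected, used_clusters = [], set()
    for answer, label, frequency in zip(answers, cluster_labels, frequencies):
        if frequency > frequency_threshold and label not in used_clusters:
            used_clusters.add(label)
            selected.append(answer)
    return selected


answers = ["a1", "a2", "a3", "b1", "c1", "c2"]
labels = [0, 0, 0, 1, 2, 2]
frequencies = first_occurrence_frequencies(labels)
target_answers = select_target_answers(answers, labels, frequency_threshold=0.3)
```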
In one embodiment of the present disclosure, the step of confirming the first occurrence frequency of the customer service answer data based on the cluster size of each of the customer service answer data clusters, and confirming the target answer data from the customer service answer data based on the first occurrence frequency includes the following steps S302-S306:
S302, confirming high-frequency answer data from the customer service answer data based on the first occurrence frequency;
In one embodiment of the present specification, the high-frequency answer data is confirmed from the customer service answer data according to the first occurrence frequency corresponding to the customer service answer data. Specifically, the customer service answer data may be ranked by first occurrence frequency, and the customer service answer data ranked in the top 5% may be selected as the high-frequency answer data.
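A minimal sketch of the top-5% selection, assuming each answer is paired with its already computed first occurrence frequency. The function name, the `top_percent` parameter and the rule of keeping at least one item are illustrative assumptions.

```python
import math


def top_percent_by_frequency(answers_with_frequency, top_percent=5):
    """Rank (answer, frequency) tuples by frequency and keep the top share."""
    ranked = sorted(answers_with_frequency, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    # Round up so that small inputs still yield at least one answer.
    keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    return [answer for answer, _ in ranked[:keep]]


scored_answers = [(f"answer {i}", i / 100) for i in range(40)]
high_frequency_answers = top_percent_by_frequency(scored_answers)
```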
S304, confirming the first similarity between the high-frequency answer data and answer data in a knowledge base;
In one embodiment of the present disclosure, after the high-frequency answer data is confirmed, the first similarity between the high-frequency answer data and the answer data in the knowledge base also needs to be confirmed, in order to determine whether the high-frequency answer data contains content that can be used as an answer. In a possible implementation, a plurality of knowledge points are stored in the knowledge base, each knowledge point may consist of a question and an answer, and the first similarity may be calculated by matching the high-frequency answer data against the answer data corresponding to each knowledge point in the knowledge base.
S306, confirming target answer data from the high-frequency answer data based on the first similarity;
In one embodiment of the present specification, whether the high-frequency answer data contains knowledge points is confirmed according to the first similarity, so that high-quality, representative target answer data is selected from the service dialogue data as a knowledge source. Specifically, when the first similarity is higher than a similarity threshold, it may be confirmed that the high-frequency answer data contains knowledge, and that data may be confirmed as target answer data.
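The similarity check in S304 to S306 could look like the sketch below, which takes the first similarity of an answer to be its best match against any knowledge base answer. The character-level ratio from Python's `difflib.SequenceMatcher` is only a stand-in for whatever text-matching model is actually used, and the threshold value and function names are assumptions.

```python
from difflib import SequenceMatcher


def first_similarity(answer, knowledge_base_answers):
    """Highest similarity between the answer and any knowledge base answer."""
    return max(
        (
            SequenceMatcher(None, answer.lower(), kb_answer.lower()).ratio()
            for kb_answer in knowledge_base_answers
        ),
        default=0.0,
    )


def confirm_target_answers(high_frequency_answers, knowledge_base_answers,
                           similarity_threshold=0.8):
    """Keep high-frequency answers whose first similarity exceeds the threshold."""
    return [
        answer
        for answer in high_frequency_answers
        if first_similarity(answer, knowledge_base_answers) > similarity_threshold
    ]


knowledge_base_answers = [
    "Refunds arrive within three days.",
    "Open settings to reset your password.",
]
candidates = ["Refunds arrive within three days.", "Thanks for waiting"]
confirmed = confirm_target_answers(candidates, knowledge_base_answers)
```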
Optionally, target answer data may also be confirmed by a knowledge detection model. The knowledge detection model may consist of BERT plus a classification layer, and it can be cold-started quickly because its training data is constructed automatically rather than from labeled data: answers in the knowledge base are taken as positive examples, and customer service answers from recalled dialogues, selected according to their similarity to the answers in the knowledge base, are taken as negative examples. For example, given the input "[CLS] I am a sentence", the knowledge detection model outputs the probability of class 0 or class 1 (0: contains no knowledge; 1: contains knowledge), and high-frequency answer data that contains knowledge is confirmed as target answer data.
S206, matching the dialogue fragments to which the target answer data belong in the service dialogue data;
In one embodiment of the present specification, the dialogue segment to which the target answer data belongs is confirmed after the target answer data is obtained. Specifically, the dialogue segment to which the target answer data belongs may be located by text matching, and the dialogue segment may be a dialogue data set comprising the target answer data and its context.
S208, confirming second dialogue data associated with the target answer data based on the dialogue segment;
In one embodiment of the present specification, after the dialogue segment is obtained, the second dialogue data of the target answer data may be confirmed according to the dialogue segment. Specifically, the dialogue segment may be composed of the customer service dialogue data to which the target answer data belongs and the preceding user dialogue data, and the dialogue data of the related dialogue may be acquired together as the second dialogue data.
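Locating the segment by text matching can be sketched as follows. The representation of a dialogue as a list of (speaker, text) tuples, the speaker label "agent" and the `context_turns` window are hypothetical choices for illustration only.

```python
def find_dialogue_segment(dialogue, target_answer, context_turns=3):
    """Return the preceding context plus the agent turn that contains target_answer.

    dialogue: list of (speaker, text) tuples in chronological order.
    Returns an empty list when no agent turn contains the target answer.
    """
    for i, (speaker, text) in enumerate(dialogue):
        if speaker == "agent" and target_answer in text:
            start = max(0, i - context_turns)
            # The segment ends at the matched agent turn and includes earlier turns.
            return dialogue[start:i + 1]
    return []
```

Exact substring matching is the simplest form of text matching; a fuzzy match could be substituted when answers are normalized before clustering.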
S210, confirming the round distance and the first question-answer matching degree of the target answer data and the question data in the second dialogue data;
In one embodiment of the present specification, after the target answer data is obtained, the round distance and the first question-answer matching degree between the target answer data and the question data in the second dialogue data may be confirmed. The round distance refers to the distance between the round to which the target answer data belongs and the round of the question data sent by the client in the second dialogue data. The first question-answer matching degree refers to the question-answer relevance between the target answer data and the question data in the second dialogue data, and can be obtained from a question-answer relevance matching model.
S212, confirming target question data matched with the target answer data from the question data in the second dialogue data based on the round distance and the first question-answer matching degree;
In one embodiment of the present disclosure, after the round distance and the first question-answer matching degree are obtained, question data whose round is close to that of the target answer data may first be selected from the question data in the second dialogue data according to the round distance, and the target question data matched with the target answer data may then be confirmed according to the first question-answer matching degree between each such nearby question data and the target answer data. Alternatively, the round distance and the first question-answer matching degree of each question data in the second dialogue data may be scored separately, and the target question data may be confirmed by calculating the total score of each question data. Referring to fig. 4, fig. 4 is a schematic diagram of an exemplary method for generating question-answer data according to an embodiment of the present disclosure. Fig. 4 shows second dialogue data that includes target answer data; the target question data of the target answer data may be confirmed in the second dialogue data according to the round distance and the first question-answer matching degree.
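The total-score variant can be sketched as a weighted sum of a distance score and a matching score. The linear distance score, the weight of 0.3, the `max_distance` cut-off and the `match_fn` callback (standing in for the question-answer relevance matching model) are assumptions for illustration; the specification does not prescribe a scoring formula.

```python
def pick_target_question(answer_turn, questions, match_fn,
                         distance_weight=0.3, max_distance=5):
    """Pick the user question with the highest combined score.

    answer_turn: round index of the target answer.
    questions:   list of (round_index, text) for user questions in the second dialogue data.
    match_fn:    callable returning the first question-answer matching degree in [0, 1]
                 for a question text (assumed to close over the target answer).
    """
    best, best_score = None, float("-inf")
    for turn, text in questions:
        distance = answer_turn - turn
        if distance <= 0 or distance > max_distance:
            continue  # skip questions asked after the answer or too far before it
        # Nearer rounds score higher: distance 1 scores 1.0.
        distance_score = 1.0 - (distance - 1) / max_distance
        total = distance_weight * distance_score + (1 - distance_weight) * match_fn(text)
        if total > best_score:
            best, best_score = text, total
    return best
```

The weight controls how much proximity can outweigh semantic relevance; with a low distance weight, a nearby but irrelevant turn such as "ok" loses to a slightly earlier, relevant question.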
S214, the second dialogue data and the target answer data are spliced and then input into a question generation model, and target question data corresponding to the target answer data are output by the question generation model.
In one embodiment of the present disclosure, after the target answer data is obtained, the second dialogue data and the target answer data may be spliced and then input into the question generation model, and the question generation model outputs the target question data corresponding to the target answer data. Specifically, the question generation model may be trained based on BART, a denoising autoencoder built on a sequence-to-sequence model. The training data set may be obtained in two ways: the question data corresponding to answer data may be taken from the knowledge base, or answer data, the context (dialogue segment) corresponding to the answer data, and the question data corresponding to the answer data may be taken from the open-domain DuReader reading comprehension data set. Training answer data and corresponding training question data are obtained in these two ways, and the question generation model is trained on them.
In this embodiment of the present disclosure, service dialogue data is acquired, the customer service answer data in the service dialogue data is clustered to obtain customer service answer data class clusters, a first occurrence frequency of the customer service answer data is confirmed based on the class cluster size of each customer service answer data class cluster, and high-frequency answer data is confirmed from the customer service answer data based on the first occurrence frequency, so that high-frequency customer service answers are mined and a large number of similar answer pairs is avoided. A first similarity between the high-frequency answer data and the answer data in a knowledge base is then confirmed, and target answer data is confirmed from the high-frequency answer data based on the first similarity, so that whether a customer service answer contains "knowledge" is detected and target answer data containing knowledge is selected. After the target answer data is confirmed, a dialogue segment to which the target answer data belongs may be matched in the service dialogue data, and second dialogue data associated with the target answer data may be confirmed based on the dialogue segment. The round distance and the first question-answer matching degree between the target answer data and the question data in the second dialogue data may then be confirmed, and target question data matched with the target answer data may be confirmed from the question data in the second dialogue data based on the round distance and the first question-answer matching degree; or the second dialogue data and the target answer data may be spliced and input into a question generation model, which outputs the target question data corresponding to the target answer data.
For a large amount of service dialogue data, a question-answer pair production pipeline of clustering, knowledge detection and question production makes it possible to extract the target answer data of interest, together with the corresponding target question data, from a large number of human dialogue logs, and thus to extract question-answer pairs containing knowledge points.
Referring to fig. 5, a flowchart of a method for generating question-answer data is provided in an embodiment of the present disclosure. As shown in fig. 5, the method of the embodiments of the present specification may include the following steps S402 to S408.
S402, obtaining answer-free question data in service dialogue data;
In one embodiment of the present disclosure, defect detection may be performed on the human-machine dialogue data in the service dialogue data to detect the answer-free question data.
S404, confirming a second similarity of the answer-free question data and the service dialogue data, and confirming similar dialogue data from the service dialogue data based on the second similarity;
In one embodiment of the present specification, after the answer-free question data is obtained, a second similarity between the answer-free question data and the service dialogue data may be confirmed, and similar dialogue data may be confirmed from the service dialogue data based on the second similarity. For example, the BM25 algorithm may be used to calculate the second similarity and recall the top-k similar dialogue data most similar to the answer-free question data. BM25 is a classical algorithm in the information retrieval field for calculating a similarity score between a query and a document. Of course, other text similarity algorithms may also be used, which is not specifically limited.
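The BM25 recall step can be sketched in a few lines using the standard Okapi BM25 formula. Treating each dialogue as one whitespace-tokenized document and the parameter values k1 = 1.5 and b = 0.75 are assumptions for illustration; the specification names the algorithm but not its configuration.

```python
import math
from collections import Counter


def bm25_top_k(query, documents, k=3, k1=1.5, b=0.75):
    """Rank documents against the query with Okapi BM25 and return the top k.

    Returns a list of (document, score) pairs, highest score first.
    """
    docs = [doc.lower().split() for doc in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for idx, tokens in enumerate(docs):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append((score, idx))
    # Sort by descending score, breaking ties by original order.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(documents[idx], score) for score, idx in scores[:k]]
```

Lexical recall of this kind is cheap enough to run over the full dialogue log, which is why it is a natural first stage before the semantic ranking by the third similarity.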
S406, confirming a third similarity of the answer-free question data and the user round data in the similar dialogue data, and confirming first dialogue data from the similar dialogue data based on the third similarity;
In one embodiment of the present disclosure, the third similarity may be understood as a semantic similarity between the answer-free question data and the similar dialogue data, through which the dialogue segment most relevant to the answer-free question data can be found. Specifically, since the answer-free question data is a question posed by the client, the matching range can be restricted to the user round data in the similar dialogue data, that is, the dialogue data sent in user rounds. The answer-free question data and the user round data in the similar dialogue data can be converted into vector representations, and the third similarity is recorded as sim_qq = maxsim(q_k, query), where q_k is a user round in the dialogue and query is the answer-free question data. If sim_qq is lower than a set threshold, no suitable dialogue is available to answer the question query, and rejection processing is performed; otherwise, the dialogue with the highest similarity is taken as the first dialogue data, maxsim(user_querys). Here sim_qq may be calculated with a cosine similarity algorithm or with another similarity measure, which is not specifically limited.
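The selection rule for the first dialogue data can be sketched as follows, reading sim_qq as the maximum cosine similarity over the user rounds of a dialogue. The vectors are assumed to come from some sentence encoder that is not shown; the function names, the data layout and the threshold of 0.7 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def select_first_dialogue(query_vec, similar_dialogues, threshold=0.7):
    """Pick the dialogue whose best user round is most similar to the query.

    similar_dialogues: list of (dialogue_id, [user_round_vectors]).
    Returns (dialogue_id, sim_qq), or None when the query is rejected.
    """
    best_id, best_sim = None, -1.0
    for dialogue_id, user_round_vecs in similar_dialogues:
        # sim_qq for this dialogue: maximum over its user rounds q_k.
        sim_qq = max((cosine(q_k, query_vec) for q_k in user_round_vecs), default=-1.0)
        if sim_qq > best_sim:
            best_id, best_sim = dialogue_id, sim_qq
    if best_id is None or best_sim < threshold:
        return None  # rejection: no dialogue is suitable for answering the query
    return best_id, best_sim
```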
S408, inputting the first dialogue data and the non-answer question data into a dialogue question answer model, and outputting answer data matched with the non-answer question data by the dialogue question answer model.
In one embodiment of the present description, answer data that matches the answer-free question data may be derived from a pre-trained dialogue question answer model. In one possible implementation, the dialogue question answer model is obtained by training a machine reading comprehension model on the Question Answering on Informative Conversations (QAConv) dataset. The machine reading comprehension model may be a BERT (Bidirectional Encoder Representations from Transformers), BiDAF (Bi-Directional Attention Flow) or ELMo (Embeddings from Language Models) model, or another functionally similar model, which is not limited in this embodiment.
In this embodiment of the present disclosure, answer-free question data in the service dialogue data is acquired; a second similarity between the answer-free question data and the service dialogue data is confirmed, and similar dialogue data is confirmed from the service dialogue data based on the second similarity; a third similarity between the answer-free question data and the user round data in the similar dialogue data is confirmed, and first dialogue data is confirmed from the similar dialogue data based on the third similarity; and the first dialogue data and the answer-free question data are input into a dialogue question answer model, which outputs answer data matching the answer-free question data. In this way, unanswered user questions in the service dialogue data are detected automatically. On the one hand, similar dialogue data is confirmed based on the second similarity between the answer-free question data and the dialogue as a whole; on the other hand, the first dialogue data is confirmed based on the third similarity between the answer-free question data and the user round data. The dialogue question answer model then automatically produces the corresponding answers from the first dialogue data and the answer-free question data, which improves the accuracy and efficiency of answer production.
Referring to fig. 6, a flowchart of a method for generating question-answer data is provided in an embodiment of the present disclosure. As shown in fig. 6, the method of the embodiments of the present specification may include the following steps S502 to S506.
S502, generating first question-answer data based on the answer-free question data and the answer data;
In one embodiment of the present disclosure, after the answer-free question data and the answer data matching it are obtained, each question in the answer-free question data is paired with its answer data to obtain question answer pairs, and the first question-answer data is generated.
S504, generating second question-answer data based on the target answer data and the target question data;
In one embodiment of the present specification, after the target answer data and the target question data are obtained, the target answer data and the target question data are paired one-to-one into question answer pairs, and the second question-answer data is generated.
S506, confirming the first question and answer data and the second question and answer data as candidate question and answer data, confirming question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality.
In one embodiment of the present specification, the first question-answer data and the second question-answer data are taken together as candidate question-answer data, quality inspection is then performed on the candidate question-answer data, and the question-answer data is confirmed according to the question-answer quality of the candidate question-answer data. It can be understood that, after the candidate question-answer data is obtained, the quality of the question-answer data can be improved by judging the quality of the produced knowledge, filtering out low-quality question-answer pairs, merging similar knowledge, and checking and desensitizing sensitive information contained in the answers. Specifically, regular-expression rules can be used to check whether the answer data of the question-answer data contains sensitive information such as a user's license plate, phone number or address.
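A rule-based check of this kind can be sketched with Python's `re` module. The two patterns below (an 11-digit mainland mobile number and a simplified license plate shape) and the placeholder format are hypothetical examples, not rules taken from the specification; address detection usually needs more than a single regular expression and is omitted.

```python
import re

# Hypothetical patterns for illustration; production rules would be tuned and extended.
SENSITIVE_PATTERNS = {
    "phone": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "plate": re.compile(r"[\u4e00-\u9fa5][A-Z][A-Z0-9]{5,6}"),
}


def desensitize(answer):
    """Mask sensitive spans and report which kinds of information were found."""
    found = []
    for kind, pattern in SENSITIVE_PATTERNS.items():
        # subn returns the masked text and the number of replacements made.
        answer, count = pattern.subn(f"[{kind.upper()}]", answer)
        if count:
            found.append(kind)
    return answer, found
```

Returning the list of detected kinds alongside the masked text lets a caller choose between masking the answer and dropping the whole question-answer pair.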
Further, in an embodiment of the present specification, the question-answer quality includes a question quality, and the confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality includes the following steps S602 to S606:
S602, confirming the question quality of candidate question data in the candidate question-answer data based on the question data in the knowledge base;
S604, confirming invalid question data from the candidate question data based on the question quality;
In one embodiment of the present description, the question quality of the candidate question data in the candidate question-answer data may be confirmed based on the question data stored in the knowledge base. In a specific implementation, it can be confirmed whether the knowledge points contained in the candidate question data match the question data in the knowledge base; specifically, a classification model may be trained based on BERT to obtain a classification result indicating whether the question data contains knowledge points, and the classification result is used as the question quality. Candidate question data without knowledge points is taken as invalid question data, for example "OK, thank you" or "What should I do?". In addition, it can be judged whether the candidate question data contains meaningless or unclearly expressed questions; for example, the fluency of the question sentence can be calculated as the question quality to further confirm invalid question data.
S606, eliminating invalid question data in the candidate question-answer data to obtain question-answer data.
In one embodiment of the present specification, invalid question data among the candidate question-answer data is eliminated, and the remaining candidate question-answer data is taken as the question-answer data. It can be understood that if the invalid question data belongs to the first question-answer data, the corresponding answer data is likely meaningless as well and may be removed together with it; if the invalid question data belongs to the second question-answer data, it may merely be a non-standard way of asking about the target answer data, and whether the corresponding answer data also needs to be removed may be further confirmed.
Further, in an embodiment of the present specification, the question-answer quality includes a question-answer matching degree, and the confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality includes the following steps S702 to S704:
S702, confirming a second question-answer matching degree of candidate question data and candidate answer data in the candidate question-answer data;
S704, screening out question-answer data of which the second question-answer matching degree meets a preset condition from the candidate question-answer data based on the second question-answer matching degree.
In one embodiment of the present description, the candidate question-answer data includes candidate question data and corresponding candidate answer data, and the second question-answer matching degree between the candidate question data and the candidate answer data may be confirmed, so that question-answer data whose second question-answer matching degree satisfies a preset condition is selected from the candidate question-answer data. Specifically, the second question-answer matching degree can be measured by a question-answer semantic matching model trained based on BERT: sim_qa = sim(query, a), where query denotes the candidate question data and a denotes the candidate answer data. Of course, the question-answer semantic matching model can also be obtained by training other models such as QA-LSTM, which is not limited in this embodiment.
Further, in an embodiment of the present specification, the question-answer quality includes an answer occurrence frequency, and the confirming the question-answer quality of the candidate question-answer data, and the screening of the question-answer data from the candidate question-answer data based on the question-answer quality includes the following steps S802-S804:
S802, clustering candidate answer data in the candidate question-answer data to obtain candidate answer data class clusters;
In one embodiment of the present description, the question-answer quality includes an answer occurrence frequency, that is, how frequently candidate answer data occurs among the candidate question-answer data. Specifically, the text of the candidate answer data can be converted into vector form, and the vectors can then be clustered with a clustering algorithm to obtain candidate answer data class clusters. The clustering algorithm may be, for example, k-means or BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and is not particularly limited.
S804, confirming second occurrence frequency of the candidate answer data based on the class cluster size of each candidate answer data class cluster, and confirming question and answer data from the customer service answer data based on the second occurrence frequency.
It can be appreciated that users care more about high-frequency question-answer pairs, and the answer frequency score indirectly reflects how often an answer occurs in real dialogues and how much online traffic it covers. Therefore, by calculating the second occurrence frequency of the candidate answer data, the question-answer data that users care about most can be selected. After the candidate answer data class clusters are confirmed, the occurrence frequency of each class cluster can be confirmed according to its class cluster size, that is, the number of candidate answer data of the same class in the cluster, and is used as the second occurrence frequency of the candidate answer data in that class cluster. Answer data with a higher second occurrence frequency is then confirmed from the candidate answer data according to the second occurrence frequency of each class. Specifically, the second occurrence frequency may be expressed as an answer frequency score: freq_score(a) = class cluster size / max class cluster size, where the class cluster size is the size of the class cluster to which the candidate answer data a belongs, and the max class cluster size is the size of the largest class cluster among the candidate answer data class clusters.
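The answer frequency score is a simple normalization of cluster sizes, sketched below. The input is assumed to be a list of cluster labels, one per candidate answer, produced by whatever clustering algorithm is used; the function name is hypothetical.

```python
from collections import Counter


def answer_frequency_scores(cluster_labels):
    """Return, for each answer, its cluster size divided by the largest cluster size."""
    sizes = Counter(cluster_labels)
    if not sizes:
        return []
    max_size = max(sizes.values())
    # Every answer in the same cluster receives the same score in (0, 1].
    return [sizes[label] / max_size for label in cluster_labels]
```

Because the score is relative to the largest cluster, answers in the most common cluster always score 1.0, which makes a fixed cut-off comparable across batches of different sizes.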
Further, in an embodiment of the present specification, the question-answer quality includes answer quality, and the confirming the question-answer quality of the candidate question-answer data, and the screening of the question-answer data from the candidate question-answer data based on the question-answer quality includes the following steps S902-S906:
S902, confirming answer quality of candidate answer data in the candidate question-answer data;
S904, confirming invalid answer data from the candidate question-answer data based on the answer quality, and eliminating the invalid answer data in the candidate question-answer data to obtain question-answer data.
In one embodiment of the present specification, the question-answer quality includes the answer quality of the candidate question-answer data. Specifically, whether the candidate answer data includes a knowledge point may be determined by performing knowledge detection on the candidate answer data in the candidate question-answer data. A knowledge detection model can be used to give a classification result indicating whether knowledge points exist in the candidate answer data, and may adopt a model structure of BERT plus a classification layer. The classification result can be expressed as a probability between 0 and 1; the closer the probability is to 1, the more likely the candidate answer data contains knowledge points. By setting a probability threshold, invalid answer data whose probability is lower than the threshold can be screened out and rejected, thereby obtaining the question-answer data.
Further, in an embodiment of the present disclosure, the confirming the answer quality of the candidate answer data in the candidate question-answer data includes at least one of the following:
S9022, confirming the length of candidate answer data in the candidate question-answer data, and confirming the answer quality of the candidate question-answer data based on the length and a length threshold;
In a possible embodiment, the answer quality of the candidate answer data can be confirmed by its length, that is, the number of characters in the candidate answer data. It can be understood that when candidate answer data is too short, it likely contains no valid information, that is, no knowledge points; therefore, candidate answer data whose length is below the length threshold can be assigned a low answer quality and then filtered out.
S9024, confirming the language model perplexity of the candidate answer data, and confirming the answer quality of the candidate question-answer data based on the language model perplexity;
In one possible implementation, the answer quality of the candidate answer data may be confirmed by calculating the language model perplexity of each candidate answer data. Specifically, language model perplexity (PPL) estimates the probability of a sentence from the probabilities of the words in it; the lower the perplexity, the higher the answer quality of the candidate answer data. Methods for calculating language model perplexity are publicly known and are not described in detail here.
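For reference, the standard definition of perplexity is the exponential of the negative mean log-probability of the tokens. The sketch below assumes the per-token conditional probabilities have already been produced by some language model, which is not shown; the function name is hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sentence from its per-token conditional probabilities.

    PPL = exp(-(1/N) * sum(log p_i)); each p_i must be in (0, 1].
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))
```

A sentence whose tokens the model finds likely yields a low perplexity, so a candidate answer can be filtered when its perplexity exceeds a chosen threshold.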
S9026, confirming the answer quality of the candidate answer data based on the answer data in the knowledge base.
In one possible implementation, the answer quality of the candidate answer data may be determined according to the answer data in the knowledge base, that is, whether the candidate answer data includes a knowledge point may be determined according to the similarity between the candidate answer data and the answer data in the knowledge base. For example, a knowledge detection model may be trained based on answer data in a knowledge base, and the answer quality of candidate answer data is obtained according to the output score of the knowledge detection model.
Referring to fig. 7, fig. 7 is an overall flowchart of a method for generating question-answer data. For answer-free question data obtained from the service dialogue data, similar dialogue data is first recalled by calculating its match with the service dialogue data; the third similarity between the similar dialogue data and the answer-free question data is then calculated and ranked to obtain the first dialogue data; the first dialogue data and the answer-free question data are input into the dialogue question answer model, which outputs answer data matched with each answer-free question data based on a dialogue reading comprehension mechanism; and the answers are aggregated to obtain the answer data. In addition, the service dialogue data is clustered to obtain high-frequency answer data, and whether the high-frequency answer data contains "knowledge" is confirmed according to the answer data in the knowledge base, yielding the target answer data. The second dialogue data to which the target answer data belongs is obtained; the round distance and the first question-answer matching degree between the target answer data and the question data in the second dialogue data are confirmed, and the target question data is obtained by matching; the target question data may also be generated by splicing the target answer data and the second dialogue data and inputting them into the question generation model. Finally, the obtained target question data, target answer data, answer-free question data and answer data are confirmed as candidate question-answer data, quality inspection is performed on the candidate question-answer data, the candidate question-answer data is screened by detecting its question quality, question-answer matching degree, answer occurrence frequency and answer quality, and data desensitization is performed to obtain the question-answer data.
Optionally, after the question-answer data is obtained, quality inspection can be performed manually, so that the validity of the question-answer data is further ensured.
In the embodiment of the specification, the first question-answer data is generated based on the answer-free question data and the answer data, the second question-answer data is generated based on the target answer data and the target question data, the first question-answer data and the second question-answer data are confirmed to be candidate question-answer data, the question-answer quality of the candidate question-answer data is confirmed, the question-answer data is screened out of the candidate question-answer data based on the question-answer quality, the low-quality generated candidate question-answer data can be removed by detecting the question-answer quality, the quality of the generated question-answer data is improved, and the quality of a FAQ knowledge base constructed based on the question-answer data is further improved.
The question-answer data generating device provided in the embodiment of the present specification will be described in detail with reference to fig. 8. It should be noted that the question-answer data generating device in fig. 8 is used to execute the method of the embodiments shown in fig. 2 to 7 of the present specification. For convenience of explanation, only the portions relevant to the embodiments of the present specification are shown; for specific technical details not disclosed here, please refer to the embodiments shown in fig. 2 to 7 of the present specification.
Referring to fig. 8, a schematic diagram of the structure of the question-answer data generating device according to an exemplary embodiment of the present disclosure is shown. The question-answer data generation means may be implemented as all or part of the apparatus by software, hardware or a combination of both. The device 1 comprises an acquisition module 11, a dialogue confirmation module 12, an answer confirmation module 13, a question confirmation module 14 and a generation module 15.
An obtaining module 11, configured to obtain answer-free question data and target answer data in service dialogue data;
a dialogue confirmation module 12 for confirming first dialogue data associated with the answer-free question data from the service dialogue data, and confirming second dialogue data associated with the target answer data;
an answer confirming module 13, configured to confirm answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
a question confirmation module 14 for confirming target question data matching the target answer data based on the second dialogue data and the target answer data;
and the generating module 15 is configured to generate question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, where the question-answer data includes a plurality of question answer pairs, and the question answer pairs include question data and answer data corresponding to the question data.
Optionally, the acquiring module 11 is specifically configured to acquire service dialogue data, and cluster customer service answer data in the service dialogue data to obtain a customer service answer data cluster;
And confirming a first occurrence frequency of the customer service answer data based on the class cluster size of each customer service answer data class cluster, and confirming target answer data from the customer service answer data based on the first occurrence frequency.
Optionally, the acquiring module 11 is specifically configured to confirm high-frequency answer data from the customer service answer data based on the first occurrence frequency;
confirming a first similarity between the high-frequency answer data and answer data in a knowledge base;
and confirming target answer data from the high-frequency answer data based on the first similarity.
Optionally, the session confirmation module 12 is specifically configured to confirm a second similarity between the answer-free question data and the service session data, and confirm similar session data from the service session data based on the second similarity;
and confirming a third similarity of the answer-free question data and the user round data in the similar dialogue data, and confirming the first dialogue data from the similar dialogue data based on the third similarity.
Optionally, the session confirmation module 12 is specifically configured to match, in the service session data, a session segment to which the target answer data belongs;
And confirming second dialogue data associated with the target answer data based on the dialogue fragment.
Optionally, the answer confirming module 13 is specifically configured to input the first dialogue data and the answer-free question data into a dialogue question answer model, and output answer data matched with the answer-free question data from the dialogue question answer model.
Optionally, the question confirmation module 14 is specifically configured to confirm a round distance between the target answer data and the question data in the second dialogue data and a first question-answer matching degree;
and confirming target question data matched with the target answer data from the question data in the second dialogue data based on the round distance and the first question-answer matching degree.
Optionally, the question confirmation module 14 is specifically configured to splice the second dialogue data and the target answer data, input the spliced data into a question generation model, and output target question data corresponding to the target answer data by using the question generation model.
Optionally, the generating module 15 is specifically configured to generate first question-answer data based on the answer-free question data and the answer data;
generating second question-answer data based on the target answer data and the target question data;
and confirming the first question and answer data and the second question and answer data as candidate question and answer data, confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality.
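The merge-then-screen step can be sketched as below. The `QAPair` structure, the `source` labels, and the pluggable `quality_fn` are hypothetical names introduced for the example; the quality measure itself is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class QAPair:
    question: str
    answer: str
    source: str  # "first": answer produced for an answer-free question; "second": question produced for a target answer


def build_question_answer_data(
    first: List[Tuple[str, str]],
    second: List[Tuple[str, str]],
    quality_fn: Callable[[QAPair], float],
    min_quality: float,
) -> List[QAPair]:
    # Both kinds of pairs are treated as candidates on equal footing.
    candidates = [QAPair(q, a, "first") for q, a in first]
    candidates += [QAPair(q, a, "second") for q, a in second]
    # Keep only the candidates whose quality score reaches the threshold.
    return [p for p in candidates if quality_fn(p) >= min_quality]
```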
Optionally, the generating module 15 is specifically configured to confirm, based on the question data in the knowledge base, quality of the question of the candidate question data in the candidate question-answer data;
identifying invalid question data from the candidate question data based on the question quality;
and eliminating invalid question data in the candidate question-answering data to obtain question-answering data.
Optionally, the generating module 15 is specifically configured to confirm a second question-answer matching degree of the candidate question data and the candidate answer data in the candidate question-answer data;
and screening out question-answer data of which the second question-answer matching degree meets a preset condition from the candidate question-answer data based on the second question-answer matching degree.
Optionally, the generating module 15 is specifically configured to cluster candidate answer data in the candidate question-answer data to obtain a candidate answer data class cluster;
and confirming a second occurrence frequency of the candidate answer data based on the class cluster size of each candidate answer data class cluster, and confirming question and answer data from the customer service answer data based on the second occurrence frequency.
Optionally, the generating module 15 is specifically configured to confirm answer quality of candidate answer data in the candidate question-answer data;
and confirming invalid answer data from the candidate question-answer data based on the answer quality, and eliminating the invalid answer data in the candidate question-answer data to obtain question-answer data.
Optionally, the generating module 15 is specifically configured to confirm a length of candidate answer data in the candidate question-answer data, and confirm an answer quality of the candidate question-answer data based on the length and a length threshold;
confirming the language model perplexity of the candidate answer data, and confirming the answer quality of the candidate answer data based on the language model perplexity;
and confirming the answer quality of the candidate answer data based on the answer data in the knowledge base.
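The length check and the perplexity check can be sketched together as follows. The add-one-smoothed unigram model is a deliberately simple stand-in for whatever language model an implementation would use, and the thresholds, as well as the assumption that overly short answers are the invalid ones, are illustrative only.

```python
import math
from collections import Counter


def train_unigram(corpus):
    """Count tokens in a reference corpus (for example, knowledge base answers)."""
    counts = Counter(tok for sentence in corpus for tok in sentence.lower().split())
    return counts, sum(counts.values())


def perplexity(text, counts, total):
    """Per-token perplexity under an add-one-smoothed unigram model."""
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot reserves probability mass for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def is_valid_answer(text, counts, total, min_tokens=3, max_perplexity=50.0):
    # Length check: answers below the length threshold are treated as invalid.
    if len(text.split()) < min_tokens:
        return False
    # Fluency check: high perplexity suggests text unlike the reference corpus.
    return perplexity(text, counts, total) <= max_perplexity
```

Lower perplexity means the answer looks more like the reference text, so an answer made up of unseen tokens scores worse than one drawn from familiar vocabulary.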
It should be noted that, when the question-answer data generating device provided in the foregoing embodiment executes the question-answer data generating method, the division into the foregoing functional modules is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the question-answer data generating device and the question-answer data generating method provided in the foregoing embodiments belong to the same concept; the detailed implementation process is described in the method embodiments and is not repeated here.
The foregoing embodiment numbers of the present specification are merely for description, and do not represent advantages or disadvantages of the embodiments. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The embodiments of the present disclosure further provide a storage medium storing a computer program which, when executed by a processor, implements the question-answer data generation method of the embodiments shown in fig. 2 to fig. 7. For the specific execution process, refer to the description of the embodiments shown in fig. 2 to fig. 7, which is not repeated herein.
Referring to fig. 9, a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure is shown. The electronic device in this specification may include one or more of the following: processor 110, memory 120, input device 130, output device 140, and bus 150. The processor 110, the memory 120, the input device 130, and the output device 140 may be connected by a bus 150.
Processor 110 may include one or more processing cores. The processor 110 connects the various parts of the electronic device using various interfaces and lines, and performs the functions of the electronic device and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It will be appreciated that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 120 includes a non-transitory computer-readable storage medium (Non-Transitory Computer-Readable Storage Medium). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the method embodiments described above, and the like. The operating system may be an Android system (including systems developed in depth on the basis of Android), an iOS system developed by Apple Inc. (including systems developed in depth on the basis of iOS), or another system.
Memory 120 may be divided into an operating system space in which the operating system runs and a user space in which native and third party applications run. In order to ensure that different third party application programs can achieve better operation effects, the operating system allocates corresponding system resources for the different third party application programs. However, the requirements of different application scenarios in the same third party application program on system resources are different, for example, under the local resource loading scenario, the third party application program has higher requirement on the disk reading speed; in the animation rendering scene, the third party application program has higher requirements on the GPU performance. The operating system and the third party application program are mutually independent, and the operating system often cannot timely sense the current application scene of the third party application program, so that the operating system cannot perform targeted system resource adaptation according to the specific application scene of the third party application program.
In order to enable the operating system to distinguish the specific application scenes of a third-party application program, data communication between the third-party application program and the operating system needs to be established, so that the operating system can acquire the current scene information of the third-party application program at any time and perform targeted system resource adaptation based on the current scene.
The input device 130 is configured to receive input instructions or data, and includes, but is not limited to, a keyboard, a mouse, a camera, a microphone, or a touch device. The output device 140 is used to output instructions or data, and includes, but is not limited to, a display device, a speaker, and the like. In one example, the input device 130 and the output device 140 may be combined into a touch display screen.
The touch display screen may be designed as a full screen, a curved screen, or a special-shaped screen. The touch display screen may also be designed as a combination of a full screen and a curved screen, or a combination of a special-shaped screen and a curved screen; the embodiments of the present disclosure are not limited in this respect.
In addition, those skilled in the art will appreciate that the configuration of the electronic device shown in the above-described figures does not constitute a limitation of the electronic device, and the electronic device may include more or fewer components than illustrated, may combine certain components, or may have a different arrangement of components. For example, the electronic device may further include components such as a radio frequency circuit, an input unit, a sensor, an audio circuit, a WiFi module, a power supply, and a Bluetooth module, which are not described herein.
In the electronic device shown in fig. 9, the processor 110 may be configured to invoke a computer application program stored in the memory 120, and specifically perform the following operations:
acquiring answer-free question data and target answer data in service dialogue data;
confirming first dialogue data associated with the answer-free question data from the service dialogue data, and confirming second dialogue data associated with the target answer data;
confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
and generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
In one embodiment, the processor 110, when acquiring the target answer data in the service dialogue data, specifically performs the following operations:
acquiring service dialogue data, and clustering customer service answer data in the service dialogue data to obtain customer service answer data class clusters;
and confirming a first occurrence frequency of the customer service answer data based on the class cluster size of each customer service answer data class cluster, and confirming target answer data from the customer service answer data based on the first occurrence frequency.
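The clustering and frequency step can be sketched as below. Greedy single-pass clustering with a Jaccard threshold is an assumption made to keep the example short; the embodiment does not name a clustering algorithm, and `threshold` and `min_cluster_size` are hypothetical parameters.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (assumed similarity measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster_answers(answers, threshold=0.6):
    """Greedy clustering: each answer joins the first cluster whose representative it resembles."""
    clusters = []  # each cluster is a list; its first element acts as the representative
    for ans in answers:
        for cluster in clusters:
            if jaccard(ans, cluster[0]) >= threshold:
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def high_frequency_answers(answers, threshold=0.6, min_cluster_size=2):
    """Cluster size serves as the occurrence frequency; return one representative per frequent cluster."""
    return [c[0] for c in cluster_answers(answers, threshold) if len(c) >= min_cluster_size]
```

Returning a single representative per cluster keeps near-duplicate replies from producing many nearly identical answers downstream.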
In one embodiment, the processor 110, when confirming the first dialogue data associated with the answer-free question data from the service dialogue data, specifically performs the following operations:
confirming a second similarity of the answer-free question data and the service dialogue data, and confirming similar dialogue data from the service dialogue data based on the second similarity;
and confirming a third similarity of the answer-free question data and the user round data in the similar dialogue data, and confirming the first dialogue data from the similar dialogue data based on the third similarity.
In one embodiment, the processor 110, when confirming the answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data, specifically performs the following operations:
inputting the first dialogue data and the answer-free question data into a dialogue question answer model, and outputting, by the dialogue question answer model, answer data matched with the answer-free question data.
In one embodiment, the processor 110, when confirming the second dialogue data associated with the target answer data from the service dialogue data, specifically performs the following operations:
matching dialogue fragments to which the target answer data belong in the service dialogue data;
and confirming second dialogue data associated with the target answer data based on the dialogue fragment.
In one embodiment, the processor 110, when confirming the target question data matched with the target answer data based on the second dialogue data and the target answer data, specifically performs the following operations:
confirming the round distance and the first question-answer matching degree of the target answer data and the question data in the second dialogue data;
and confirming target question data matched with the target answer data from the question data in the second dialogue data based on the round distance and the first question-answer matching degree.
In one embodiment, the processor 110, when confirming the target question data matched with the target answer data based on the second dialogue data and the target answer data, specifically performs the following operations:
splicing the second dialogue data and the target answer data, inputting the spliced data into a question generation model, and outputting, by the question generation model, target question data corresponding to the target answer data.
In one embodiment, the processor 110, when generating question-answer data based on the answer-free question data, the answer data, the target answer data, and the target question data, specifically performs the following operations:
generating first question-answer data based on the answer-free question data and the answer data;
generating second question-answer data based on the target answer data and the target question data;
and confirming the first question and answer data and the second question and answer data as candidate question and answer data, confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality.
In one embodiment, the processor 110, when confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality, specifically performs the following operations:
based on the question data in the knowledge base, confirming the question quality of the candidate question data in the candidate question-answer data;
identifying invalid question data from the candidate question data based on the question quality;
and eliminating invalid question data in the candidate question-answering data to obtain question-answering data.
In one embodiment, the processor 110, when confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality, specifically performs the following operations:
confirming the second question-answer matching degree of the candidate question data and the candidate answer data in the candidate question-answer data;
and screening out question-answer data of which the second question-answer matching degree meets a preset condition from the candidate question-answer data based on the second question-answer matching degree.
In one embodiment, the processor 110, when confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality, specifically performs the following operations:
clustering the candidate answer data in the candidate question-answer data to obtain candidate answer data class clusters;
and confirming a second occurrence frequency of the candidate answer data based on the class cluster size of each candidate answer data class cluster, and confirming question and answer data from the customer service answer data based on the second occurrence frequency.
In one embodiment, the processor 110, when confirming the question-answer quality of the candidate question-answer data and screening the question-answer data from the candidate question-answer data based on the question-answer quality, specifically performs the following operations:
confirming the answer quality of candidate answer data in the candidate question-answer data;
and confirming invalid answer data from the candidate question-answer data based on the answer quality, and eliminating the invalid answer data in the candidate question-answer data to obtain question-answer data.
In one embodiment, the processor 110, when confirming the answer quality of candidate answer data in the candidate question-answer data, specifically performs at least one of the following operations:
confirming the length of candidate answer data in the candidate question-answer data, and confirming the answer quality of the candidate question-answer data based on the length and a length threshold;
confirming the language model perplexity of the candidate answer data, and confirming the answer quality of the candidate answer data based on the language model perplexity;
and confirming the answer quality of the candidate answer data based on the answer data in the knowledge base.
In the embodiments of the present specification, answer-free question data and target answer data in service dialogue data are acquired; first dialogue data associated with the answer-free question data and second dialogue data associated with the target answer data are confirmed from the service dialogue data; answer data matched with the answer-free question data is confirmed based on the first dialogue data and the answer-free question data; target question data matched with the target answer data is confirmed based on the second dialogue data and the target answer data; and question-answer data is generated based on the answer-free question data, the answer data, the target answer data, and the target question data. Corresponding answers are produced for the answer-free question data mined from the service dialogue, and high-quality target answer data is mined from the service dialogue and used to generate target question data, so that high-quality question-answer data is obtained and the efficiency of knowledge base construction is improved.
Further, service dialogue data is acquired, and the customer service answer data in the service dialogue data is clustered to obtain customer service answer data class clusters. A first occurrence frequency of the customer service answer data is confirmed based on the class cluster size of each customer service answer data class cluster, and high-frequency answer data is confirmed from the customer service answer data based on the first occurrence frequency, so that high-frequency customer service answers are mined while avoiding the mining of a large number of similar question-answer pairs. A first similarity between the high-frequency answer data and the answer data in a knowledge base is then confirmed, and target answer data is confirmed from the high-frequency answer data based on the first similarity; this detects whether a customer service answer carries "knowledge" and selects the target answer data that does. After the target answer data is confirmed, the dialogue segment to which the target answer data belongs may be matched in the service dialogue data, and second dialogue data associated with the target answer data may be confirmed based on the dialogue segment. The target question data may then be obtained in either of two ways: by confirming a round distance and a first question-answer matching degree between the target answer data and the question data in the second dialogue data, and confirming the target question data matched with the target answer data from that question data based on the round distance and the first question-answer matching degree; or by splicing the second dialogue data and the target answer data, inputting the spliced data into a question generation model, and outputting, by the question generation model, the target question data corresponding to the target answer data.
For a large amount of service dialogue data, based on a question-answer pair production pipeline of clustering, knowledge detection, and question production, target answer data of interest and the target question data corresponding to it can be extracted from a large volume of human dialogue logs, and question-answer pairs containing knowledge points can then be extracted.
Further, answer-free question data in the service dialogue data is acquired; a second similarity between the answer-free question data and the service dialogue data is confirmed, and similar dialogue data is confirmed from the service dialogue data based on the second similarity; a third similarity between the answer-free question data and the user round data in the similar dialogue data is confirmed, and the first dialogue data is confirmed from the similar dialogue data based on the third similarity; the first dialogue data and the answer-free question data are input into a dialogue question answer model, and the dialogue question answer model outputs answer data matched with the answer-free question data. Unanswered user questions in the service dialogue data are thus detected automatically. The first dialogue data is confirmed based, on the one hand, on the second similarity between the answer-free question data and the dialogue as a whole and, on the other hand, on the third similarity between the answer-free question data and the user round data, and the dialogue question answer model automatically produces the corresponding answer from the first dialogue data and the answer-free question data, so that the accuracy and efficiency of answer production are improved.
Further, first question-answer data is generated based on the answer-free question data and the answer data, and second question-answer data is generated based on the target answer data and the target question data; the first question-answer data and the second question-answer data are confirmed as candidate question-answer data, the question-answer quality of the candidate question-answer data is confirmed, and the question-answer data is screened from the candidate question-answer data based on the question-answer quality. By detecting the question-answer quality, low-quality candidate question-answer data can be removed, which improves the quality of the generated question-answer data and, in turn, the quality of an FAQ knowledge base constructed from the question-answer data.
Furthermore, embodiments of the present description provide a computer program product comprising a computer program which, when executed by a processor of an electronic device, causes the processor to at least implement a method as provided in the embodiments of fig. 1-7 described above.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is only illustrative of the preferred embodiments of the present invention and is not to be construed as limiting the scope of the claims; equivalent variations made in accordance with the claims of the present invention still fall within the scope of the present invention.

Claims (18)

1. A question-answer data generation method, the method comprising:
acquiring answer-free question data and target answer data in service dialogue data;
confirming first dialogue data associated with the answer-free question data from the service dialogue data, and confirming second dialogue data associated with the target answer data;
confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data;
confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
and generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise question data and answer data corresponding to the question data.
2. The method of claim 1, the acquiring target answer data in service dialogue data, comprising:
acquiring service dialogue data, and clustering customer service answer data in the service dialogue data to obtain customer service answer data class clusters;
and confirming a first occurrence frequency of the customer service answer data based on the class cluster size of each customer service answer data class cluster, and confirming target answer data from the customer service answer data based on the first occurrence frequency.
3. The method of claim 2, the confirming target answer data from the customer service answer data based on the first occurrence frequency, comprising:
confirming high-frequency answer data from the customer service answer data based on the first occurrence frequency;
confirming a first similarity between the high-frequency answer data and answer data in a knowledge base;
and confirming target answer data from the high-frequency answer data based on the first similarity.
4. The method of claim 1, the confirming first dialogue data associated with the answer-free question data from the service dialogue data, comprising:
confirming a second similarity of the answer-free question data and the service dialogue data, and confirming similar dialogue data from the service dialogue data based on the second similarity;
and confirming a third similarity of the answer-free question data and the user round data in the similar dialogue data, and confirming the first dialogue data from the similar dialogue data based on the third similarity.
5. The method of claim 1, the confirming answer data matched with the answer-free question data based on the first dialogue data and the answer-free question data, comprising:
inputting the first dialogue data and the answer-free question data into a dialogue question answer model, and outputting, by the dialogue question answer model, answer data matched with the answer-free question data.
6. The method of claim 1, the confirming second dialogue data associated with the target answer data from the service dialogue data, comprising:
matching dialogue fragments to which the target answer data belong in the service dialogue data;
and confirming second dialogue data associated with the target answer data based on the dialogue fragment.
7. The method of claim 1, the confirming target question data matched with the target answer data based on the second dialogue data and the target answer data, comprising:
confirming the round distance and the first question-answer matching degree of the target answer data and the question data in the second dialogue data;
and confirming target question data matched with the target answer data from the question data in the second dialogue data based on the round distance and the first question-answer matching degree.
8. The method of claim 1, the confirming target question data matched with the target answer data based on the second dialogue data and the target answer data, comprising:
splicing the second dialogue data and the target answer data, inputting the spliced data into a question generation model, and outputting, by the question generation model, target question data corresponding to the target answer data.
9. The method of claim 1, the generating question-answer data based on the answer-free question data, the answer data, the target answer data, and the target question data, comprising:
generating first question-answer data based on the answer-free question data and the answer data;
generating second question-answer data based on the target answer data and the target question data;
and confirming the first question and answer data and the second question and answer data as candidate question and answer data, confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality.
10. The method of claim 9, the question-answer quality comprising a question quality;
the step of confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality comprises the following steps:
based on the question data in the knowledge base, confirming the question quality of the candidate question data in the candidate question-answer data;
identifying invalid question data from the candidate question data based on the question quality;
and eliminating invalid question data in the candidate question-answering data to obtain question-answering data.
11. The method of claim 9, the question-answer quality comprising a degree of question-answer matching;
the step of confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality comprises the following steps:
confirming the second question-answer matching degree of the candidate question data and the candidate answer data in the candidate question-answer data;
and screening out question-answer data of which the second question-answer matching degree meets a preset condition from the candidate question-answer data based on the second question-answer matching degree.
12. The method of claim 9, the question-answer quality comprising answer frequency of occurrence;
the step of confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality comprises the following steps:
clustering the candidate answer data in the candidate question-answer data to obtain candidate answer data class clusters;
and confirming a second occurrence frequency of the candidate answer data based on the class cluster size of each candidate answer data class cluster, and confirming question and answer data from the customer service answer data based on the second occurrence frequency.
13. The method of claim 9, the question-answer quality comprising answer quality;
the step of confirming the question and answer quality of the candidate question and answer data, and screening the question and answer data from the candidate question and answer data based on the question and answer quality comprises the following steps:
confirming the answer quality of candidate answer data in the candidate question-answer data;
and confirming invalid answer data from the candidate question-answer data based on the answer quality, and eliminating the invalid answer data in the candidate question-answer data to obtain question-answer data.
14. The method of claim 13, said confirming answer quality of candidate answer data in said candidate question-answer data comprising at least one of:
confirming the length of candidate answer data in the candidate question-answer data, and confirming the answer quality of the candidate question-answer data based on the length and a length threshold;
confirming the language-model perplexity of the candidate answer data, and confirming the answer quality of the candidate answer data based on the perplexity;
and confirming the answer quality of the candidate answer data based on the answer data in the knowledge base.
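
Two of the three answer-quality checks in claim 14, the length check and the language-model perplexity check, can be sketched as follows. The claim does not name a language model or any thresholds, so this sketch assumes a toy unigram model with add-one smoothing, a minimum token count as the length threshold, and a maximum perplexity as the quality cutoff. All function names and default values are hypothetical, and the knowledge-base check is not covered.

```python
import math
from collections import Counter


def build_unigram_model(corpus):
    """Count lowercase whitespace tokens; reserve one extra vocabulary slot for unseen tokens."""
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1
    return counts, total, vocab_size


def unigram_perplexity(text, model):
    """Perplexity exp(-(1/N) * sum(log p(token))) under the add-one smoothed unigram model."""
    counts, total, vocab_size = model
    tokens = text.lower().split()
    if not tokens:
        return float("inf")
    log_prob = sum(
        math.log((counts.get(tok, 0) + 1) / (total + vocab_size)) for tok in tokens
    )
    return math.exp(-log_prob / len(tokens))


def answer_is_valid(answer, model, min_tokens=3, max_perplexity=20.0):
    """An answer passes if it is long enough and not too surprising to the model."""
    if len(answer.split()) < min_tokens:
        return False
    return unigram_perplexity(answer, model) <= max_perplexity
```

A lower perplexity means the answer looks more like the reference text the model was built from. A real deployment would likely use a neural language model, with thresholds tuned on labelled data.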
15. A question-answer data generation device, comprising:
the acquisition module is used for acquiring answer-free question data and target answer data in the service dialogue data;
a dialogue confirmation module for confirming first dialogue data associated with the answer-free question data from the service dialogue data and confirming second dialogue data associated with the target answer data;
the answer confirming module is used for confirming answer data matched with the non-answer question data based on the first dialogue data and the non-answer question data;
a question confirmation module for confirming target question data matched with the target answer data based on the second dialogue data and the target answer data;
and the generating module is used for generating question-answer data based on the answer-free question data, the answer data, the target answer data and the target question data, wherein the question-answer data comprises a plurality of question answer pairs, and the question answer pairs comprise the question data and answer data corresponding to the question data.
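
The way the five modules of claim 15 fit together can be sketched as a small pipeline. This shows only the data flow, not the patented implementation: each module is modelled as a plain callable supplied by the caller, and the class name `QAGenerator`, its field names, and the list-of-strings dialogue representation are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Dialogue = List[str]
QAPair = Tuple[str, str]


@dataclass
class QAGenerator:
    # Acquisition module: returns (answer-free questions, target answers).
    acquire: Callable[[Dialogue], Tuple[List[str], List[str]]]
    # Dialogue confirmation module: returns dialogue data associated with an utterance.
    find_context: Callable[[Dialogue, str], Dialogue]
    # Answer confirmation module: returns an answer matched to an answer-free question.
    match_answer: Callable[[Dialogue, str], str]
    # Question confirmation module: returns a question matched to a target answer.
    match_question: Callable[[Dialogue, str], str]

    def generate(self, dialogue: Dialogue) -> List[QAPair]:
        """Generating module: assemble question-answer pairs from both directions."""
        questions, answers = self.acquire(dialogue)
        pairs: List[QAPair] = []
        for question in questions:
            context = self.find_context(dialogue, question)
            pairs.append((question, self.match_answer(context, question)))
        for answer in answers:
            context = self.find_context(dialogue, answer)
            pairs.append((self.match_question(context, answer), answer))
        return pairs
```

The two loops mirror the two directions in the claim: finding answers for questions that have none, and finding questions for answers that are already known.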
16. An electronic device, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor to perform the steps of the method according to any one of claims 1 to 14.
17. A storage medium storing a computer program which, when executed by a processor, implements the steps of the method of any one of claims 1 to 13.
18. A computer program product comprising a computer program which, when executed by a processor of an electronic device, causes the processor to perform the steps of the method according to any one of claims 1 to 13.
CN202310835504.4A 2023-07-07 2023-07-07 Question-answer data generation method, device, equipment and storage medium Pending CN117573816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310835504.4A CN117573816A (en) 2023-07-07 2023-07-07 Question-answer data generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117573816A (en) 2024-02-20

Family

ID=89863085

Country Status (1)

Country Link
CN (1) CN117573816A (en)

Similar Documents

Publication Publication Date Title
CN103853703B (en) A kind of information processing method and electronic equipment
CN111428010B (en) Man-machine intelligent question-answering method and device
CN108447471A (en) Audio recognition method and speech recognition equipment
CN107844470B (en) Voice data processing method and equipment thereof
CN111462741B (en) Voice data processing method, device and storage medium
CN111312233A (en) Voice data identification method, device and system
CN112669842A (en) Man-machine conversation control method, device, computer equipment and storage medium
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN114064943A (en) Conference management method, conference management device, storage medium and electronic equipment
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN113643706B (en) Speech recognition method, device, electronic equipment and storage medium
CN113763925B (en) Speech recognition method, device, computer equipment and storage medium
CN115620726A (en) Voice text generation method, and training method and device of voice text generation model
CN117573816A (en) Question-answer data generation method, device, equipment and storage medium
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN112307186A (en) Question-answering service method, system, terminal device and medium based on emotion recognition
CN112632241A (en) Method, device, equipment and computer readable medium for intelligent conversation
CN112786041A (en) Voice processing method and related equipment
CN115934920B (en) Model training method for man-machine conversation and related device
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117933384A (en) Map generation method, device, equipment and storage medium
CN116932716A (en) Answer generation method, device, equipment and storage medium
CN115841810A (en) Voice processing method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination