CN117762795A - Method, device, equipment and storage medium for testing abstract generation task

Info

Publication number: CN117762795A
Application number: CN202311788743.5A
Authority: CN
Other languages: Chinese (zh)
Inventors: 李思远, 王枫
Assignee: Beijing Zitiao Network Technology Co Ltd
Legal status: Pending

Abstract

The application discloses a method, a device, equipment, and a storage medium for testing a summary generation task. The testing method of the summary generation task comprises the following steps: acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data; executing a summary generation task on the text data to obtain a predicted summary of the text data; and performing a performance test on the summary generation task based on the reference summary, the predicted summary, and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structured constraint index.

Description

Method, device, equipment and storage medium for testing abstract generation task
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for testing a summary generation task.
Background
With the advent of the big data age, the amount of information has been increasing explosively, and so has the need to process and obtain information. Under such a background, an automatic text summarization technology has been developed, which can help users to quickly understand the main content of text, save reading time and improve working efficiency.
In various application scenarios, the abstract generation task has high practical value. Evaluating the abstract generation task in an application scenario provides a reference basis for improving abstract quality. At present, testing of the abstract generation task is mainly performed by manual evaluation. However, manual evaluation covers only a single dimension, cannot adequately reflect the abstract generation effect in the application scenario, and therefore has certain limitations.
Disclosure of Invention
The embodiment of the application aims to provide a method, a device, equipment and a storage medium for testing a summary generation task, which are used for more accurately testing the effect of the summary generation task and providing a reference basis for further optimizing the summary generation task and improving the summary quality.
In order to achieve the above purpose, the embodiment of the present application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for testing a task for generating a summary, including:
acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
executing a summary generation task on the text data to obtain a predicted summary of the text data;
and performing performance test on the abstract generating task based on the reference abstract, the prediction abstract and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
In a second aspect, an embodiment of the present application provides a testing device for a task of generating a summary, including:
an acquisition unit, used for acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
the generation unit is used for executing a summary generation task on the text data to obtain a predicted summary of the text data;
and the testing unit is used for performing performance test on the abstract generating task based on the reference abstract, the prediction abstract and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the testing method of the abstract generation task according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the testing method of the abstract generation task according to the first aspect.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
The text data and its reference abstract are used as a test data set, so as to better reflect the actual requirements and characteristics of the abstract task and improve the fit and practicability of the test. On this basis, the corresponding predicted abstract is obtained by executing the abstract generation task on the text data in the test data set, and a performance test is performed on the abstract generation task based on the test data set and the generated predicted abstract, synthesizing multiple dimensions, so that the execution effect of the abstract generation task can be tested more comprehensively and systematically, and a reference basis is provided for further optimizing the abstract generation task and improving abstract quality.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flow chart of a method for testing a summary generation task according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a testing method of a summary generation task according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating a testing method of a summary generation task according to another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a testing device for a task of generating a summary according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
To make the purposes, technical solutions, and advantages of the present application clearer, the technical solutions of the present application will be described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
It should be understood that the various steps recited in the method embodiments of this document may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of this document is not limited in this respect.
The term "comprising" and variations thereof as used in this document are open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.
It should be noted that references to "a", "an", and "a plurality" in this document are illustrative rather than limiting, and those skilled in the art will appreciate that they should be interpreted as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the devices in the present document embodiment are for illustrative purposes only and are not intended to limit the scope of such messages or information.
At present, testing of the abstract generation task is mainly performed by manual evaluation. However, manual evaluation covers only a single dimension, cannot adequately reflect the abstract generation effect in the application scenario, and therefore has certain limitations.
Based on this, the embodiments of the present application provide a method, a device, equipment, and a storage medium for testing an abstract generation task, which take text data and its reference abstract as a test data set, so as to better reflect the actual requirements and characteristics of the abstract task and improve the fit and practicability of the test. On this basis, the corresponding predicted abstract is obtained by executing the abstract generation task on the text data in the test data set, and a performance test is performed on the abstract generation task based on the test data set and the generated predicted abstract, synthesizing multiple dimensions, so that the execution effect of the abstract generation task can be tested more comprehensively and systematically, and a reference basis is provided for further optimizing the abstract generation task and improving abstract quality.
It should be understood that the testing method of the summary generating task provided in the embodiments of the present application may be executed by an electronic device, and in particular, may be executed by a processor of the electronic device. The electronic device may be a terminal device, such as a smart phone, tablet computer, notebook computer, desktop computer, intelligent voice interaction device, vehicle-mounted terminal, etc.; alternatively, the electronic device may be a server, such as an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides a cloud computing service.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Referring to fig. 1, a flowchart of a testing method for a summary generation task according to an embodiment of the present application is provided, where the method may include the following steps:
s102, acquiring a summary task data set.
The summary task data set includes text data and a reference summary of the text data. The reference abstract can be obtained by manually labeling the text data. To more fully perform performance testing on summary generation tasks, text data in the summary task data set is derived from a variety of application scenarios, including, for example, but not limited to: instant messaging scenes, document scenes, meeting scenes, mail scenes, etc.
In the instant messaging scenario, real single-chat and group-chat messages can be extracted from at least one round of dialogue and used as text data for that application scenario. In the document scenario, knowledge texts from an office-domain knowledge base, weekly meeting records, daily reports, technical scheme documents, product solution documents, and the like can be used as text data. In the meeting scenario, corresponding text data can be generated from the content of meetings such as weekly meetings, technical sharing meetings, and technical discussion meetings. In the mail scenario, mail content text can be extracted from various mails, such as all-staff mails, notification mails, and daily communication mails, and used as text data.
In practical application, considering the difficulty of abstract data annotation, the data magnitude of each application scenario can be limited. For example, for each application scenario, the collected long texts are randomly segmented and sampled according to a preset length to obtain text data, which are then added to the abstract task data set.
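As an illustrative sketch only (the segment length, per-scenario cap, and function names below are assumptions, not part of the disclosure), this segmentation-and-sampling step might look as follows:

```python
import random

def sample_text_segments(long_texts, seg_len=2000, per_scene_limit=100, seed=42):
    """Randomly segment collected long texts by a preset length and keep a
    bounded number of segments per application scenario, limiting the
    annotation workload. All parameter values here are illustrative."""
    rng = random.Random(seed)
    segments = []
    for text in long_texts:
        # Split each long text into consecutive chunks of the preset length.
        segments.extend(text[i:i + seg_len] for i in range(0, len(text), seg_len))
    rng.shuffle(segments)
    return segments[:per_scene_limit]  # cap the data magnitude for this scenario
```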
In addition, considering that text data in some application scenarios contains sensitive information, the text data is desensitized after being acquired from each application scenario. As one example, the desensitization process modifies sensitive information in the text data into fictional information by manual construction, while keeping the overall structure of the text data unchanged. Sensitive information includes company names, personnel information, dates, document links, project parameters, technical details, organizational structure information, and the like. For example, a specific company name is replaced with a generic name, and a user name is replaced with a user serial number.
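A minimal sketch of such a desensitization pass is shown below; the replacement rules and name patterns are hypothetical examples, since the actual rules are constructed manually per scenario:

```python
import re

# Hypothetical rule: map known sensitive company names to a generic name.
COMPANY_PATTERN = re.compile(r"AcmeCorp|ExampleTech")  # assumed sensitive names

def desensitize(text, user_names):
    """Replace sensitive information with fictional stand-ins while keeping
    the overall structure of the text unchanged (illustrative only)."""
    text = COMPANY_PATTERN.sub("SomeCompany", text)
    for i, name in enumerate(sorted(user_names)):
        text = text.replace(name, f"User{i:03d}")  # user name -> serial number
    return text
```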
S104, executing a summary generation task on the text data in the summary task data set to obtain a predicted summary of the text data.
The task of generating the abstract of the text data may be performed in various ways, which is not limited in this embodiment of the application. In one embodiment, words related to the abstract may be extracted from the text data, and the extracted words may be combined to obtain a predicted abstract of the text data.
In another embodiment, a pre-trained abstract generation model may be utilized to perform an abstract generation process on the text data to obtain a corresponding predicted abstract. Specifically, considering that the notion of an abstract is relatively broad, different definitions of the abstract lead to different predicted abstracts being generated from the same text data. In order to accurately test the execution effect of the abstract generation task, a preset second prompt text and the text data can be input into the abstract generation model to perform abstract generation, so that a predicted abstract of the text data is obtained.
The second prompt text is used to describe the definition of the summary. For example, the second prompt text may define the length of the summary to be generated, such as "please help me summarize the text in one sentence"; the second prompt text may define the topic of the summary to be generated, such as "please summarize the parts of this text related to 'large model deployment'"; the second prompt text may define the format of the summary to be generated, such as "please help me summarize the text in the form of a list". Of course, the second prompt text may also impose no constraint on the summary to be generated, such as "please help me summarize the text".
The abstract generation model may employ various large pre-trained language models, such as GPT-3.5, GPT-4, etc., which is not limited in this embodiment of the application. The abstract generation model may be obtained by pre-training a large language model (Large Language Model, LLM) on a large amount of unlabeled text data and then fine-tuning the pre-trained model on a labeled data set for the abstract task. Because a large pre-trained language model is a neural-network-based deep learning model, rich semantic representations can be learned by pre-training on a large amount of unlabeled text data; fine-tuning is then performed on the specific abstract generation task, and the abstract generation model obtained through training can accurately execute the text abstract task.
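As a hedged sketch of this step (the client object and its complete() method are placeholders, since the embodiments do not prescribe a particular model API), feeding the second prompt text and the text data to the model might look like this:

```python
def generate_predicted_summary(model_client, second_prompt, text_data):
    """Input a preset second prompt text and the text data into a summary
    generation model and return the predicted summary. `model_client` is a
    placeholder for whatever LLM client is actually used (e.g. a GPT-class model)."""
    return model_client.complete(f"{second_prompt}\n\n{text_data}")

# Example second prompt texts mirroring the constraints described above.
SECOND_PROMPTS = {
    "length": "Please summarize the key points of the text in about 100 words.",
    "format": "Please help me summarize the text in the form of a list.",
    "topic": "Please summarize the parts of this text related to 'large model deployment'.",
    "language": "Please summarize the abstract of the Chinese text in English.",
}
```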
The present embodiment herein shows a part of the implementation of S104 described above. Of course, it should be understood that S104 may be implemented in other manners, which are not limited in this embodiment of the present application.
S106, performing a performance test on the abstract generation task based on the reference abstract and the predicted abstract of the text data and the multi-dimensional evaluation index.
The multi-dimensional evaluation index is an index capable of comprehensively and objectively evaluating the abstract generation effect from multiple dimensions in an application scenario. The multi-dimensional evaluation index may include a semantic understanding index and a structured constraint index.
The semantic understanding index describes the semantic difference between the predicted abstract and the reference abstract of the text data, and reflects the semantic understanding performance of the abstract generation task. In particular, semantic understanding indices may include, but are not limited to, factuality, integrity, readability, and the like. Factuality reflects whether the main content of the text data is correctly understood while executing the abstract generation task, for example whether the predicted abstract contains fabricated content, whether inferred content is reasonable and accords with human cognition, and whether semantic understanding errors exist. Integrity reflects whether the predicted abstract covers the content of the text data and whether anything is omitted, such as the coverage of text topics and core keywords, and whether main content in the text (e.g., emphases, pain points, key questions) is missed. Readability reflects whether the content of the predicted abstract is clearly readable, such as whether it is well organized, whether the sentences are concise, whether there are duplicated content and redundant descriptions, and whether key content is emphasized.
The structured constraint index reflects whether the predicted abstract meets structural requirements. The structured constraint index may comprise at least one of the following: summary length, summary format, summary language, summary topic, and the like.
In the case where the predicted abstract is obtained by executing the abstract generation task on the text data based on the second prompt text, the structured constraint index may be derived from the second prompt text. As an example, before S106, the second prompt text is parsed to obtain the abstract constraint conditions it contains; further, the structured constraint index is determined based on the abstract constraint conditions. The abstract constraint conditions constrain the structure of the abstract and may include, for example but not limited to, at least one of the following: a summary length condition, a summary format condition, a summary topic condition, a summary language condition, and the like.
For example, by parsing the second prompt text "please summarize the parts of this text related to 'large model deployment'", the summary topic condition 'large model deployment' contained in the second prompt text is obtained, and the summary topic can then be used as a structured constraint index to describe whether the content of the predicted abstract conforms to 'large model deployment'.
For another example, by parsing the second prompt text "please summarize the key points of the text in about 100 words", the summary length condition "about 100 words" is obtained, and the summary length can then be used as a structured constraint index to describe whether the length of the predicted abstract satisfies the summary length condition.
For another example, by parsing the second prompt text "please help me summarize the text in the form of a list", the summary format condition "list" is obtained, and the summary format can then be used as a structured constraint index to describe whether the format of the predicted abstract satisfies the summary format condition.
For another example, by parsing the second prompt text "please summarize the abstract of the Chinese text in English", the summary language condition "English" is obtained, and the summary language can then be used as a structured constraint index to describe whether the language of the predicted abstract satisfies the summary language condition.
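A minimal parser in this spirit is sketched below; the keyword patterns are assumptions tied to the example prompts above, and a real implementation would depend on the actual prompt templates:

```python
import re

def parse_summary_constraints(second_prompt):
    """Derive abstract constraint conditions (and thus structured constraint
    indices) from the second prompt text. Patterns are illustrative only."""
    constraints = {}
    m = re.search(r"about (\d+) words", second_prompt)
    if m:
        constraints["length"] = int(m.group(1))      # summary length condition
    if "form of a list" in second_prompt:
        constraints["format"] = "list"               # summary format condition
    if "in English" in second_prompt:
        constraints["language"] = "en"               # summary language condition
    m = re.search(r"related to '([^']+)'", second_prompt)
    if m:
        constraints["topic"] = m.group(1)            # summary topic condition
    return constraints
```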
Through step S106, the performance test is performed on the abstract generation task based on the test data set and the generated predicted abstract, synthesizing the multi-dimensional evaluation index, so that the execution effect of the abstract generation task can be tested more comprehensively and systematically, and a reference basis is provided for further optimizing the abstract generation task and improving abstract quality.
In an embodiment, the step S106 may include the following steps:
and S161, determining a quality weight corresponding to the multi-dimensional evaluation index of the prediction abstract based on the reference abstract and a test strategy corresponding to the multi-dimensional evaluation index.
As an example, the test strategy corresponding to the semantic understanding index includes determining a quality weight through a language model used for testing, and the language model can be chosen according to actual needs, such as GPT-4. In this case, the quality weight of the predicted abstract in the semantic understanding dimension may be determined as follows: generating a first prompt text for describing the semantic understanding index; and performing semantic scoring on the predicted abstract based on the first prompt text and the reference abstract by using the language model, so as to obtain the quality weight of the predicted abstract under the semantic understanding index.
For example, for the factuality semantic understanding index, a first prompt text "with reference to the reference abstract, please evaluate whether the predicted abstract accurately describes the main content of the text data, and give a weight and a specific scoring reason" may be generated, and the first prompt text, the predicted abstract, and the reference abstract are input into the language model for semantic scoring, so as to obtain the quality weight of the predicted abstract under this semantic understanding index.
For the integrity semantic understanding index, a first prompt text "with reference to the reference abstract, please evaluate whether the predicted abstract covers the content of the text data, and give a weight and a specific scoring reason" may be generated, and the first prompt text, the predicted abstract, and the reference abstract are input into the language model for semantic scoring, so as to obtain the quality weight of the predicted abstract under this semantic understanding index.
For the readability semantic understanding index, a first prompt text "with reference to the reference abstract, please evaluate whether the content description of the predicted abstract is clear and readable, and give a weight and a specific scoring reason" may be generated, and the first prompt text, the predicted abstract, and the reference abstract are input into the language model for semantic scoring, so as to obtain the quality weight of the predicted abstract under this semantic understanding index.
It should be noted that, in practical application, for each semantic understanding index, the semantic scoring criteria of that index may also be input into the language model together, so that the language model can perform semantic scoring accordingly. For example, Table 1 below shows semantic scoring criteria for a semantic understanding index.
TABLE 1
It can be understood that the semantic understanding index relates to the semantic understanding of the prediction abstract, and the semantic understanding and evaluation of the prediction abstract are performed by means of a language model and prompt engineering technology, so that the accuracy and efficiency of the dimension test of the semantic understanding index can be improved.
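An LLM-as-judge sketch of this scoring step follows; the prompt wording, the judge client, and the expectation of a parseable numeric weight are illustrative assumptions rather than the patented prompt engineering:

```python
def score_semantic_index(judge_model, index_name, predicted, reference, criteria=""):
    """Ask a language model used for testing (e.g. a GPT-4-class judge) to
    score the predicted abstract against the reference abstract on one
    semantic understanding index, returning the raw reply for parsing."""
    first_prompt = (
        f"With reference to the reference abstract, evaluate the predicted "
        f"abstract on the '{index_name}' index; give a weight between 0 and 1 "
        f"and a specific scoring reason.\n{criteria}\n"
        f"Reference abstract: {reference}\nPredicted abstract: {predicted}"
    )
    return judge_model.complete(first_prompt)  # caller parses weight + reason
```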
As another example, the test strategy corresponding to the structured constraint index includes a preset correspondence between the index data of the structured constraint index and quality weights. In this case, the quality weight of the predicted abstract corresponding to the structured constraint index is determined as follows: generating a test script based on the structured constraint index and the preset correspondence; and analyzing the predicted abstract by using the test script, so as to obtain the quality weight of the predicted abstract under the structured constraint index. The preset correspondence can be determined according to the abstract constraint conditions contained in the second prompt text.
For example, table 2 below shows a preset correspondence between index data of a partially structured constraint index and quality weights.
TABLE 2
Based on the preset correspondence shown in Table 2 above, the test script is generated using if-else statements and return statements.
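For instance, a generated test script for the summary-length constraint could take the following form; the mapping from deviation to quality weight below is an assumed stand-in for the preset correspondence in Table 2, whose concrete values are not reproduced here:

```python
def score_summary_length(predicted, target_len=100, tolerance=0.2):
    """Auto-generated style test script for the summary length constraint,
    built from if-else and return statements. Word counting and thresholds
    are illustrative; Chinese text would count characters instead."""
    deviation = abs(len(predicted.split()) - target_len) / target_len
    if deviation <= tolerance:
        return 1.0   # within the allowed range: full quality weight
    elif deviation <= 2 * tolerance:
        return 0.5   # moderately outside the range: partial quality weight
    else:
        return 0.0   # far outside the constraint: zero quality weight
```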
It can be understood that the structured constraint indices mainly evaluate the ability of the abstract generation task to structure the text, such as its format, length, and language, and do not involve semantic understanding; automatically evaluating the structured constraint indices by script therefore saves processing resources and improves test efficiency.
S162, determining the performance weight of the abstract generation task based on the quality weight of the predicted abstract corresponding to the multi-dimensional evaluation index.
The quality weight corresponding to the multi-dimensional evaluation index of the predicted abstract can reflect the execution effect of the abstract generation task in the individual dimension. The quality weight corresponding to the multidimensional evaluation index of all the prediction summaries is synthesized, so that the execution effect of the summary generation task can be comprehensively and systematically tested, and a reference basis is provided for further optimizing the summary generation task and improving the summary quality.
As an example, the step S162 may include the steps of: and carrying out weighted summation on the quality weight corresponding to the multi-dimensional evaluation index of the prediction abstract based on the preset weight of each evaluation index in the multi-dimensional evaluation index to obtain the performance weight of the abstract generation task. The preset weight of each evaluation index can be preset according to actual needs or expert experience.
As another example, the step S162 may include the steps of: determining the weight of each evaluation index in the multi-dimensional evaluation index based on the minimum value in the quality weights corresponding to the multi-dimensional evaluation index; and carrying out weighted summation on the quality weight corresponding to the multi-dimensional evaluation index of the prediction abstract based on the weight of each evaluation index to obtain the performance weight of the abstract generation task.
Specifically, for the evaluation index corresponding to the minimum value, the first weight is greater than the second weight, where the first weight is the weight applied when the minimum value is less than or equal to a preset weight, and the second weight is the weight applied when the minimum value is greater than the preset weight.
For example, assume that the multi-dimensional evaluation index includes evaluation indices in the three dimensions of factuality, integrity, and readability. The evaluation indices are ranked in ascending order of quality weight, and the evaluation index corresponding to the minimum quality weight is determined. If the minimum quality weight is less than or equal to the preset weight, the weights of the ranked evaluation indices are 0.6, 0.2, and 0.2 in sequence, so the performance weight of the abstract generation task is MOS = 0.6 × score[0] + 0.2 × score[1] + 0.2 × score[2]. If the minimum quality weight is greater than the preset weight, the weights are 1/3, 1/3, and 1/3 in sequence, so MOS = 1/3 × score[0] + 1/3 × score[1] + 1/3 × score[2]. Here score[0], score[1], and score[2] denote the quality weights of the first-, second-, and third-ranked evaluation indices, respectively.
For another example, assume that the multi-dimensional evaluation index includes the four evaluation indices of factuality, integrity, readability, and the structured constraint index. The evaluation indices are ranked in ascending order of quality weight, and the evaluation index corresponding to the minimum quality weight is determined. If the minimum quality weight is less than or equal to the preset weight, the weights of the ranked evaluation indices are 0.6, 0.2, 0.1, and 0.1 in sequence, so MOS = 0.6 × score[0] + 0.2 × score[1] + 0.1 × score[2] + 0.1 × score[3]. If the minimum quality weight is greater than the preset weight, the weights are 0.3, 0.3, 0.2, and 0.2 in sequence, so MOS = 0.3 × score[0] + 0.3 × score[1] + 0.2 × score[2] + 0.2 × score[3]. Here score[0] through score[3] denote the quality weights of the first- through fourth-ranked evaluation indices, respectively.
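The minimum-driven weighting can be summarized in a short sketch; the 0.6 threshold and the branch on three versus four indices follow the examples above, while the function name and input format are assumptions:

```python
def performance_weight(quality_weights, preset_weight=0.6):
    """Compute the performance weight (MOS) of the abstract generation task
    from per-index quality weights, weighting the weakest index most heavily
    when it falls at or below the preset weight."""
    score = sorted(quality_weights)  # ascending: score[0] is the minimum
    if len(score) == 3:
        weights = [0.6, 0.2, 0.2] if score[0] <= preset_weight else [1/3, 1/3, 1/3]
    elif len(score) == 4:
        weights = [0.6, 0.2, 0.1, 0.1] if score[0] <= preset_weight else [0.3, 0.3, 0.2, 0.2]
    else:
        raise ValueError("this sketch covers the 3- and 4-index examples only")
    return sum(w * s for w, s in zip(weights, score))
```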
It can be appreciated that the weight of each evaluation index is determined based on the minimum quality weight, so that the influence of the evaluation index with low quality weight on the comprehensive performance of the abstract generating task can be focused more in the test process, and powerful support is provided for optimizing the abstract generating task.
The present embodiment herein shows a part of the implementation of S106 described above. Of course, it should be understood that S106 described above may also be implemented in other ways.
According to the testing method for the abstract generation task provided in this embodiment, the text data and its reference abstract are used as a test data set, so as to better reflect the actual requirements and characteristics of the abstract task and improve the fit and practicability of the test. On this basis, the corresponding predicted abstract is obtained by executing the abstract generation task on the text data in the test data set, and a performance test is performed on the abstract generation task based on the test data set and the generated predicted abstract, synthesizing multiple dimensions, so that the execution effect of the abstract generation task can be tested more comprehensively and systematically, and a reference basis is provided for further optimizing the abstract generation task and improving abstract quality.
In another embodiment, after S106, the summary generation task may be further optimized to improve the summary generation quality. Specifically, in the case of performing the digest generation task through the digest generation model, the test method for the digest generation task provided in the embodiment of the present application may further include: and optimizing the abstract generation model based on the performance test result of the abstract generation task.
In an embodiment, as shown in fig. 2, after S106, the testing method for a summary generating task provided in the embodiment of the present application may further include:
and S108, if the abstract generation task fails the test, determining an abnormal evaluation index from the multi-dimensional evaluation indexes based on the quality weight corresponding to the multi-dimensional evaluation index of the predicted abstract.
As an example, for each of the multi-dimensional evaluation indexes, if the quality weight of the prediction summary corresponding to the evaluation index is less than or equal to the preset test weight, the evaluation index is determined as an abnormal evaluation index.
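A one-function sketch of this selection rule (the threshold value and dictionary format are assumptions):

```python
def abnormal_indices(quality_weights, preset_test_weight=0.6):
    """Return the abnormal evaluation indices: those whose quality weight is
    less than or equal to the preset test weight."""
    return [name for name, w in quality_weights.items() if w <= preset_test_weight]

# e.g. abnormal_indices({"factuality": 0.9, "integrity": 0.5, "readability": 0.8})
# -> ["integrity"]
```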
S110, optimizing the abstract generation model based on an optimization strategy corresponding to the abnormal evaluation index.
As an example, if the abnormal evaluation index includes a semantic understanding index, updating the reference abstract based on a quality weight of the prediction abstract corresponding to the semantic understanding index, and performing optimization training on the abstract generation model based on the text data and the updated reference abstract.
For example, if the quality weight of the predicted abstract corresponding to any of the factuality, integrity, and readability semantic understanding indices is less than the preset test weight, the reference abstract can be updated, for example manually re-annotated, to ensure that it provides a more accurate supervision signal. On this basis, the text data in the abstract task data set is used as training samples and the updated reference abstract as labels, providing a more accurate supervision signal for the abstract generation model, and the abstract generation model is optimized through training; this can further improve the model's ability to summarize the text data and improve abstract generation quality.
As another example, if the anomaly evaluation index includes a structured constraint index, the summary task data set is extended, and the summary generation model is optimally trained based on the extended summary task data set.
For example, if the quality weight of the predicted abstract corresponding to any structured constraint index, such as the summary length, summary format, summary language, or summary topic, is less than the preset test weight, the abstract task data set can be extended to cover text data with more varied features and their reference abstracts. On this basis, the text data in the extended abstract task data set is used as training samples and their reference abstracts as labels, providing a more comprehensive supervision signal for the abstract generation model, and the abstract generation model is optimized through training; this can further improve the model's ability to summarize various kinds of text data and improve abstract generation quality.
According to the method for testing the abstract generation task, after the performance test result of the abstract generation task is obtained, the abstract generation model is optimized based on the test result, so that the performance of the abstract generation model can be improved, and the abstract generation quality is improved.
The following describes a testing method of a summary generating task provided in the embodiment of the present application, taking an office scenario as an example. Referring to fig. 3, a flow chart of a testing method of a summary generation task according to another embodiment of the present application is shown.
First, text data of various sub-scenarios in the office scenario are collected, and sensitive information in the collected text data is desensitized. The text data of the various sub-scenarios includes, but is not limited to, message data, document data, meeting data, and mail data. The message data comes from single-chat messages, group-chat messages, and the like; the document data comes from the office-domain knowledge base, weekly meeting documents, technical scheme documents, and the like; the meeting data comes from weekly meeting minutes, technical sharing meeting minutes, and the like; and the mail data comes from all-staff mails, notification mails, daily communication mails, and the like.
And then, manually labeling the text data to obtain a reference abstract of the text data, and executing an abstract generating task on the text data to obtain a predicted abstract of the text data.
Further, the collected text data, its reference abstract, and the predicted abstract are used as test data in the office scenario, and quality scoring is performed on each predicted abstract starting from the multi-dimensional evaluation indices, such as factuality, integrity, readability, and the structured constraint index, so as to obtain the quality weight of each predicted abstract corresponding to the multi-dimensional evaluation index.
And finally, integrating quality weights of all the prediction abstracts corresponding to the multidimensional evaluation indexes, and determining performance test results of the abstract generation task in the office scene.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Fig. 4 is a schematic structural diagram of a testing device for a task of generating a summary according to an embodiment of the present application. Referring to fig. 4, in a software embodiment, a testing apparatus 400 for a digest generation task may include an acquisition unit 410, a generation unit 420, and a testing unit 430.
An obtaining unit 410 is configured to obtain a summary task data set, where the summary task data set includes text data and a reference summary of the text data.
And the generating unit 420 is configured to perform a summary generating task on the text data, so as to obtain a predicted summary of the text data.
And a testing unit 430, configured to perform a performance test on the summary generating task based on the reference summary, the prediction summary, and a multi-dimensional evaluation index, where the multi-dimensional evaluation index includes a semantic understanding index and a structural constraint index.
In one embodiment, the test unit is specifically configured to:
determining a quality weight corresponding to the multi-dimensional evaluation index of the prediction abstract based on the reference abstract and a testing strategy corresponding to the multi-dimensional evaluation index;
and determining the performance weight of the abstract generating task based on the quality weight of the predicted abstract corresponding to the multi-dimensional evaluation index.
In one embodiment, the test strategy corresponding to the semantic understanding index includes determining a quality weight through a language model for testing;
the quality weight corresponding to the semantic understanding index of the prediction abstract is determined by the following method:
generating a first prompt text for describing the semantic understanding index;
and carrying out semantic scoring on the prediction abstract based on the first prompt text and the reference abstract by using the language model to obtain a quality weight corresponding to the prediction abstract in the semantic understanding index.
In an embodiment, the test policy corresponding to the structural constraint index includes a preset correspondence between index data of the structural constraint index and a quality weight;
the quality weight of the prediction abstract corresponding to the structural constraint index is determined by the following method:
generating a test script based on the structural constraint index and the preset corresponding relation;
and analyzing the prediction abstract by using the test script to obtain a quality weight value of the prediction abstract corresponding to the structural constraint index.
In one embodiment, the predicted abstract is obtained by performing an abstract generating task on the text data based on a preset second prompting text;
the apparatus 400 further comprises:
a parsing unit, used for parsing the second prompt text, before the testing unit determines the quality weight of the predicted abstract corresponding to the multi-dimensional evaluation index based on the reference abstract and the test strategy corresponding to the multi-dimensional evaluation index, so as to obtain the abstract constraint conditions contained in the second prompt text;
and the first determining unit is used for determining the structural constraint index and the preset corresponding relation based on the abstract constraint condition.
In one embodiment, the summary constraint includes at least one of the following: summary length conditions, summary format conditions, summary subject conditions, summary language conditions.
In one embodiment, the test unit is specifically configured to:
determining the weight of each evaluation index in the multi-dimensional evaluation index based on the minimum value in the quality weights corresponding to the multi-dimensional evaluation index;
and carrying out weighted summation on the quality weight corresponding to the multi-dimensional evaluation index on the prediction abstract based on the weight of each evaluation index to obtain the performance weight of the abstract generation task.
In an embodiment, the first weight of the evaluation index corresponding to the minimum value is greater than the second weight, where the first weight is a weight when the minimum value is less than or equal to a preset weight, and the second weight is a weight when the minimum value is greater than the preset weight.
In one embodiment, the generating unit is specifically configured to:
inputting a preset second prompt text and the text data into a summary generation model to generate a summary, and obtaining a predicted summary of the text data.
In one embodiment, the apparatus 400 further comprises:
a second determining unit, used for determining, after the testing unit performs a performance test on the summary generating task based on the reference summary, the prediction summary and the multi-dimensional evaluation index, an abnormal evaluation index from the multi-dimensional evaluation index based on the quality weight of the prediction summary corresponding to the multi-dimensional evaluation index, if the summary generating task fails the test;
and the optimizing unit is used for optimizing the abstract generating model based on an optimizing strategy corresponding to the abnormal evaluation index.
In one embodiment, the optimizing unit is specifically configured to:
if the abnormal evaluation index comprises a semantic understanding index, updating the reference abstract, and performing optimization training on the abstract generation model based on text data in the abstract task data set and the updated reference abstract; and/or,
if the abnormal evaluation index comprises a structural constraint index, expanding the abstract task data set, and carrying out optimization training on the abstract generation model based on the expanded abstract task data set.
Obviously, the testing device for the abstract generating task provided by the embodiment of the application can be used as an execution main body of the testing method for the abstract generating task shown in fig. 1, so that the function of the testing method for the abstract generating task in fig. 1 can be realized. The principle is the same and will not be described again.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 5, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, network interface, and memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture ) bus, a PCI (Peripheral Component Interconnect, peripheral component interconnect standard) bus, or EISA (Extended Industry Standard Architecture ) bus, among others. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 5, but not only one bus or type of bus.
And the memory is used for storing programs. In particular, the program may include program code including computer-operating instructions. The memory may include memory and non-volatile storage and provide instructions and data to the processor.
The processor reads the corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to form a testing device of the abstract generating task on the logic level. The processor is used for executing the programs stored in the memory and is specifically used for executing the following operations:
acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
executing a summary generation task on the text data to obtain a predicted summary of the text data;
and performing performance test on the abstract generating task based on the reference abstract, the prediction abstract and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
The method executed by the testing device for the abstract generation task disclosed in the embodiment shown in fig. 1 of the present application may be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logical blocks disclosed in the embodiments of the present application may be implemented or performed thereby. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and performs the steps of the above method in combination with its hardware.
The electronic device may also execute the method of fig. 1, and implement the functions of the test device for the task of generating the abstract in the embodiments shown in fig. 1, fig. 2, and fig. 3, which are not described herein again.
Of course, other implementations, such as a logic device or a combination of software and hardware, are not excluded for the electronic device of the present application; that is, the execution subject of the processing flow is not limited to individual logic units, but may also be hardware or a logic device.
The present embodiments also provide a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment of fig. 1, and in particular to:
acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
executing a summary generation task on the text data to obtain a predicted summary of the text data;
and performing performance test on the abstract generating task based on the reference abstract, the prediction abstract and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
In summary, the foregoing description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.

Claims (14)

1. The method for testing the abstract generation task is characterized by comprising the following steps of:
acquiring a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
executing a summary generation task on the text data to obtain a predicted summary of the text data;
and performing performance test on the abstract generating task based on the reference abstract, the prediction abstract and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
2. The method of claim 1, wherein performing a performance test on the summary generation task based on the reference summary, the predictive summary, and a multi-dimensional evaluation index comprises:
determining a quality weight corresponding to the multi-dimensional evaluation index of the prediction abstract based on the reference abstract and a testing strategy corresponding to the multi-dimensional evaluation index;
and determining the performance weight of the abstract generating task based on the quality weight of the predicted abstract corresponding to the multi-dimensional evaluation index.
3. The method of claim 2, wherein the test strategy corresponding to the semantic understanding index comprises determining a quality weight by a language model for testing;
the quality weight corresponding to the semantic understanding index of the prediction abstract is determined by the following method:
generating a first prompt text for describing the semantic understanding index;
and carrying out semantic scoring on the prediction abstract based on the first prompt text and the reference abstract by using the language model to obtain a quality weight corresponding to the prediction abstract in the semantic understanding index.
4. The method of claim 2, wherein the test policy corresponding to the structured constraint index comprises a preset correspondence between index data and quality weights of the structured constraint index;
the quality weight of the prediction abstract corresponding to the structural constraint index is determined by the following method:
generating a test script based on the structural constraint index and the preset corresponding relation;
and analyzing the prediction abstract by using the test script to obtain a quality weight value of the prediction abstract corresponding to the structural constraint index.
5. The method of claim 4, wherein the predicted summary is obtained by performing the summary generation task on the text data based on a preset second prompt text;
and before determining the quality weight for the multi-dimensional evaluation index, the method further comprises:
parsing the second prompt text to obtain a summary constraint condition contained in the second prompt text;
and determining the structural constraint index and the preset correspondence based on the summary constraint condition.
6. The method of claim 5, wherein the summary constraint condition comprises at least one of: a summary length condition, a summary format condition, a summary subject condition, and a summary language condition.
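A rough sketch of the prompt parsing in claims 5 and 6; the regular expressions are assumptions, since the claims do not fix how the second prompt text phrases its constraint conditions.

```python
import re

def constraints_from_prompt(second_prompt: str) -> dict:
    """Extract summary constraint conditions from the second prompt text.

    The patterns below are assumptions; real prompts would dictate their
    own parsing.
    """
    constraints = {}
    m = re.search(r"within (\d+) (?:words|characters)", second_prompt)
    if m:
        constraints["max_length"] = int(m.group(1))  # summary length condition
    if "bullet" in second_prompt.lower():
        constraints["format"] = "bullet"             # summary format condition
    m = re.search(r"\bin (English|Chinese)\b", second_prompt)
    if m:
        constraints["language"] = m.group(1)         # summary language condition
    return constraints
```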
7. The method of claim 2, wherein determining the performance weight of the summary generation task based on the quality weight of the predicted summary for the multi-dimensional evaluation index comprises:
determining a weight of each evaluation index in the multi-dimensional evaluation index based on the minimum value among the quality weights for the multi-dimensional evaluation index;
and performing a weighted summation of the quality weights of the predicted summary for the multi-dimensional evaluation index based on the weight of each evaluation index to obtain the performance weight of the summary generation task.
8. The method of claim 7, wherein a first weight of the evaluation index corresponding to the minimum value is greater than a second weight, the first weight being the weight applied when the minimum value is less than or equal to a preset threshold, and the second weight being the weight applied when the minimum value is greater than the preset threshold.
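A sketch of the weighted summation in claims 7 and 8, under assumed values for the preset threshold and the first and second weights; the remaining weight is split evenly here, which is one possible design rather than the claimed one.

```python
def performance_weight(quality, threshold=0.6, first_w=0.7, second_w=0.4):
    """Weighted summation over the quality weights (claims 7 and 8).

    quality: dict mapping index name -> quality weight in [0, 1].
    threshold, first_w, second_w are assumed values: the index holding the
    minimum quality weight gets the larger first weight when that minimum
    is <= the threshold, otherwise the smaller second weight.
    """
    worst = min(quality, key=quality.get)
    w_worst = first_w if quality[worst] <= threshold else second_w
    others = [k for k in quality if k != worst]
    weights = {worst: w_worst}
    # Split the remaining weight evenly over the other evaluation indexes.
    weights.update({k: (1.0 - w_worst) / len(others) for k in others})
    return sum(weights[k] * quality[k] for k in quality)
```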
9. The method of any one of claims 1 to 8, wherein executing the summary generation task on the text data to obtain the predicted summary of the text data comprises:
inputting a preset second prompt text and the text data into a summary generation model for summary generation to obtain the predicted summary of the text data.
10. The method of claim 9, wherein, after performing the performance test on the summary generation task based on the reference summary, the predicted summary, and the multi-dimensional evaluation index, the method further comprises:
if the summary generation task fails the test, determining an abnormal evaluation index from the multi-dimensional evaluation index based on the quality weight of the predicted summary for the multi-dimensional evaluation index;
and optimizing the summary generation model based on an optimization strategy corresponding to the abnormal evaluation index.
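A minimal sketch of flagging abnormal evaluation indexes per claim 10, assuming a simple per-index cut-off; the pass_threshold value is an assumption.

```python
def abnormal_indexes(quality, pass_threshold=0.6):
    """When the task fails the test, flag abnormal evaluation indexes.

    pass_threshold is an assumed cut-off: any index whose quality weight
    falls below it counts as abnormal.
    """
    return [name for name, w in quality.items() if w < pass_threshold]
```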
11. The method of claim 10, wherein optimizing the summary generation model based on the optimization strategy corresponding to the abnormal evaluation index comprises:
if the abnormal evaluation index comprises the semantic understanding index, updating the reference summary, and performing optimization training on the summary generation model based on the text data in the summary task data set and the updated reference summary; and/or
if the abnormal evaluation index comprises the structural constraint index, expanding the summary task data set, and performing optimization training on the summary generation model based on the expanded summary task data set.
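A sketch of the strategy dispatch in claim 11; the update_references, expand_dataset, and retrain callbacks stand in for concrete data-curation and training steps that the claim leaves open.

```python
def optimize_model(model, dataset, anomalies,
                   update_references, expand_dataset, retrain):
    """Dispatch an optimization strategy per abnormal evaluation index.

    update_references, expand_dataset, and retrain are assumed callbacks
    standing in for the concrete data-curation and training steps.
    """
    if "semantic" in anomalies:
        # Refresh the reference summaries, then retrain on the same texts.
        dataset = update_references(dataset)
        model = retrain(model, dataset)
    if "structural" in anomalies:
        # Expand the summary task data set, then retrain on the expanded set.
        dataset = expand_dataset(dataset)
        model = retrain(model, dataset)
    return model
```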
12. A device for testing a summary generation task, comprising:
an acquisition unit configured to acquire a summary task data set, wherein the summary task data set comprises text data and a reference summary of the text data;
a generation unit configured to execute a summary generation task on the text data to obtain a predicted summary of the text data;
and a test unit configured to perform a performance test on the summary generation task based on the reference summary, the predicted summary, and a multi-dimensional evaluation index, wherein the multi-dimensional evaluation index comprises a semantic understanding index and a structural constraint index.
13. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the method for testing a summary generation task according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method for testing a summary generation task according to any one of claims 1 to 11.
CN202311788743.5A 2023-12-22 2023-12-22 Method, device, equipment and storage medium for testing abstract generation task Pending CN117762795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311788743.5A CN117762795A (en) 2023-12-22 2023-12-22 Method, device, equipment and storage medium for testing abstract generation task

Publications (1)

Publication Number Publication Date
CN117762795A 2024-03-26

Family

ID=90317785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311788743.5A Pending CN117762795A (en) 2023-12-22 2023-12-22 Method, device, equipment and storage medium for testing abstract generation task

Country Status (1)

Country Link
CN (1) CN117762795A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093831A (en) * 2024-04-15 2024-05-28 清华大学 Text evaluation benchmark construction method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination