CN116483991A - Dialogue abstract generation method and system - Google Patents

Dialogue abstract generation method and system

Info

Publication number
CN116483991A
CN116483991A (application CN202310455478.2A)
Authority
CN
China
Prior art keywords
dialogue
information
abstract
words
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310455478.2A
Other languages
Chinese (zh)
Inventor
金彦亮
臧庆福
高塬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202310455478.2A priority Critical patent/CN116483991A/en
Publication of CN116483991A publication Critical patent/CN116483991A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a dialogue abstract generation method and system, belonging to the field of automatic summarization in natural language processing. To address the problems of existing abstract generation methods, such as difficulty in focusing on the key information in a dialogue and the introduction of too much useless dialogue information, the proposed method comprises the following steps: preprocessing the dialogue; converting the dialogue corpus into a dialogue text sequence; extracting multi-granularity semantic features of the dialogue; constructing a multi-feature fusion filtering module to filter the dialogue; generating a summary based on the pre-trained language model BART and a hierarchical Transformer model; and searching for the optimal dialogue abstract with a beam search algorithm over the output of the abstract generator. The method can filter useless information in the dialogue according to the keywords, topics, sentences and the full dialogue text, and fuses the keyword information into the abstract generation process through the hierarchical Transformer, so that the generated abstract contains more important information.

Description

Dialogue abstract generation method and system
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a dialogue abstract generation method and system combining multi-granularity semantic features and hierarchical Transformers.
Background
The purpose of the dialogue summarization task is to condense the most important information in a conversation into a shorter text, helping people quickly capture the key points of a multi-participant dialogue without reviewing its complex context. As an important emerging task, dialogue summarization offers strong application potential in many scenarios, such as court debates in civil trials, service conversations between customer-service agents and users, and teleconferences with multiple participants. However, dialogues are characterized by dispersed key information, abundant colloquial content, multiple topics and non-standard formats; these characteristics require the summarization method to have strong semantic feature extraction and semantic understanding capabilities, posing great challenges to dialogue summarization research.
In recent years, some research has been carried out on the dialogue summarization task, mostly by building prior knowledge and exploiting the semantic features of the dialogue to improve summary quality. However, most of these methods require manual annotation of additional dialogue features, adding extra effort to dialogue summarization research. Some progress has been made in extracting dialogue semantic features with unsupervised methods, and features such as keyword and topic information extracted in this way have been shown to improve dialogue summary generation; however, these methods do not consider the large amount of useless information in the original dialogue. In addition, generated summaries often fail to fully cover the important information and frequently contain irrelevant or redundant words, which remains an open problem in current dialogue summarization research.
Chinese patent application CN114372140A discloses a hierarchical conference summary generation method, which suffers from the following problems: additional dialogue-act labels must be manually annotated, the semantic features of the dialogue are not effectively utilized, the accuracy of summary generation is difficult to guarantee, and useless information in the dialogue is introduced into the summary generation process.
Disclosure of Invention
The invention aims to solve the problems of existing abstract generation methods, such as difficulty in focusing on the key information in a dialogue and the introduction of too much useless dialogue information, and provides a dialogue abstract generation method and system combining multi-granularity semantic features and hierarchical Transformers.
The aim of the invention can be achieved by the following technical scheme:
as a first aspect of the present invention, there is provided a dialogue abstract generation method comprising the following steps:
performing dialogue preprocessing, handling noise words in the dialogue, and converting the dialogue into a text sequence;
acquiring dialogue context representation, and encoding the preprocessed dialogue text sequence into word vector representation;
extracting multi-granularity semantic features, and respectively extracting semantic features in a dialogue by using an unsupervised method, wherein the semantic features comprise keyword information, topic information, sentence information and full-text information;
multi-feature fusion filtering, fusing the multiple semantic features and filtering useless information in the dialogue;
and abstract generation, namely generating a dialogue abstract from the dialogue and keyword information after multi-feature fusion filtering.
Further, the specific process of the dialogue preprocessing comprises: obtaining a data set, performing Chinese word segmentation on it with a word segmenter, removing meaningless noise words, unifying the symbols in the dialogue, converting the dialogue into a text sequence the model can process, and formatting the dialogue text.
Further, the obtaining the dialogue context specifically includes: the input dialog text is encoded using a pre-trained language model BART as an encoder to obtain a matrix vector representation of the dynamic dialog.
Further, the specific step of extracting the keyword information includes:
scoring the importance of the words by combining the semantic information of the words, the occurrence frequency of the words, the part of speech of the words and the information entropy of the words, and obtaining keywords of the conversation based on the scoring result;
W_i^score = W_i^pos × (w_s × W_i^s + w_f × W_i^f + w_loss × W_i^loss)
wherein W_i^score denotes the score of the i-th word, W_i^pos the part of speech of the i-th word, W_i^s the semantic information of the i-th word, W_i^f the occurrence frequency of the i-th word, W_i^loss the information entropy of the i-th word, and w_s, w_f, w_loss the contribution degrees of the semantic information, occurrence frequency and information entropy, respectively.
Further, the specific steps of extracting the theme information include:
obtaining sentence vectors of each dialogue in the dialogue by using a Chinese pre-training dialogue model;
computing, with a topic similarity algorithm, the similarity between the sentence vectors of two adjacent dialogue rounds to obtain a semantic similarity score between the two rounds;
and setting a score threshold value, classifying two rounds of conversations with scores exceeding the threshold value into the same class of topics, and classifying adjacent conversations with the same topic into a topic block.
Further, the specific steps of extracting the sentence information comprise:
inserting special symbols into the input text sequence and altering the format of the dialogue sequence, specifically adding the special symbols [CLS] and [SEP] at the beginning and end of each dialogue round in the sequence;
using the pre-trained model BART to encode the input text, and choosing the vector representation at the [CLS] position as the vector of the dialogue sentence.
Further, the specific steps of full text information extraction include:
encoding the dialogue sentence vectors with a bidirectional long short-term memory network BiLSTM; the resulting output is the vector representation of the full dialogue text.
Further, the multi-feature fusion filtering comprises two processes of multi-feature fusion and filtering;
the multi-feature fusion specifically comprises: calculating the semantic relevance of the keywords, topics, sentences and full text to each dialogue word with a multi-head attention mechanism, and dynamically fusing the relevance scores through learnable weights;
the filtering mechanism specifically comprises: quantizing the fusion result into a value between 0 and 1 through a Sigmoid function; and constructing a filtering unit based on a gating mechanism, which screens and filters the information in the dialogue according to the quantized result.
Further, the specific steps of abstract generation comprise:
A pre-training language model BART is adopted as a dialogue encoder, a convolutional neural network is adopted as a keyword encoder, and a hierarchical Transformer network is adopted as a decoder;
the decoder receives the filtered dialogue information and keyword information, and guides the generation of the abstract by using the keywords while filtering useless information in the dialogue;
the output of the decoder is a candidate digest, and the best digest is selected from the plurality of candidate digests as a generated digest.
As a second aspect of the present invention, there is provided a dialogue abstract generation system, the system comprising:
the dialogue preprocessing module is responsible for processing nonsensical noise words in the dialogue and converting the dialogue into a text sequence form;
a dialog context representation module that generates a word vector representation of a dialog using a BART model;
the keyword information extraction module is used for scoring the importance of the words according to semantic information of the words, occurrence frequency of the words, part of speech of the words and information entropy of the words, and obtaining keywords of the conversation based on the scoring result;
the topic information extraction module, which calculates the vector representation of each dialogue round, computes the semantic similarity between adjacent rounds, and groups semantically similar rounds into one topic, obtaining the topic blocks containing the topic information of the dialogue;
the sentence information extraction module inserts special symbols between the input text sequences, alters the format of the dialogue sequences, encodes the input text by using a pre-training model BART, and acquires the vector of the dialogue sentence;
the full-text information extraction module, which encodes the dialogue sentence vectors with BiLSTM and outputs the vector representation of the full dialogue text;
the multi-feature fusion filtering module is used for filtering useless information in the dialogue by combining keyword information, theme information, statement information and full-text information, and enhancing feature representation of the dialogue;
and the abstract generating module is used for generating a dialogue abstract according to the filtered dialogue and keyword information.
Compared with the prior art, the invention has the following beneficial effects:
1) The method provided by the invention can automatically extract semantic features such as keywords, topics, sentences, full-text information and the like of the dialogue, does not need additional manual labeling, and greatly reduces the workload of extracting the semantic features;
2) The convolutional neural network and the BART model are used for respectively encoding the keywords and the dialogue context, so that more accurate keyword representation and dialogue representation with rich semantic information can be generated, and the semantic understanding capability of the model is greatly improved;
3) Words are filtered according to the semantic relevance between semantic features of different granularities and the dialogue words; different semantic features are dynamically fused through learnable weights, and their complementary information is combined to selectively filter or retain dialogue words, so that the model learns to recognize and filter useless information during training, effectively reducing its interference with the model;
4) A hierarchical structure is introduced to improve the conventional Transformer model, so that the Transformer-based abstract generator can further focus on the important information in the dialogue on the basis of understanding the whole text, making the information coverage of the generated abstract more complete and ensuring its accuracy.
In summary, compared with the prior art, the invention can obviously reduce the workload of semantic feature extraction, effectively weaken the interference of useless information in the dialogue on the model, and greatly improve the quality of abstract generation.
Drawings
FIG. 1 is a flow chart of one embodiment of a method for generating a dialog abstract according to the present invention;
FIG. 2 is a flow chart of Chinese dialogue preprocessing according to one embodiment of the present invention;
FIG. 3 is a schematic representation of multi-granularity semantic feature extraction according to one embodiment of the present invention;
FIG. 4 is a diagram of a multi-feature fusion filtering mechanism according to one embodiment of the present invention;
FIG. 5 is a diagram of a model structure incorporating multiple features and hierarchies in accordance with one embodiment of the present invention;
FIG. 6 is a flowchart of a Beam Search algorithm searching for the best abstract according to one embodiment of the invention;
FIG. 7 is a graph of ROUGE index results of the comparative method of the present invention on a CSDS dataset.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
Example 1
Referring to fig. 1, a flow chart of a dialogue abstract generation method combining multi-granularity semantic features and hierarchical Transformers according to an embodiment of the invention, which specifically includes the following steps:
performing dialogue preprocessing, handling noise words in the dialogue, and converting the dialogue into a text sequence;
acquiring dialogue context representation, and encoding the preprocessed dialogue text sequence into word vector representation;
extracting multi-granularity semantic features, and respectively extracting semantic features in a dialogue by using an unsupervised method, wherein the semantic features comprise keyword information, topic information, sentence information and full-text information;
multi-feature fusion filtering, fusing the multiple semantic features and filtering useless information in the dialogue;
and abstract generation, namely generating a dialogue abstract from the dialogue and keyword information after multi-feature fusion filtering.
Referring to fig. 2, a schematic diagram of the Chinese dialogue preprocessing flow according to an embodiment of the present invention, which mainly includes the steps of Chinese word segmentation, deleting nonsensical words according to a stop-word list, unifying punctuation marks in the dialogue, and converting into a text sequence. Specifically, a Chinese dialogue is first obtained; the jieba word segmenter is then used to segment the Chinese dialogue text; the stop-word list is consulted to find and remove stop words in the dialogue; characters in the dialogue are replaced with unified characters by string matching and regular-expression matching; finally, the processed dialogue is concatenated and converted into a text sequence of the form D = {u_1, u_2, …, u_m} to match the input format of the model, where D denotes the formatted input text sequence, u_i the i-th dialogue sentence, and m the number of dialogue rounds in the dialogue.
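The preprocessing flow above can be sketched as follows. This is a minimal stand-in: real segmentation would use the jieba tokenizer and a full Chinese stop-word list, whereas the `STOP_WORDS` set, `PUNCT_MAP` table and punctuation-based splitting here are illustrative assumptions.

```python
import re

# Illustrative stop-word list and punctuation map (a real system would use
# jieba segmentation and a complete Chinese stop-word list).
STOP_WORDS = {"嗯", "啊", "的", "了"}
PUNCT_MAP = {"，": ",", "。": ".", "？": "?", "！": "!"}

def preprocess_dialogue(turns):
    """Remove noise words, unify punctuation, and format each dialogue
    round for the text sequence D = {u_1, ..., u_m}."""
    cleaned = []
    for turn in turns:
        # unify full-width punctuation to a single convention
        for zh, en in PUNCT_MAP.items():
            turn = turn.replace(zh, en)
        # toy "word segmentation": split on punctuation/whitespace (jieba stand-in)
        tokens = [t for t in re.split(r"[,.?!\s]+", turn) if t]
        tokens = [t for t in tokens if t not in STOP_WORDS]
        cleaned.append(" ".join(tokens))
    return cleaned
```

In practice the cleaned rounds would then be concatenated into one model input sequence.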
Referring to fig. 3, a schematic diagram of multi-granularity semantic feature extraction according to an embodiment of the present invention mainly includes four parts, namely keyword information extraction, topic information extraction, sentence information extraction, and full text information extraction.
Specifically, for keyword information extraction, the invention selects keywords from the dialogue from multiple angles, including the semantic information of words, their occurrence frequency, their part of speech, and their information entropy. More specifically, the TextRank algorithm based on word co-occurrence is used to obtain the semantic information of words, the TF-IDF algorithm based on word frequency to obtain their occurrence frequency, the part-of-speech tagger in the Chinese natural language processing toolkit HanLP to obtain their part of speech, and the Chinese pre-trained dialogue model CDial-GPT to compute the information entropy of each word. The formula W_i^score = W_i^pos × (w_s × W_i^s + w_f × W_i^f + w_loss × W_i^loss) then combines these pieces of information to select the keywords of the dialogue, where W_i^score denotes the score of the i-th word, W_i^pos its part of speech, W_i^s its semantic information, W_i^f its occurrence frequency, W_i^loss its information entropy, and w_s, w_f, w_loss the contribution degrees of the semantic information, occurrence frequency and information entropy, respectively.
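The scoring formula can be exercised directly once the component values are available. In this sketch the TextRank, TF-IDF and entropy values are supplied as plain numbers rather than computed, and the default contribution weights are illustrative assumptions.

```python
def keyword_scores(words, w_s=0.4, w_f=0.3, w_loss=0.3):
    """Score each word by W_score = W_pos * (w_s*W_s + w_f*W_f + w_loss*W_loss).
    Each entry of `words` is (word, pos_weight, semantic, freq, entropy);
    in the patent these components come from TextRank, TF-IDF, HanLP and
    CDial-GPT, but here they are passed in directly."""
    return {
        w: pos * (w_s * sem + w_f * freq + w_loss * ent)
        for (w, pos, sem, freq, ent) in words
    }
```

The highest-scoring words would then be taken as the dialogue keywords.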
Specifically, for topic information extraction, the invention first uses the Chinese pre-trained dialogue model CDial-GPT to obtain the vector representation of each dialogue round, i.e. the sentence vector. The classical C99 semantic-similarity algorithm then computes similarity scores between adjacent rounds from their sentence vectors; a similarity threshold is set, and two adjacent sentences whose similarity score exceeds the threshold are regarded as belonging to the same topic. Finally, adjacent sentences of the same topic are grouped into a topic block. More specifically, [CLS] and [SEP] marks are inserted at the beginning and end of each topic block to form a topic sequence, which is encoded with the pre-trained language model BART to obtain the topic information.
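The threshold-based grouping of adjacent rounds into topic blocks can be sketched with plain cosine similarity standing in for the C99 algorithm; the toy sentence vectors and threshold value are assumptions.

```python
import math

def segment_topics(sent_vecs, threshold=0.6):
    """Group adjacent dialogue rounds into topic blocks: consecutive rounds
    whose cosine similarity exceeds `threshold` share a topic (the patent
    uses the C99 algorithm; plain cosine similarity is a stand-in)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    blocks, current = [], [0]
    for i in range(1, len(sent_vecs)):
        if cos(sent_vecs[i - 1], sent_vecs[i]) > threshold:
            current.append(i)  # same topic as the previous round
        else:
            blocks.append(current)  # topic boundary
            current = [i]
    blocks.append(current)
    return blocks  # lists of round indices, one list per topic block
```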
Specifically, for sentence information extraction, the invention first inserts the special symbols [CLS] and [SEP] into the input text sequence, converting the dialogue into the form D = {[CLS] u_1 [SEP], [CLS] u_2 [SEP], …, [CLS] u_m [SEP]}. The BART model then encodes the input text to obtain its contextual vector representation, and finally the vector representation at each [CLS] position is selected as the vector of the corresponding dialogue sentence, thereby obtaining a vector representation for each dialogue round.
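The reformatting step can be sketched as follows; the BART encoding that follows it is omitted, so only the construction of the marked sequence is shown.

```python
def format_for_sentence_vectors(turns):
    """Wrap each dialogue round u_i as "[CLS] u_i [SEP]". A BART-style
    encoder would then be run over the sequence, and the hidden state at
    each [CLS] position taken as that round's sentence vector."""
    return " ".join(f"[CLS] {u} [SEP]" for u in turns)
```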
Specifically, for full-text information extraction, the invention uses a bidirectional long short-term memory network BiLSTM to aggregate the sentence information of the dialogue. BiLSTM encodes the dialogue sentences both front-to-back and back-to-front, avoiding the loss of context information caused by encoding in a single direction; its output is the vector representation of the full-text information.
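The bidirectional aggregation can be illustrated with a toy linear recurrence in place of real LSTM cells; a practical implementation would use e.g. `torch.nn.LSTM(bidirectional=True)`, and the decay weight `w` here is an arbitrary assumption.

```python
def bi_rnn_fulltext(sent_vecs, w=0.5):
    """Toy stand-in for BiLSTM full-text encoding: a linear recurrence run
    left-to-right and right-to-left over the sentence vectors, with the two
    final states concatenated as the full-text representation."""
    def run(vecs):
        h = [0.0] * len(vecs[0])
        for v in vecs:
            # h <- w*h + (1-w)*v : a one-line surrogate for an LSTM cell
            h = [w * hi + (1 - w) * vi for hi, vi in zip(h, v)]
        return h

    # forward pass + backward pass, concatenated
    return run(sent_vecs) + run(sent_vecs[::-1])
```

The point the sketch preserves is that the output depends on the sentence order in both directions, as with a real BiLSTM.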
Referring to fig. 4, a schematic diagram of multi-feature fusion filtering provided by an embodiment of the present invention, which mainly includes two parts: multi-feature fusion and sentence filtering. Specifically, multi-feature fusion uses a multi-head attention mechanism to calculate the semantic relevance between each word of the input dialogue and the dialogue keywords, topics, sentences and full text; the relevance of a word to each feature is then considered jointly, the relevance scores are dynamically fused through learnable weights, and an importance score for each word in the dialogue is obtained from the multiple semantic features. The fusion result is quantized to a value between 0 and 1 through a Sigmoid function, which evaluates the importance of each word: the closer to 0, the less important; the closer to 1, the more important. A filtering unit based on a gating mechanism then screens and filters the information in the dialogue according to the quantized result, selectively enhancing or weakening the vector representations of certain words according to their importance, thereby filtering the useless information in the dialogue.
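The fuse-then-gate step can be sketched with scalar relevance scores standing in for the multi-head attention outputs; the fusion weights would be learned in practice, and the values below are assumptions.

```python
import math

def fuse_and_filter(word_vec, relevance, weights):
    """Fuse per-feature relevance scores (keyword, topic, sentence,
    full-text order) with weights, squash the result to (0,1) with a
    Sigmoid, and gate the word vector by that importance value."""
    fused = sum(w * r for w, r in zip(weights, relevance))
    gate = 1.0 / (1.0 + math.exp(-fused))  # Sigmoid: 0 = drop, 1 = keep
    return [gate * x for x in word_vec], gate
```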
Referring to fig. 5, a structure diagram of a dialogue summary model combining multiple features and a hierarchical Transformer provided by an embodiment of the present invention. Specifically, the invention uses the BART model as an encoder to encode the topic sequence and the input dialogue sequence respectively, obtaining the topic information representation and the dialogue context representation; the sentence information representation is then extracted from the dialogue context representation, and the sentence information is encoded by BiLSTM to obtain the full-text information representation. The keywords are encoded with a convolutional neural network to obtain the keyword information representation. Multi-feature fusion filtering is performed on the obtained keyword, topic, sentence and full-text information representations, and the input dialogue is filtered to remove useless information and enhance the feature representation of the dialogue. A hierarchical Transformer decoder receives both the keyword information and the filtered dialogue information. The decoder contains two cross-attention modules: at the first it attends to the entire dialogue text and generates a corresponding representation, and at the second it attends to the keyword information, guiding the decoder to focus on the important parts of the original dialogue. Ideally, the focuses of the two cross-attention modules are complementary: the first attends to the global information of the dialogue and the second to its detailed information.
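The two-stage cross-attention of the hierarchical decoder can be illustrated with single-head, single-query dot-product attention; projections, residual connections, multi-head splitting and layer normalization are omitted, and the toy memories are assumptions.

```python
import math

def attention(query, keys, values):
    """Single-query dot-product attention: softmax over the keys, then a
    probability-weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    return [sum(p * v[i] for p, v in zip(probs, values))
            for i in range(len(values[0]))]

def hierarchical_decode_step(dec_state, dialog_mem, keyword_mem):
    """One decoder step of the hierarchical scheme: cross-attend to the
    filtered dialogue first (global view), then use the result to
    cross-attend to the keyword memory (detail view)."""
    ctx = attention(dec_state, dialog_mem, dialog_mem)
    return attention(ctx, keyword_mem, keyword_mem)
```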
Finally, the output vector of the hierarchical Transformer decoder is passed through a linear mapping and a Softmax to obtain the probability distribution over generated words, and the word at the maximum-probability position is the predicted word at the current step. The invention trains the model by minimizing the cross-entropy loss function and evaluates summary quality with the ROUGE metric.
Referring to fig. 6, a flowchart of searching for the best abstract with the Beam Search algorithm provided by an embodiment of the present invention. Specifically, beam search keeps several highest-probability words as candidates at each step and finally outputs the highest-probability sequence as the abstract. More specifically, the beam search algorithm first fixes a beam size K, the number of highest-probability candidates kept at each step. The search mainly comprises two steps: 1) at each output step, the K highest-probability words of the previous step are extended into new sequences, which are stored in a candidate set; 2) the K sequences with the highest current probability are selected from the candidate set as the output of the current step. These two steps repeat until the output sequences meet the stop condition.
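The two-step loop above can be sketched as a toy beam search; `step_fn` is a hypothetical next-token distribution standing in for the decoder's Softmax output, and the end-of-sequence token name is an assumption.

```python
import math

def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Toy beam search: `step_fn(prefix)` returns {token: prob} for the
    next token; keep the `beam_size` highest-log-prob sequences each step
    until every kept sequence ends with `eos` (the stop condition)."""
    beams = [([], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, lp))  # finished beams carry over
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], lp + math.log(p)))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
        if all(s and s[-1] == eos for s, _ in beams):
            break
    return beams[0][0]  # highest-probability sequence
```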
Referring to FIG. 7, a ROUGE line chart of the compared methods tested on the CSDS dataset. The CSDS dataset used in the present invention collects over ten thousand Chinese dialogues; each dialogue contains 321.92 words on average, and each reference abstract contains 83.21 words on average. Several models are compared on the CSDS dataset, and the experimental results show that the proposed method effectively improves the quality of dialogue abstract generation and produces abstracts that better fit the original text.
In summary, the invention provides a dialogue summarization method combining multi-granularity semantic features and hierarchical Transformers. Comparative experiments against several dialogue summarization models on the CSDS dataset show that the invention effectively improves summary generation and addresses the problems that existing summarization methods cannot gather the key information of a dialogue and that generated summaries contain useless or redundant information. The invention is oriented to practical problems in daily life, performs well, and has strong social and economic value and a bright application prospect.
Example 2
As another embodiment of the present invention, a dialogue abstract generation system combining multi-granularity semantic features and hierarchical Transformers is provided, configured to implement the dialogue abstract generation method of the foregoing embodiment, and mainly comprising: a dialogue preprocessing module, a dialogue context encoding module, a keyword information extraction module, a topic information extraction module, a sentence information extraction module, a full-text information extraction module, a multi-feature fusion filtering module and an abstract generation module.
The dialogue preprocessing module is used for the Chinese dataset CSDS: for a Chinese dialogue, word segmentation is performed, noise words are filtered out, and the dialogue is formatted into a text sequence to match the input format of the model.
The dialog context representation module is constructed by adopting a pre-training language model BART and is used for encoding the preprocessed dialog text sequence into a vector representation rich in context information and providing the vector representation to the model for subsequent calculation.
The keyword information extraction module is used for extracting keywords in the dialogue, scoring the importance of the words from multiple angles by using an unsupervised extraction method, and selecting the words with the highest scores as the keywords of the dialogue according to the scoring result.
The topic information extraction module is used for extracting the topic information of a dialogue: dialogue sentence vectors are obtained through the pre-trained dialogue model CDial-GPT, the topic similarity of two adjacent sentences is computed with the C99 algorithm, and sentences with similar topics are grouped into one topic, thereby obtaining the topic information of the dialogue.
The sentence information extraction module is used for extracting the information of each round of dialogue: the format of the dialogue sequence is changed by adding the special symbols [CLS] and [SEP] at the beginning and end of each round of dialogue, and a context vector representation is obtained by encoding the dialogue sequence with the BART model, wherein the vector representation at the [CLS] position is the sentence vector, thereby obtaining the sentence information of the dialogue.
The full text information extraction module is used for extracting a representation of the whole dialogue: a bi-directional long short-term memory model BiLSTM (Bi-directional Long Short-Term Memory) is used to aggregate the context sentence vectors, thereby obtaining the full text information of the dialogue.
The multi-feature fusion filtering module is used for fusing semantic features with multiple granularities and filtering useless information in an original dialogue according to semantic information with different angles.
The abstract generation module is used for receiving the keyword information and the filtered dialogue information and completing the generation of the dialogue abstract. Following the pre-training/fine-tuning paradigm, the network parameters of the model are fine-tuned and continuously updated until they reach an optimal value, and the model with the optimal parameters is used to generate the best dialogue abstract.
The Chinese CSDS data set is a public and open research data resource which is easy to obtain from the Internet.
For dialogue data formatting, the jieba tokenizer is used to segment the dialogue into words, Python code is written to filter noise words in the dialogue, symbols in the dialogue are unified, and the dialogue text is converted into the form D = {u_1, u_2, …, u_m} to match the input format of the model, where D denotes the formatted input text sequence, u_i denotes the i-th round of dialogue sentence, and m denotes the number of dialogue rounds in a dialogue.
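A minimal sketch of this formatting step follows. The noise-word list is a hypothetical placeholder, and a whitespace split stands in for jieba's Chinese word segmentation, so this only illustrates the filter-and-format flow:

```python
import re

# Hypothetical noise-word list; the real system filters meaningless
# Chinese filler words identified during preprocessing.
NOISE_WORDS = {"um", "uh", "hmm"}

def format_dialogue(turns):
    """Format a list of dialogue turns into the sequence D = {u_1, ..., u_m}.

    Each turn has its punctuation unified, is split into tokens (a stand-in
    for jieba segmentation), and has noise words removed.
    """
    formatted = []
    for turn in turns:
        # Unify symbols: map full-width punctuation to half-width (one plausible rule).
        turn = turn.replace("，", ",").replace("。", ".")
        tokens = [t for t in re.split(r"\s+", turn.strip())
                  if t and t.lower() not in NOISE_WORDS]
        formatted.append(" ".join(tokens))
    return formatted
```

For example, `format_dialogue(["um hello world"])` drops the filler token and yields `["hello world"]`, one cleaned element u_i per dialogue round.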
The dialog context representation module is used for encoding the input dialog sequence into a vector representation rich in context information, and the context representation of the dialog is obtained by utilizing the strong context feature extraction capability of the BART model.
The keyword information extraction module is used for extracting keywords in a dialogue from a plurality of angles, and specifically comprises the following steps: keywords of the dialogue are selected by combining the semantic information of the words, the occurrence frequency of the words, the part of speech of the words and the information entropy of the words. The semantic information of the words is obtained by the TextRank algorithm; the occurrence frequency of the words is obtained by the TF-IDF algorithm; the part of speech of the words is obtained by the natural language processing toolkit HanLP; the information entropy of the words is obtained by the pre-trained dialogue model CDial-GPT. The keywords of the dialogue are selected by combining these word-level signals through the formula W_i^score = W_i^pos × (w_s × W_i^s + w_f × W_i^f + w_loss × W_i^loss), wherein W_i^score denotes the score of the i-th word, W_i^pos denotes the part-of-speech weight of the i-th word, W_i^s denotes the semantic information of the i-th word, W_i^f denotes the occurrence frequency of the i-th word, W_i^loss denotes the information entropy of the i-th word, and w_s, w_f and w_loss respectively denote the contribution degrees of the semantic information, the occurrence frequency and the information entropy.
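The scoring formula can be sketched as follows. The per-word signal values and the contribution weights w_s, w_f, w_loss are illustrative placeholders, since in the described system they come from TextRank, TF-IDF, HanLP and CDial-GPT respectively:

```python
def keyword_scores(words, pos_weight, semantic, freq, entropy,
                   w_s=0.4, w_f=0.3, w_loss=0.3):
    """Rank words by W_i^score = W_i^pos * (w_s*W_i^s + w_f*W_i^f + w_loss*W_i^loss).

    All per-word scores are assumed precomputed and normalized; the weights
    w_s, w_f, w_loss (contribution degrees) are illustrative defaults.
    Returns the words sorted by score, highest first.
    """
    scores = {}
    for w in words:
        scores[w] = pos_weight[w] * (
            w_s * semantic[w] + w_f * freq[w] + w_loss * entropy[w]
        )
    return sorted(scores, key=scores.get, reverse=True)
```

With invented inputs, a content word with a high part-of-speech weight (e.g. a noun) outranks a function word even when their frequencies are similar, which is the intended effect of the multiplicative W_i^pos factor.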
The topic information extraction module is used for dividing a conversation into different topic blocks, and specifically comprises: obtaining a sentence vector for each dialogue turn by using the Chinese pre-trained dialogue model CDial-GPT; computing the topic similarity of the sentence vectors of two adjacent dialogue turns with the C99 algorithm to obtain a semantic similarity score between the two turns; setting a score threshold and classifying two turns whose score exceeds the threshold as the same topic; and grouping adjacent turns with the same topic into one topic block, thereby distinguishing the different topics in the conversation.
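A simplified stand-in for this thresholded topic grouping is sketched below, using plain cosine similarity between adjacent sentence vectors in place of the full C99 algorithm; the vectors themselves would come from CDial-GPT in the described system:

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def segment_topics(sentence_vectors, threshold=0.5):
    """Group adjacent dialogue turns into topic blocks.

    Consecutive turns whose similarity exceeds the threshold share a topic;
    a drop below the threshold starts a new topic block. Returns a list of
    blocks, each a list of turn indices.
    """
    blocks = [[0]]
    for i in range(1, len(sentence_vectors)):
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) > threshold:
            blocks[-1].append(i)   # same topic as the previous turn
        else:
            blocks.append([i])     # similarity dropped: new topic block
    return blocks
```

For example, three turns whose vectors are `[1, 0]`, `[0.9, 0.1]`, `[0, 1]` yield two topic blocks, because the first two turns are nearly parallel while the third is nearly orthogonal to them.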
The sentence information extraction module is used for obtaining the vector representation of each round of dialogue, and specifically comprises: inserting the special symbols [CLS] and [SEP] into the input text sequence, converting the dialogue into the form D = {[CLS] u_1 [SEP], [CLS] u_2 [SEP], …, [CLS] u_m [SEP]}, encoding the input text with the pre-trained BART model, and selecting the vector representation at the [CLS] position as the vector of the dialogue sentence, thereby obtaining a vector representation of each round of dialogue.
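The reformatting step that precedes encoding can be sketched as a one-line transformation (the BART encoding itself is omitted here):

```python
def add_special_tokens(turns):
    """Wrap each dialogue turn u_i with [CLS]/[SEP] markers, producing
    D = {[CLS] u_1 [SEP], ..., [CLS] u_m [SEP]}, so that after encoding
    the vector at each [CLS] position can serve as that turn's sentence vector."""
    return ["[CLS] " + t + " [SEP]" for t in turns]
```

For example, `add_special_tokens(["hello", "any update?"])` returns `["[CLS] hello [SEP]", "[CLS] any update? [SEP]"]`, one marked segment per dialogue round.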
The full text information extraction module is used for obtaining the vector representation of the whole dialogue: the vector representations of each round of dialogue are aggregated by utilizing the context feature encoding capability of the BiLSTM, and the output of the BiLSTM is the vector representation of the whole dialogue.
The multi-feature fusion filtering module is used for filtering useless information in the dialogue by combining a plurality of semantic signals, namely the keyword information, topic information, sentence information and full text information, and enhancing the feature representation of the dialogue. The method specifically comprises: computing the semantic relevance between the dialogue words and the keyword information, topic information, sentence information and full text information respectively through a multi-head attention mechanism, and dynamically fusing the relevance results through learnable weights to obtain an importance score for each word; weakening the input representation of useless words according to the importance scores through a filtering mechanism, with the fused result quantified into a value between 0 and 1 by a Sigmoid function; and constructing a filtering unit based on a gating mechanism, which screens and filters the information in the dialogue according to the quantified result and enhances the input representation of the useful words.
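The Sigmoid-based gating step can be sketched as follows. For simplicity, the multi-head attention fusion is replaced by a precomputed per-word importance score, so the sketch shows only how the (0, 1) gate attenuates or preserves each word's input representation:

```python
import math

def sigmoid(x):
    """Squash a real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def filter_word_representations(word_vecs, importance_scores):
    """Gate each word vector by its fused importance score.

    A low (negative) score drives the gate toward 0, weakening a useless
    word's representation; a high score drives it toward 1, preserving a
    useful word's representation. The fused scores themselves are assumed
    to come from the attention-based fusion described above.
    """
    gated = []
    for vec, score in zip(word_vecs, importance_scores):
        g = sigmoid(score)
        gated.append([g * x for x in vec])
    return gated
```

For instance, a score of 0 halves the vector (gate 0.5), while strongly negative scores shrink it toward zero, which is the filtering behavior the gating unit is built to achieve.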
The abstract generation module is used for completing abstract generation, and specifically comprises: adopting the pre-trained language model BART as the dialogue encoder, a convolutional neural network as the keyword encoder, and a hierarchical Transformer network as the decoder. The decoder receives the filtered dialogue information and the keyword information, and uses the keywords to guide the generation of the abstract while useless information in the dialogue is filtered out. The outputs of the decoder are candidate abstracts, and the Beam Search algorithm selects the best one among them as the generated abstract.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

1. A dialogue abstract generating method is characterized in that: the method comprises the following steps:
performing dialogue preprocessing, processing noise words in the dialogue, and converting the dialogue into a text sequence form;
acquiring dialogue context representation, and encoding the preprocessed dialogue text sequence into word vector representation;
extracting multi-granularity semantic features, and respectively extracting semantic features in a dialogue by using an unsupervised method, wherein the semantic features comprise keyword information, topic information, sentence information and full-text information;
multiple features fusion filtering, fusing multiple semantic features, and filtering useless information in the dialogue;
and generating an abstract, namely generating a dialogue abstract according to the dialogue and the keyword information after multi-feature fusion and filtering.
2. The method for generating a dialogue digest according to claim 1, wherein the specific process of the dialogue preprocessing includes: the method comprises the steps of obtaining a data set, performing Chinese word segmentation on the data set by using a word segmentation device, removing meaningless noise words, unifying symbols in a dialogue, converting the dialogue form into a text sequence which can be processed by a model, and formatting the dialogue text.
3. The method for generating a dialogue digest according to claim 1, wherein the step of obtaining a dialogue context representation is specifically: the input dialog text is encoded using a pre-trained language model BART as an encoder to obtain a matrix vector representation of the dynamic dialog.
4. The method for generating a dialogue digest according to claim 1, wherein the specific step of extracting the keyword information includes:
scoring the importance of the words by combining the semantic information of the words, the occurrence frequency of the words, the part of speech of the words and the information entropy of the words, and obtaining keywords of the conversation based on the scoring result;
W_i^score = W_i^pos × (w_s × W_i^s + w_f × W_i^f + w_loss × W_i^loss)
wherein W_i^score denotes the score of the i-th word, W_i^pos denotes the part of speech of the i-th word, W_i^s denotes the semantic information of the i-th word, W_i^f denotes the occurrence frequency of the i-th word, W_i^loss denotes the information entropy of the i-th word, and w_s, w_f and w_loss respectively denote the contribution degrees of the semantic information, the occurrence frequency and the information entropy.
5. The method for generating a dialogue digest according to claim 1, wherein the specific step of extracting the subject information comprises:
obtaining sentence vectors of each dialogue in the dialogue by using a Chinese pre-training dialogue model;
calculating sentence vectors of adjacent two dialogue rounds through a topic similarity algorithm to obtain semantic similarity scores between the two dialogue rounds;
and setting a score threshold value, classifying two rounds of conversations with scores exceeding the threshold value into the same class of topics, and classifying adjacent conversations with the same topic into a topic block.
6. The method for generating a dialogue digest according to claim 1, wherein the specific step of extracting sentence information comprises:
inserting special symbols between the input text sequences, altering the format of the dialog sequences, specifically including adding special symbols [ CLS ] and [ SEP ] at the beginning and end of each round of dialog in the dialog sequences;
the pre-trained model BART is used to encode the input text and the vector representation at the CLS position is chosen as the vector for the dialogue statement.
7. The method for generating a dialogue digest according to claim 6, wherein said specific step of extracting full-text information comprises:
and encoding the dialogue sentence vectors with a bi-directional long short-term memory model BiLSTM, the obtained output being the vector representation of the full dialogue text.
8. The method for generating a dialogue digest according to claim 1, wherein said multi-feature fusion filtering includes two processes of multi-feature fusion and filtering;
the multi-feature fusion specifically comprises the following steps: calculating the semantic relevance of the keywords, topics, sentences and full text to the dialogue words by using a multi-head attention mechanism, and dynamically fusing the semantic relevance calculation results with learnable weights;
the filter mechanism is specifically as follows: quantifying the fusion result into a value between 0 and 1 through a Sigmoid function; and constructing a filtering unit based on a gating mechanism, and screening and filtering the information in the dialogue according to the quantized result.
9. The method for generating a dialogue digest according to claim 1, wherein said digest generating step comprises the steps of
A pre-training language model BART is adopted as a dialogue encoder, a convolutional neural network is adopted as a keyword encoder, and a hierarchical Transformer network is adopted as a decoder;
the decoder receives the filtered dialogue information and keyword information, and guides the generation of the abstract by using the keywords while filtering useless information in the dialogue;
the output of the decoder is a candidate digest, and the best digest is selected from the plurality of candidate digests as a generated digest.
10. A dialog digest generation system, the system comprising:
the dialogue preprocessing module is responsible for processing nonsensical noise words in the dialogue and converting the dialogue into a text sequence form;
a dialog context representation module that generates a word vector representation of a dialog using a BART model;
the keyword information extraction module is used for scoring the importance of the words according to semantic information of the words, occurrence frequency of the words, part of speech of the words and information entropy of the words, and obtaining keywords of the conversation based on the scoring result;
the topic information extraction module calculates vector representation of each dialog, calculates semantic similarity between two dialog rounds, and generalizes the dialog rounds with the semantic similarity into a topic to obtain a plurality of topic blocks containing topic information in the dialog;
the sentence information extraction module inserts special symbols between the input text sequences, alters the format of the dialogue sequences, encodes the input text by using a pre-training model BART, and acquires the vector of the dialogue sentence;
the full text information extraction module, which encodes the dialogue sentence vectors with a BiLSTM and outputs the vector representation of the full dialogue text;
the multi-feature fusion filtering module is used for filtering useless information in the dialogue by combining keyword information, theme information, statement information and full-text information, and enhancing feature representation of the dialogue;
and the abstract generating module is used for generating a dialogue abstract according to the filtered dialogue and keyword information.
CN202310455478.2A 2023-04-25 2023-04-25 Dialogue abstract generation method and system Pending CN116483991A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310455478.2A CN116483991A (en) 2023-04-25 2023-04-25 Dialogue abstract generation method and system

Publications (1)

Publication Number Publication Date
CN116483991A true CN116483991A (en) 2023-07-25

Family

ID=87226469

Country Status (1)

Country Link
CN (1) CN116483991A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117009501A (en) * 2023-10-07 2023-11-07 腾讯科技(深圳)有限公司 Method and related device for generating abstract information
CN117009501B (en) * 2023-10-07 2024-01-30 腾讯科技(深圳)有限公司 Method and related device for generating abstract information
CN117725928A (en) * 2024-02-18 2024-03-19 西南石油大学 Financial text abstracting method based on keyword heterograms and semantic matching
CN117725928B (en) * 2024-02-18 2024-04-30 西南石油大学 Financial text abstracting method based on keyword heterograms and semantic matching

Similar Documents

Publication Publication Date Title
US11194972B1 (en) Semantic sentiment analysis method fusing in-depth features and time sequence models
CN112101028B (en) Multi-feature bidirectional gating field expert entity extraction method and system
WO2019085779A1 (en) Machine processing and text correction method and device, computing equipment and storage media
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN109815476B (en) Word vector representation method based on Chinese morpheme and pinyin combined statistics
CN116483991A (en) Dialogue abstract generation method and system
Samanta et al. A deep generative model for code-switched text
KR20200119410A (en) System and Method for Recognizing Emotions from Korean Dialogues based on Global and Local Contextual Information
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN111814477B (en) Dispute focus discovery method and device based on dispute focus entity and terminal
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN112612871A (en) Multi-event detection method based on sequence generation model
CN114153973A (en) Mongolian multi-mode emotion analysis method based on T-M BERT pre-training model
CN113111663A (en) Abstract generation method fusing key information
CN115712731A (en) Multi-modal emotion analysis method based on ERNIE and multi-feature fusion
CN114428850A (en) Text retrieval matching method and system
Fu et al. RepSum: Unsupervised dialogue summarization based on replacement strategy
Neelima et al. A comprehensive review on word embedding techniques
CN114611520A (en) Text abstract generating method
Ahmed et al. Fine-tuning arabic pre-trained transformer models for egyptian-arabic dialect offensive language and hate speech detection and classification
CN112069816A (en) Chinese punctuation adding method, system and equipment
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
CN113901172B (en) Case-related microblog evaluation object extraction method based on keyword structural coding
Fernández et al. Identifying relevant phrases to summarize decisions in spoken meetings.
CN112270192A (en) Semantic recognition method and system based on filtering of part of speech and stop words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination