CN112397201B - Intelligent inquiry system-oriented repeated sentence generation optimization method - Google Patents


Info

Publication number: CN112397201B (application CN202011457520.7A)
Authority: CN (China)
Prior art keywords: sentence, repeated, sentences, template, word
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112397201A (en)
Inventors: 黄剑平, 丰仕琦
Current and original assignee: Hangzhou Normal University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Events: application filed by Hangzhou Normal University; priority to CN202011457520.7A; publication of CN112397201A; application granted; publication of CN112397201B


Classifications

    • G16H 50/20: ICT specially adapted for medical diagnosis; computer-aided diagnosis, e.g. based on medical expert systems
    • G06F 16/3329: Information retrieval; natural language query formulation or dialogue systems
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/23213: Non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/211: Natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/289: Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural network learning methods


Abstract

The invention discloses a repeated sentence generation optimization method for an intelligent inquiry system. Text clustering is performed on a Chinese medical inquiry corpus to obtain a repeated corpus; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence to be repeated and the repeated template group to obtain a candidate generated sentence set; finally, the comprehensive similarity score of each candidate generated sentence is calculated with an RNN-LM model and a CNN model based on similar and dissimilar information, and the best repeated generated sentence is selected from the candidate set.

Description

Intelligent inquiry system-oriented repeated sentence generation optimization method
Technical Field
The invention relates to the technical field of intelligent inquiry, in particular to a repeated sentence generation optimization method for an intelligent inquiry system.
Background
The intelligent inquiry system combines intelligent question answering with medical consultation, forming a question-answering system oriented to the medical field. An intelligent question-answering system is an interactive system that uses natural language processing, knowledge extraction and related technologies to analyze the natural language input by a user and return an accurate answer. Intelligent question-answering products not only provide a friendlier and more convenient mode of interaction, but also greatly improve efficiency in work and daily life.
However, current intelligent question-answering systems still understand language poorly and fall short of true intelligence, mainly in the form of low answer accuracy and limited question-answering domains, so making them more intelligent and humanized remains a major challenge. An intelligent question-answering system consists of three modules: question analysis (the system needs to understand what the user wants to ask), information retrieval (retrieving the information the user asked about), and answer extraction; several key technologies in the first two modules are not yet mature. The question analysis module must correctly identify and parse the user's intent and generate the corresponding retrieval information. The information retrieval module must match that intent precisely and perform full-match retrieval over the system corpus to obtain corpus resources that may contain the answer. Because user input is not fixed and the same semantic question can take many different sentence patterns, accurately understanding and retrieving the user's intent is difficult.
Applying the repeated sentence (paraphrase) method to the intelligent question-answering system is one effective way to solve the above problems. Repeating here means expressing the same meaning in different forms: the words or sentences input by the user are rewritten into multiple words and sentences with the same meaning but different expression forms. The method can also be used to generate synonymous corpora and expand the corpus scale.
Related research mainly covers repeated sentence generation based on bilingual parallel corpora, on template matching, and on residual LSTMs. The bilingual-parallel-corpus approach tends to extract many phrases without linguistic structure that interfere with generation; collecting high-quality bilingual parallel corpora consumes considerable manpower, and filtering methods have limited effect. The template-matching approach does not treat special function words and simplified sentence patterns separately during word segmentation, so template generalization is poor. The residual-LSTM approach lacks a large-scale, high-precision repeated corpus as a training set, which greatly limits its learning capability.
In view of the above, the invention focuses on how to use an existing medical inquiry data set to perform efficient template extraction and sentence pattern simplification, and how to use a deep learning algorithm to rank the generated repeated sentences, so as to obtain repeated sentences with higher accuracy.
Disclosure of Invention
In view of these technical problems, the invention provides a repeated sentence generation optimization method for an intelligent inquiry system. Based on a medical inquiry corpus, a repeated corpus is generated with a text clustering method; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence to be repeated and the repeated template group to obtain a candidate generated sentence set; finally, the sentences in the candidate set are ranked to obtain the best repeated generated sentence.
A method for optimizing the generation of a repeated sentence facing an intelligent inquiry system comprises the following steps:
(A) Selecting a question-answer data set which exists in a question-answer pair form and has a limited question length, wherein the question does not contain punctuation marks and modification limiting components;
(B) Text clustering is carried out on the question-answer data set, and questions with similar semantics are classified into the same cluster;
(C) Sentence pattern simplification and template extraction are carried out on all questions to obtain the corresponding repeated templates, where all repeated templates in one cluster form a repeated template group; the same sentence pattern simplification and template extraction are applied to the sentence to be repeated to obtain the sentence-to-be-repeated template;
(D) The template of the sentence to be repeated is searched against all template groups for a match; if a template identical to it is found in some template group, every template in that group can potentially be rewritten into a new repeated sentence, and a different repeated generated sentence is produced from each repeated template in the matched group;
(E) All repeated generated sentences are ranked by comprehensive similarity, and the repeated generated sentence with the highest comprehensive similarity is selected as the best one.
Preferably, in the step (A), the Chinese inquiry data set is collected in the form of question-answer pairs; the questions are classified into different categories according to symptoms; the dependency relations of the questions are analyzed; punctuation marks and modification limiting components are removed; the question length is limited to the range of [3, 20] Chinese characters; and the processed data set is retained.
Preferably, in the step (B), text clustering is performed on the question-answer data set with the K-means clustering method; the optimal cluster number is determined using the elbow method and the silhouette coefficient method; clustering is carried out on top of the existing symptom-based classification, so that semantically similar questions are gathered in the same cluster.
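As a minimal sketch of the clustering used in this step, the following pure-Python K-means (toy 2-dimensional points stand in for sentence representations; a real implementation would use a library such as scikit-learn) also computes the within-cluster sum of squares, the quantity the elbow method plots against different k:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns (centroids, labels)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

def inertia(points, centroids, labels):
    """Within-cluster sum of squares, as plotted by the elbow method."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
               for p, l in zip(points, labels))

# Two well-separated toy "question embedding" clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cents, labs = kmeans(pts, 2)
```

With well-separated data the inertia drops sharply when k reaches the true cluster count, which is the "elbow" the method looks for.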
Preferably, the specific steps of sentence pattern reduction and template extraction in the step (C) include:
(C-1) word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence with the jieba toolkit; the order of the words in the original sentence is kept unchanged, and the corresponding words in the sentence are then replaced with their part-of-speech tags and named entity tags respectively, forming a preliminary sentence template;
(C-2) replacing the special function word with a special function word label, and updating the preliminary sentence template to obtain a new sentence pattern template;
and (C-3) a syntactic tree is built through syntactic analysis, and the modification-relation parts that do not affect the sentence body are removed, simplifying the sentence pattern and yielding the corresponding repeated template.
Preferably, in the step (D), each matched repeated template in the repeated template group is compared with the template of the sentence to be repeated; the parts identical to that template are treated as word slots to be filled and the differing parts are kept; finally, the words of the sentence to be repeated are filled into the word slots in order, according to the tags corresponding to the slots, to generate the repeated sentence.
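A minimal sketch of the slot filling described above, under the simplifying assumption that sentences and templates are token sequences in which uppercase tags mark word slots (all tags and words here are hypothetical, not from the patent's corpus):

```python
def extract_template(tagged):
    """Replace each (word, tag) pair with its tag to form a sentence template."""
    return [tag for _, tag in tagged]

def generate(tagged_source, paraphrase_template):
    """Fill tag slots in a matched template with the source sentence's words.

    Slots are filled left-to-right with the next source word carrying the
    same tag; literal tokens in the template (the differing parts) are kept.
    """
    pool = {}
    for word, tag in tagged_source:
        pool.setdefault(tag, []).append(word)
    out = []
    for tok in paraphrase_template:
        if tok in pool and pool[tok]:
            out.append(pool[tok].pop(0))   # identical part: fill the word slot
        else:
            out.append(tok)                # differing part: keep the template's word
    return out

# Hypothetical tagged question and a matched template from the same group.
src = [("headache", "SYM"), ("what", "r"), ("do", "v")]
tpl = ["SYM", "should", "how", "v"]
print(generate(src, tpl))  # ['headache', 'should', 'how', 'do']
```

The source template `extract_template(src)` is what would be searched against the template groups; only groups containing an identical template trigger generation.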
Preferably, in the step (E), the calculation of the comprehensive similarity is performed by adopting an RNN-LM language model and a CNN model (convolutional neural network model) based on similar and dissimilar information, and the specific steps include:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) the cosine similarity matrix between the sentence to be repeated and the repeated generated sentence is calculated; for each word of the repeated generated sentence, a semantic matching vector is computed from its most similar words in the sentence to be repeated, and the word vectors of the repeated generated sentence are divided accordingly into vectors similar and dissimilar to the sentence to be repeated; likewise, semantic matching vectors are computed for the words of the sentence to be repeated from their most similar words in the repeated generated sentence, and the word vectors of the sentence to be repeated are divided into vectors similar and dissimilar to the repeated generated sentence;
(E-3) a two-channel CNN model is trained on the similarity matrix and the dissimilarity matrix formed from the similar and dissimilar vectors to obtain the feature vectors of the repeated generated sentence and the sentence to be repeated, and the similarity between the two sentences is calculated from the feature vectors as the CNN model score;
(E-4) the RNN-LM model score and the CNN model score are combined as the final comprehensive similarity score of each repeated generated sentence; all repeated generated sentences are ranked by score from high to low, and the first-ranked sentence is taken as the best repeated generated sentence.
Preferably, the specific steps of step (E-2) include:
(I) The repeated generated sentence is denoted the S sentence and the sentence to be repeated the T sentence; both are word-segmented, and a Word2vec model is used to obtain the corresponding word vectors, so that the S sentence and the T sentence are represented as matrices of word vectors of dimension d, containing m and n words respectively;
(II) The similarity matrix $A_{m\times n}$ of the S sentence and the T sentence is obtained with the cosine similarity algorithm; its entry $a_{ij}$ is the cosine similarity of $S_i$ and $T_j$, calculated as formula (1.1):

$$a_{ij} = \frac{S_i \cdot T_j}{\lVert S_i \rVert \, \lVert T_j \rVert} \tag{1.1}$$

where $S_i$ and $T_j$ are the $i$-th word vector of the S sentence and the $j$-th word vector of the T sentence, respectively;
The word corresponding to the maximum of $a_{ij}$ over $j$ is the word of the T sentence most similar to $S_i$; it is denoted $T_k$, and $S_i$ is represented by the weighted average of the word vectors in the context window of $T_k$ of size $w$, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$, calculated as formula (1.2):

$$\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij}\, T_j}{\sum_{j=k-w}^{k+w} a_{ij}} \tag{1.2}$$

where the semantic matching vector $\hat{S}_i$ represents the semantic coverage of the T sentence over $S_i$; $k = \arg\max_j a_{ij}$ is the index of the word $T_k$ most similar to $S_i$; $w$ is the size of the context window around $T_k$; $T_j$ is the $j$-th word vector of the T sentence; and the weight of each word is its cosine similarity $a_{ij}$ with $S_i$;
(III) Using $\hat{S}_i$, $S_i$ is decomposed into two vectors, one the similar vector of $S_i$ with respect to the T sentence and the other the dissimilar vector; the decomposition is based on the cosine similarity between $S_i$ and $\hat{S}_i$, as in formula (1.3):

$$S_i^+ = \alpha\, S_i, \qquad S_i^- = (1-\alpha)\, S_i, \qquad \alpha = \cos\!\left(S_i, \hat{S}_i\right) \tag{1.3}$$

where $\alpha$ is the cosine similarity of $S_i$ and $\hat{S}_i$, $S_i^+$ is the similar vector of $S_i$, and $S_i^-$ is the dissimilar vector of $S_i$;
(IV) The operations of steps (II) and (III) are carried out for each word in the S sentence, decomposing the S sentence into a similarity matrix $S^+ = [S_1^+, \dots, S_m^+]$ and a dissimilarity matrix $S^- = [S_1^-, \dots, S_m^-]$;
(V) Similarly, following steps (II)-(IV), the semantic matching vector $\hat{T}_j$ is obtained for each $T_j$, representing the semantic coverage of the S sentence over $T_j$, and the T sentence is decomposed into a similarity matrix $T^+$ and a dissimilarity matrix $T^-$.
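As a toy illustration of steps (II)-(IV), the following pure-Python sketch decomposes each word vector of the S sentence into its similar and dissimilar parts with respect to the T sentence (2-dimensional toy vectors stand in for Word2vec embeddings; clipping the window at sentence boundaries is an assumption not specified in the text):

```python
from math import sqrt

def cos(u, v):
    """Cosine similarity of two vectors (Eq. 1.1 for a pair of words)."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def decompose(S, T, w=1):
    """Split each word vector S_i into similar/dissimilar parts w.r.t. T.

    For each S_i: find the most similar word T_k, build the semantic matching
    vector S_hat_i as the a_ij-weighted average of T's word vectors in the
    window of size w around T_k (Eq. 1.2), then split S_i into
    S_i+ = alpha * S_i and S_i- = (1 - alpha) * S_i, alpha = cos(S_i, S_hat_i)
    (Eq. 1.3).
    """
    S_plus, S_minus = [], []
    for Si in S:
        sims = [cos(Si, Tj) for Tj in T]                    # row a_i* of A
        k = max(range(len(T)), key=lambda j: sims[j])       # index of T_k
        idx = range(max(0, k - w), min(len(T), k + w + 1))  # clipped window
        norm = sum(sims[j] for j in idx)
        S_hat = [sum(sims[j] * T[j][d] for j in idx) / norm
                 for d in range(len(Si))]
        alpha = cos(Si, S_hat)
        S_plus.append([alpha * x for x in Si])
        S_minus.append([(1 - alpha) * x for x in Si])
    return S_plus, S_minus

# Toy 2-d "word vectors": two words in S, two in T.
S = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [1.0, 1.0]]
Sp, Sm = decompose(S, T)
```

By construction each word vector satisfies $S_i = S_i^+ + S_i^-$, so the similar and dissimilar channels together preserve the original sentence matrix.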
Preferably, the specific steps of step (E-3) include:
For the S sentence, a group of filters $\{w_0, w_1\}$ is set on the similar channel and the dissimilar channel of the convolution layer; each filter has size $d \times h$, where $d$ is the word-vector dimension and $h$ is the window size, and generates a group of features as in formula (1.4):

$$c_{o,i} = f\!\left(w_0 \cdot S^+_{[i:i+h-1]} + w_1 \cdot S^-_{[i:i+h-1]} + b_c\right) \tag{1.4}$$

where $S^+_{[i:i+h-1]}$ and $S^-_{[i:i+h-1]}$ are the sub-matrices of $S^+$ and $S^-$ over an $h$-word window, $b_c$ is the bias, and $f$ is a nonlinear function; the operation $w_0 \cdot S^+_{[i:i+h-1]}$ denotes the weighted sum of all elements of $S^+_{[i:i+h-1]}$ with the weights of $w_0$, and $w_1 \cdot S^-_{[i:i+h-1]}$ the weighted sum of all elements of $S^-_{[i:i+h-1]}$ with the weights of $w_1$;
the convolution layer outputs a group of features $[c_{o,1}, c_{o,2}, \dots, c_{o,m-h+1}]$, which is input to a max-pooling layer that selects the maximum of the group as the output result, i.e. $c_o = \max_i c_{o,i}$;
Setting $O$ filter groups yields the feature vector $F_S = [c_0, c_1, \dots, c_{O-1}]$ of the S sentence;
The same operations are performed on the T sentence, likewise with $O$ filter groups, yielding its feature vector $F_T = [d_0, d_1, \dots, d_{O-1}]$;
The two feature vectors are then combined, expanded through a fully connected layer, and the result is normalized to obtain the similarity score of the S sentence and the T sentence as the CNN model score.
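A single filter of the two-channel convolution in formula (1.4), followed by max pooling, can be sketched in pure Python (toy vectors and filter weights; tanh stands in for the unspecified nonlinear function f):

```python
from math import tanh

def conv_feature(S_plus, S_minus, w0, w1, b=0.0):
    """One two-channel convolution filter plus max pooling.

    S_plus/S_minus: lists of d-dim word vectors (similar and dissimilar
    channels); w0/w1: h x d filter weights. Returns the pooled feature c_o.
    """
    h = len(w0)
    feats = []
    for i in range(len(S_plus) - h + 1):
        s = b
        for j in range(h):  # weighted sum over the h-word window, both channels
            s += sum(a * x for a, x in zip(w0[j], S_plus[i + j]))
            s += sum(a * x for a, x in zip(w1[j], S_minus[i + j]))
        feats.append(tanh(s))       # nonlinear function f
    return max(feats)               # max pooling over window positions

# Toy example: d = 2, h = 1, one filter pair.
Sp = [[1.0, 0.0], [0.5, 0.5]]
Sm = [[0.0, 0.0], [0.5, 0.5]]
c = conv_feature(Sp, Sm, w0=[[1.0, 1.0]], w1=[[-1.0, -1.0]])
```

Running O such filter pairs over $S^+/S^-$ and over $T^+/T^-$ would produce the feature vectors $F_S$ and $F_T$ fed to the fully connected layer.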
Preferably, in the step (E-4), the final comprehensive similarity score is calculated as formula (1.5):

$$\text{Score} = 0.0001\, s_1 + s_2 \tag{1.5}$$

where $s_1$ is the RNN-LM model score and $s_2$ is the CNN model score.
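The ranking of step (E-4) then reduces to sorting the candidates by Score = 0.0001 * s1 + s2; a minimal sketch with hypothetical scores:

```python
def rank_candidates(cands):
    """Rank candidate repeated sentences by Score = 0.0001 * s1 + s2 (Eq. 1.5).

    cands: list of (sentence, s1_rnnlm_score, s2_cnn_score). The small weight
    on s1 lets the RNN-LM fluency score break ties among candidates that the
    CNN judges equally similar. Returns the candidates sorted best-first.
    """
    return sorted(cands, key=lambda c: 0.0001 * c[1] + c[2], reverse=True)

# Hypothetical candidates: A and B tie on the CNN score, A is more fluent.
cands = [("cand A", 0.9, 0.70), ("cand B", 0.2, 0.70), ("cand C", 0.9, 0.50)]
best = rank_candidates(cands)[0][0]  # "cand A"
```

The weighting makes $s_2$ dominate: a candidate with a lower CNN similarity cannot outrank one with a higher CNN similarity regardless of its language-model score.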
Compared with the prior art, the main advantages of the invention are:
1) By introducing special function words and simplifying sentence patterns, sentence templates are expressed more reasonably and the generalization capability of the templates is expanded.
2) Candidate repeated generated sentences are screened with a new ranking method, which discriminates among them better and avoids the traditional deep learning methods' dependence on a high-quality, large-scale corpus.
Drawings
FIG. 1 is a schematic diagram of a method for optimizing sentence generation for an intelligent inquiry system according to an embodiment;
FIG. 2 is a schematic drawing of a preliminary template extraction in accordance with an embodiment;
FIG. 3 is a schematic diagram of functional feature words involved in an embodiment;
FIG. 4 is a simplified diagram of a modification relationship according to an embodiment;
FIG. 5 is a schematic diagram of the repeated sentence generation process involved in an embodiment;
Fig. 6 is a schematic diagram of a CNN model based on similar and dissimilar information according to an embodiment.
Detailed Description
The invention is further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples illustrate the invention and are not intended to limit its scope. Operations for which specific conditions are not noted in the examples below generally follow conventional conditions or the conditions recommended by the manufacturer.
In the repeated sentence generation optimization method for an intelligent inquiry system, as shown in fig. 1, text clustering is carried out on a Chinese medical inquiry corpus to obtain a repeated corpus; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence-to-be-repeated template and the repeated template group to obtain a candidate generated sentence set; finally, the comprehensive similarity score of each candidate generated sentence is calculated with an RNN-LM model and a CNN model based on similar and dissimilar information, and the best repeated generated sentence is selected from the candidate set. The method comprises the following steps:
(A) 79,210 Chinese medical inquiry records were collected in the form of question-answer pairs and classified by disease symptom into 240 categories. The dependency relations of each question were analyzed with the jieba tool and parts of speech were tagged; punctuation marks were deleted from the question resources, modification limiting components were removed, and the question length was limited to [3, 20] Chinese characters. The processed corpus contains 57,974 questions in 140 categories.
(B) Text clustering was performed on the data set with the K-means algorithm. Clustering tests with different cluster numbers k were run using the elbow method and the silhouette coefficient method, and the clustering effect was best at k = 12. With k = 12 set as the optimal cluster number, the data set preprocessed in step (A) was clustered so that each category of the corpus was grouped into 12 semantically distinct clusters of similar sentences, used as the repeated corpus for template extraction; semantically similar questions belong to the same cluster.
(C) Sentence pattern simplification and template extraction are carried out on all questions to obtain the corresponding repeated templates, where all repeated templates in one cluster form a repeated template group; the same sentence pattern simplification and template extraction are applied to the sentence to be repeated to obtain the sentence-to-be-repeated template;
The sentence pattern reduction and template extraction specifically comprises the following steps:
(C-1) word segmentation, part-of-speech tagging and named entity recognition were performed on each sentence in the repeated corpus with the jieba tool; the order of the words in the original sentence was kept unchanged, and the corresponding words were then replaced with their part-of-speech tags and named entity tags respectively, forming a preliminary sentence template, as shown in fig. 2;
(C-2) manually labeling some special function words, assigning a label to each function word, and finally forming a function feature word list, as shown in FIG. 3. Replacing the special function word with a special function word label, and updating the preliminary sentence template to obtain a new sentence pattern template;
and (C-3) a syntactic tree was built through syntactic analysis, and the modification-relation parts that do not affect the sentence body were removed, simplifying the sentence pattern and yielding the corresponding repeated template. In this embodiment, the parts of the sentence containing the 8 kinds of modification relations shown in fig. 4 were removed. These operations were applied in turn to the sentences in the similar-sentence clusters, finally yielding all repeated templates.
(D) The template of the sentence to be repeated is searched against all template groups for a match; if a template identical to it is found in some template group, every template in that group can potentially be rewritten into a new repeated sentence, and a different repeated generated sentence is produced from each repeated template in the matched group;
Each matched repeated template in the repeated template group is compared with the template of the sentence to be repeated: the parts identical to that template are filled as word slots, the differing parts are kept, and finally the words of the sentence to be repeated are filled into the word slots in order, according to the tags corresponding to the slots, to generate a repeated generated sentence; the process is illustrated in fig. 5.
(E) All repeated generated sentences are ranked by comprehensive similarity, and the repeated generated sentence with the highest comprehensive similarity is selected as the best one.
The comprehensive similarity calculation is carried out by adopting an RNN-LM language model and a CNN model (shown in figure 6) based on similar and dissimilar information, and the specific steps comprise:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) the cosine similarity matrix between the sentence to be repeated and the repeated generated sentence is calculated; for each word of the repeated generated sentence, a semantic matching vector is computed from its most similar words in the sentence to be repeated, and the word vectors of the repeated generated sentence are divided accordingly into vectors similar and dissimilar to the sentence to be repeated; likewise, semantic matching vectors are computed for the words of the sentence to be repeated from their most similar words in the repeated generated sentence, and the word vectors of the sentence to be repeated are divided into vectors similar and dissimilar to the repeated generated sentence;
The method comprises the following specific steps:
(I) The repeated generated sentence is denoted the S sentence and the sentence to be repeated the T sentence; both are word-segmented, and a Word2vec model is used to obtain the corresponding word vectors, so that the S sentence and the T sentence are represented as matrices of word vectors of dimension d, containing m and n words respectively;
(II) Whether the words in the S sentence can be replaced by words or phrases in the T sentence is judged through semantic similarity, and the similarity between the S sentence and the T sentence is calculated; specifically, each word $S_i$ of the S sentence can be represented by some of the word vectors of the T sentence, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$.
The similarity matrix $A_{m\times n}$ of the S sentence and the T sentence is obtained with the cosine similarity algorithm; its entry $a_{ij}$ is the cosine similarity of $S_i$ and $T_j$, calculated as formula (1.1):

$$a_{ij} = \frac{S_i \cdot T_j}{\lVert S_i \rVert \, \lVert T_j \rVert} \tag{1.1}$$

where $S_i$ and $T_j$ are the $i$-th word vector of the S sentence and the $j$-th word vector of the T sentence, respectively;
The word corresponding to the maximum of $a_{ij}$ over $j$ is the word of the T sentence most similar to $S_i$; it is denoted $T_k$, and $S_i$ is represented by the weighted average of the word vectors in the context window of $T_k$ of size $w$, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$, calculated as formula (1.2):

$$\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij}\, T_j}{\sum_{j=k-w}^{k+w} a_{ij}} \tag{1.2}$$

where the semantic matching vector $\hat{S}_i$ represents the semantic coverage of the T sentence over $S_i$; $k = \arg\max_j a_{ij}$ is the index of the word $T_k$ most similar to $S_i$; $w$ is the size of the context window around $T_k$; $T_j$ is the $j$-th word vector of the T sentence; and the weight of each word is its cosine similarity $a_{ij}$ with $S_i$ (taking $T_k$ as an example, its weight is the element $a_{ik}$ of the similarity matrix $A_{m\times n}$);
(III) Using \hat{S}_i, decompose S_i into two vectors: one serving as the similar vector of S_i with respect to the T sentence, the other as the dissimilar vector of S_i with respect to the T sentence. S_i is decomposed according to the cosine similarity between S_i and \hat{S}_i, giving the expression shown in formula (1.3):

S_i^{+} = \alpha S_i, \quad S_i^{-} = (1-\alpha) S_i, \quad \alpha = \cos(S_i, \hat{S}_i)    (1.3)
where \alpha denotes the cosine similarity of S_i and \hat{S}_i, S_i^{+} denotes the similar vector of S_i, and S_i^{-} denotes the dissimilar vector of S_i. It can be seen that the higher the similarity between S_i and \hat{S}_i, the larger the portion assigned to S_i^{+};
(IV) Perform the operations of steps (II) and (III) on each word in the S sentence, thereby decomposing the S sentence into a similarity matrix S^{+} and a dissimilarity matrix S^{-};
(V) Similarly, judge via semantic similarity whether the words in the T sentence can be replaced by words or phrases in the S sentence, and calculate the similarity between the T sentence and the S sentence. Specifically, each word T_j in the T sentence can be represented by part of the word vectors in the S sentence, yielding the semantic matching vector \hat{T}_j corresponding to T_j. Following steps (II)–(IV), the semantic matching vector \hat{T}_j computed for T_j represents the semantic coverage of T_j by the S sentence, and the T sentence is decomposed into a similarity matrix T^{+} and a dissimilarity matrix T^{-};
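The decomposition of steps (III)–(V) amounts to scaling each word vector by its cosine similarity to its semantic matching vector; a minimal sketch, with random arrays standing in for the real word vectors and matching vectors (by formula (1.3) the two parts always sum back to the original vectors):

```python
import numpy as np

def decompose(X, X_hat):
    """Formula (1.3): split each word vector X_i into a similar part
    alpha * X_i and a dissimilar part (1 - alpha) * X_i, where alpha is the
    cosine similarity between X_i and its semantic matching vector X_hat_i."""
    num = np.sum(X * X_hat, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1)
    alpha = (num / den)[:, None]          # one alpha per word, as a column
    return alpha * X, (1.0 - alpha) * X   # similarity matrix X+, dissimilarity matrix X-

# toy example: a 4-word sentence with d = 3
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))        # word vectors of the sentence
X_hat = rng.normal(size=(4, 3))    # stand-in semantic matching vectors
X_plus, X_minus = decompose(X, X_hat)
```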
(E-3) Using the above results, model the two sentences and calculate their comprehensive similarity: train a two-channel CNN model on the similarity matrices and dissimilarity matrices formed from the similar vectors and dissimilar vectors, obtain the feature vectors of the repeated generated sentence and the sentence to be repeated, and calculate the similarity between the repeated generated sentence and the sentence to be repeated from the feature vectors as the CNN model score;
The method comprises the following specific steps:
Taking the S sentence as an example, the core is to apply a set of filters {w_0, w_1} to the similar channel and the dissimilar channel of the convolution layer, where each filter has size d×h, d being the dimension of the word vectors and h the window size; the filters are used to generate a set of features, as shown in formula (1.4):

c_i = f\left(w_0 \cdot S^{+}_{[i:i+h-1]} + w_1 \cdot S^{-}_{[i:i+h-1]} + b_c\right)    (1.4)
where S^{+}_{[i:i+h-1]} and S^{-}_{[i:i+h-1]} denote the sub-matrices of h consecutive word vectors taken from S^{+} and S^{-}, b_c is the bias term, f is a nonlinear function, w_0 \cdot S^{+}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{+}_{[i:i+h-1]} according to the weights in w_0, and w_1 \cdot S^{-}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{-}_{[i:i+h-1]} according to the weights in w_1;
The convolution layer outputs a set of features c = [c_1, c_2, …, c_{m-h+1}], which is input to a max pooling layer; the maximum of the set is selected as the output, i.e. \hat{c} = \max_i c_i, so that sentences of different lengths are handled without great effect on the result. In the max pooling layer, since each filtering process produces a one-dimensional result, the number of filter groups ultimately determines the dimension of the feature vector.
Setting O groups of filters in this way (500 groups in this embodiment) yields the feature vector F_S = [c_0, c_1, …, c_{O-1}] of the S sentence;
The same operations are performed on the T sentence, likewise with O groups of filters (500 groups in this embodiment), to obtain the feature vector F_T = [d_0, d_1, …, d_{O-1}] of the T sentence;
The feature vectors are then processed through a fully connected layer and the corresponding result is normalized, giving the similarity score of the S sentence and the T sentence as the CNN model score.
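A sketch of the two-channel convolution and max pooling of formula (1.4); tanh is assumed for the nonlinear function f, random filters stand in for trained weights, and a small filter count O = 4 replaces the 500 groups of the embodiment:

```python
import numpy as np

def two_channel_features(S_plus, S_minus, filters, b_c=0.0, h=2):
    """Formula (1.4) plus max pooling: each filter group (w0, w1) slides a
    window of h word vectors over the similar and dissimilar channels;
    tanh stands in for the nonlinear function f, and taking the maximum over
    window positions yields one feature per filter group."""
    m, _ = S_plus.shape
    features = []
    for w0, w1 in filters:  # O filter groups in total
        c = [np.tanh(np.sum(w0 * S_plus[i:i + h]) +
                     np.sum(w1 * S_minus[i:i + h]) + b_c)
             for i in range(m - h + 1)]
        features.append(max(c))          # max pooling: c_hat = max_i c_i
    return np.array(features)            # feature vector F_S of length O

# toy example: m = 5 words, d = 3, window h = 2, O = 4 filter groups
rng = np.random.default_rng(2)
m, d, h, O = 5, 3, 2, 4
S_plus, S_minus = rng.normal(size=(m, d)), rng.normal(size=(m, d))
filters = [(rng.normal(size=(h, d)), rng.normal(size=(h, d))) for _ in range(O)]
F_S = two_channel_features(S_plus, S_minus, filters, h=h)
```

Because each filter group contributes exactly one pooled scalar, the length of F_S equals the number of filter groups, matching the remark that the filter count determines the feature-vector dimension.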
(E-4) Combine the scores of the RNN-LM model and the CNN model into the final comprehensive similarity score of each repeated generated sentence, sort all repeated generated sentences by score from high to low, and take the first-ranked repeated generated sentence as the best repeated generated sentence.
Through repeated comparative experiments, the calculation formula for the final comprehensive similarity score was determined as shown in formula (1.5):

Score = 0.0001 s_1 + s_2    (1.5)

where s_1 is the RNN-LM model score and s_2 is the CNN model score.
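The ranking of step (E-4) reduces to a weighted sum and a sort; a minimal sketch with hypothetical candidate sentences and scores:

```python
def final_score(s1, s2):
    """Formula (1.5): combined score from the normalized RNN-LM score s1
    and the CNN model score s2."""
    return 0.0001 * s1 + s2

# rank hypothetical candidate repeated generated sentences by combined score
candidates = [("candidate_a", 0.8, 0.91),   # (sentence, s1, s2)
              ("candidate_b", 0.9, 0.87)]
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]), reverse=True)
best = ranked[0][0]   # the first-ranked sentence is taken as the best repetition
```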
Further, it is to be understood that various changes and modifications of the present application may be made by those skilled in the art after reading the above description, and all such equivalents are intended to fall within the scope defined by the appended claims.

Claims (8)

1. A repeated sentence generation optimization method oriented to an intelligent inquiry system, characterized by comprising the following steps:
(A) Selecting a question-answer data set which exists in question-answer pair form and has limited question length, wherein the questions contain no punctuation marks and no modifying or limiting components;
(B) Performing text clustering on the question-answer data set, classifying semantically similar questions into the same cluster;
(C) Performing sentence pattern simplification and template extraction on all questions to obtain the corresponding repeated templates, wherein all repeated templates in one cluster form a repeated template group; performing the same sentence pattern simplification and template extraction on the sentence to be repeated to obtain the template of the sentence to be repeated;
The sentence pattern reduction and template extraction specifically comprises the following steps:
(C-1) Performing word segmentation, part-of-speech tagging and named entity recognition on each sentence using the jieba component while keeping the word order of the original sentence unchanged, and then replacing the corresponding words in the sentence with their part-of-speech tags and named entity tags respectively to form a preliminary sentence template;
(C-2) Replacing special function words with special function word labels, and updating the preliminary sentence template to obtain a new sentence pattern template;
(C-3) Establishing a syntax tree through syntactic analysis and removing modification relation parts that do not affect the sentence main body, thereby simplifying the sentence pattern and obtaining the corresponding repeated template;
(D) Searching and matching the template of the sentence to be repeated against all template groups; if a template identical to the template of the sentence to be repeated is found in a certain template group, it indicates that all templates in that template group can be rewritten into new repeated sentences, and different repeated generated sentences are generated according to all repeated templates in the matched template group;
(E) Sorting all repeated generated sentences according to comprehensive similarity, and selecting the repeated generated sentence with the highest comprehensive similarity as the best repeated generated sentence.
2. The repeated sentence generation optimization method according to claim 1, wherein in step (A), a Chinese inquiry data set in question-answer pair form is collected, the questions are classified into different categories according to symptoms, the dependency relationships of the questions are analyzed, punctuation marks and modifying or limiting components in the questions are removed, the question length is limited to the range of [3,20] Chinese characters, and the processed data set is retained.
3. The repeated sentence generation optimization method according to claim 2, wherein in step (B), text clustering is performed on the question-answer data set by the K-means clustering method, the optimal number of clusters is determined by the elbow method and the silhouette coefficient method, text clustering is performed on the basis of the existing classification by symptom, and semantically similar questions are gathered into the same cluster.
4. The repeated sentence generation optimization method according to claim 1, wherein in step (D), all matched repeated templates in the repeated template group are respectively compared with the template of the sentence to be repeated; the parts identical in the two templates are used as word slots, the differing parts are retained, and finally the words of the sentence to be repeated are filled into the word slots in order according to the labels corresponding to the word slots, thereby generating the repeated generated sentences.
5. The repeated sentence generation optimization method according to claim 1, wherein in step (E), the comprehensive similarity is calculated by using an RNN-LM language model together with a CNN model based on similarity and dissimilarity information, and the specific steps include:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) calculating the cosine similarity matrix of the sentence to be repeated and the repeated generated sentence; calculating, for the repeated generated sentence, semantic matching vectors by combining the most similar words in the sentence to be repeated, and dividing the word vectors of the repeated generated sentence into similar vectors and dissimilar vectors with respect to the sentence to be repeated according to the semantic matching vectors; similarly, calculating, for the sentence to be repeated, semantic matching vectors by combining the most similar words in the repeated generated sentence, and dividing the word vectors of the sentence to be repeated into similar vectors and dissimilar vectors with respect to the repeated generated sentence according to the semantic matching vectors;
(E-3) training, with a two-channel CNN model, on the similarity matrices and dissimilarity matrices respectively formed from the similar vectors and dissimilar vectors, obtaining the feature vectors of the repeated generated sentence and the sentence to be repeated, and calculating the similarity between the repeated generated sentence and the sentence to be repeated from the feature vectors as the CNN model score;
(E-4) comprehensively calculating the scores of the RNN-LM model and the CNN model as the final comprehensive similarity score of each repeated generated sentence, sorting all repeated generated sentences by score from high to low, and taking the first-ranked repeated generated sentence as the best repeated generated sentence.
6. The repeated sentence generation optimization method according to claim 5, wherein the specific steps of step (E-2) include:
(I) Denoting the repeated generated sentence as the S sentence and the sentence to be repeated as the T sentence, performing word segmentation on the S sentence and the T sentence respectively, and representing each word as a word vector of dimension d using a Word2vec model, so that the S sentence and the T sentence are represented as the vector matrices S and T, containing m and n words respectively;
(II) Obtaining the similarity matrix A_{m×n} of the S sentence and the T sentence through the cosine similarity algorithm, wherein the element a_{ij} of A_{m×n} is the cosine similarity of S_i and T_j, calculated as shown in formula (1.1):

a_{ij} = \cos(S_i, T_j) = \frac{S_i \cdot T_j}{\|S_i\|\,\|T_j\|}    (1.1)
where S_i and T_j are the i-th word vector in the S sentence and the j-th word vector in the T sentence, respectively;
The word corresponding to the maximum value of a_{ij} over j is the word in the T sentence most similar to S_i; denote it T_k. S_i is then represented by the weighted average of the word vectors in the context of T_k within a window of size w, from which the semantic matching vector \hat{S}_i corresponding to S_i is derived, as shown in formula (1.2):

\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij} T_j}{\sum_{j=k-w}^{k+w} a_{ij}}    (1.2)
where the semantic matching vector \hat{S}_i represents the semantic coverage of S_i by the T sentence, k = \arg\max_j a_{ij} is the index of the word T_k in the T sentence most similar to S_i, w is the window size of the context range of T_k, T_j is the j-th word vector in the T sentence, and the weight of each word is its cosine similarity a_{ij} with S_i;
(III) Using \hat{S}_i, decomposing S_i into two vectors, one serving as the similar vector of S_i with respect to the T sentence and the other as the dissimilar vector of S_i with respect to the T sentence, where S_i is decomposed according to the cosine similarity between S_i and \hat{S}_i, as shown in formula (1.3):

S_i^{+} = \alpha S_i, \quad S_i^{-} = (1-\alpha) S_i, \quad \alpha = \cos(S_i, \hat{S}_i)    (1.3)
where \alpha denotes the cosine similarity of S_i and \hat{S}_i, S_i^{+} denotes the similar vector of S_i, and S_i^{-} denotes the dissimilar vector of S_i;
(IV) Performing the operations of steps (II) and (III) on each word in the S sentence, thereby decomposing the S sentence into a similarity matrix S^{+} and a dissimilarity matrix S^{-};
(V) Similarly, obtaining the semantic matching vector \hat{T}_j for each word T_j through calculation according to steps (II)–(IV), \hat{T}_j representing the semantic coverage of T_j by the S sentence, and decomposing the T sentence into a similarity matrix T^{+} and a dissimilarity matrix T^{-}.
7. The repeated sentence generation optimization method according to claim 6, wherein the specific steps of step (E-3) include:
For the S sentence, a set of filters {w_0, w_1} is applied to the similar channel and the dissimilar channel of the convolution layer, where each filter has size d×h, d being the dimension of the word vectors and h the window size, the filters being used to generate a set of features, as shown in formula (1.4):

c_i = f\left(w_0 \cdot S^{+}_{[i:i+h-1]} + w_1 \cdot S^{-}_{[i:i+h-1]} + b_c\right)    (1.4)
where S^{+}_{[i:i+h-1]} and S^{-}_{[i:i+h-1]} denote the sub-matrices of h consecutive word vectors taken from S^{+} and S^{-}, b_c is the bias term, f is a nonlinear function, w_0 \cdot S^{+}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{+}_{[i:i+h-1]} according to the weights in w_0, and w_1 \cdot S^{-}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{-}_{[i:i+h-1]} according to the weights in w_1;
The convolution layer outputs a set of features c = [c_1, c_2, …, c_{m-h+1}]; the set of features is input to a max pooling layer, and the maximum of the set is selected as the output, i.e. \hat{c} = \max_i c_i;
Setting O groups of filters yields the feature vector F_S = [c_0, c_1, …, c_{O-1}] of the S sentence;
Performing the same operations on the T sentence, likewise with O groups of filters, yields the feature vector F_T = [d_0, d_1, …, d_{O-1}] of the T sentence;
The feature vectors are then processed through a fully connected layer and the corresponding result is normalized, giving the similarity score of the S sentence and the T sentence as the CNN model score.
8. The repeated sentence generation optimization method according to claim 5 or 7, wherein in step (E-4), the calculation formula of the final comprehensive similarity score is shown in formula (1.5):

Score = 0.0001 s_1 + s_2    (1.5)

where s_1 is the RNN-LM model score and s_2 is the CNN model score.
CN202011457520.7A 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method Active CN112397201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011457520.7A CN112397201B (en) 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method


Publications (2)

Publication Number Publication Date
CN112397201A CN112397201A (en) 2021-02-23
CN112397201B true CN112397201B (en) 2024-05-28

Family

ID=74625192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011457520.7A Active CN112397201B (en) 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method

Country Status (1)

Country Link
CN (1) CN112397201B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113971394A (en) * 2021-10-26 2022-01-25 上海交通大学 Text repeat rewriting system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008203717A (en) * 2007-02-22 2008-09-04 Oki Electric Ind Co Ltd Text sentence selecting method for corpus-based speech synthesis, and program thereof and device thereof
JP2016057815A (en) * 2014-09-09 2016-04-21 日本電信電話株式会社 Sentence rewrite processing device, learning device, method, and program
JP2016184210A (en) * 2015-03-25 2016-10-20 日本電信電話株式会社 Clause function part rewriting device, and learning device, method and program
JP2018073411A (en) * 2016-11-04 2018-05-10 株式会社リコー Natural language generation method, natural language generation device, and electronic apparatus
CN109710915A (en) * 2017-10-26 2019-05-03 华为技术有限公司 Repeat sentence generation method and device
CN110309289A (en) * 2019-08-23 2019-10-08 深圳市优必选科技股份有限公司 Sentence generation method, sentence generation device and intelligent equipment
CN111814451A (en) * 2020-05-21 2020-10-23 北京嘀嘀无限科技发展有限公司 Text processing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN112397201A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN111639171B (en) Knowledge graph question-answering method and device
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN109344250B (en) Rapid structuring method of single disease diagnosis information based on medical insurance data
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN112101028B (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN102262634B (en) Automatic questioning and answering method and system
CN112542223A (en) Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record
CN111858896B (en) Knowledge base question-answering method based on deep learning
CN111061882A (en) Knowledge graph construction method
CN112397201B (en) Intelligent inquiry system-oriented repeated sentence generation optimization method
CN112256939A (en) Text entity relation extraction method for chemical field
CN112101014B (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN111858842A (en) Judicial case screening method based on LDA topic model
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN114662495A (en) English literature pollutant information extraction method based on deep learning
CN110162651B (en) News content image-text disagreement identification system and identification method based on semantic content abstract
CN111444720A (en) Named entity recognition method for English text
CN113065352B (en) Method for identifying operation content of power grid dispatching work text
CN112651241A (en) Chinese parallel structure automatic identification method based on semi-supervised learning
CN116187323A (en) Knowledge graph in field of numerical control machine tool and construction method thereof
CN113569004B (en) Intelligent prompting method for modeling of restrictive natural language use case
CN114756617A (en) Method, system, equipment and storage medium for extracting structured data of engineering archives
Maheswari et al. Rule based morphological variation removable stemming algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant