CN112397201B - Intelligent inquiry system-oriented repeated sentence generation optimization method - Google Patents


Info

Publication number: CN112397201B (application CN202011457520.7A)
Authority: CN (China)
Prior art keywords: sentence, repeated, sentences, template, word
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112397201A (en)
Inventors: 黄剑平, 丰仕琦
Current and original assignee: Hangzhou Normal University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Events: application filed by Hangzhou Normal University; priority to CN202011457520.7A; publication of CN112397201A; application granted; publication of CN112397201B


Classifications

    • G16H 50/20: ICT specially adapted for medical diagnosis; computer-aided diagnosis, e.g. based on medical expert systems
    • G06F 16/3329: Information retrieval; natural language query formulation or dialogue systems
    • G06F 18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/23213: Non-hierarchical clustering techniques with a fixed number of clusters, e.g. K-means clustering
    • G06F 40/211: Natural language analysis; syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F 40/289: Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
    • G06N 3/045: Neural networks; combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Neural network learning methods


Abstract

The invention discloses a repeated sentence generation optimization method for an intelligent inquiry system. Text clustering is performed on a Chinese medical inquiry corpus to obtain a repeated corpus; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence to be repeated and the repeated template group to obtain a candidate generated sentence set; finally, the comprehensive similarity score of each candidate generated sentence is calculated with an RNN-LM model and a CNN model based on similar and dissimilar information, and the best repeated generated sentence is selected from the candidate set.

Description

Intelligent inquiry system-oriented repeated sentence generation optimization method
Technical Field
The invention relates to the technical field of intelligent inquiry, in particular to a repeated sentence generation optimization method for an intelligent inquiry system.
Background
The intelligent inquiry system combines intelligent question answering with medical consultation, forming a question-answering system oriented to the medical field. An intelligent question-answering system is an interactive system that uses natural language processing, knowledge extraction and related technologies to analyze the natural language input by a user and return an accurate answer. Intelligent question-answering products not only provide a friendlier and more convenient mode of interaction, but also greatly improve efficiency in work and daily life.
However, current intelligent question-answering systems still understand language poorly and fall short of true intelligence, mainly in the form of low answer accuracy and limited question-answering domains, so making them more intelligent and humanized remains a major challenge. An intelligent question-answering system consists of three modules: question analysis (the system needs to understand what the user wants to ask), information retrieval (retrieving the information the user asked about), and answer extraction; several key technologies in the first two modules are not yet mature. The question analysis module must correctly identify and parse the user's intent and generate the corresponding retrieval information. The information retrieval module must match that intent precisely and perform full-match retrieval over the system corpus to obtain corpus resources that may contain the answer. Because user input is not fixed and the same semantic question can take many different sentence patterns, accurately understanding and retrieving the user's intent is difficult.
Applying the repeated sentence (paraphrase) method to the intelligent question-answering system is one effective way to solve the above problems. Repeating here means expressing the same meaning in different forms: the words or sentences input by the user are rewritten into multiple words and sentences with the same meaning but different expression forms. The method can also be used to generate synonymous corpora and expand the corpus scale.
Related research mainly covers repeated sentence generation based on bilingual parallel corpora, on template matching, and on residual LSTMs. The bilingual-parallel-corpus approach tends to extract many phrases without linguistic structure that interfere with generation; collecting high-quality bilingual parallel corpora consumes considerable manpower, and filtering methods have limited effect. The template-matching approach does not treat special function words and simplified sentence patterns separately during word segmentation, so template generalization is poor. The residual-LSTM approach lacks a large-scale, high-precision repeated corpus as a training set, which greatly limits its learning capability.
In view of the above, the invention focuses on how to use an existing medical inquiry data set to perform efficient template extraction and sentence pattern simplification, and how to use a deep learning algorithm to rank the generated repeated sentences, so as to obtain repeated sentences with higher accuracy.
Disclosure of Invention
In view of these technical problems, the invention provides a repeated sentence generation optimization method for an intelligent inquiry system. Based on a medical inquiry corpus, a repeated corpus is generated with a text clustering method; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence to be repeated and the repeated template group to obtain a candidate generated sentence set; finally, the sentences in the candidate set are ranked to obtain the best repeated generated sentence.
A method for optimizing the generation of a repeated sentence facing an intelligent inquiry system comprises the following steps:
(A) Selecting a question-answer data set which exists in a question-answer pair form and has a limited question length, wherein the question does not contain punctuation marks and modification limiting components;
(B) Text clustering is carried out on the question-answer data set, and questions with similar semantics are classified into the same cluster;
(C) Sentence pattern simplification and template extraction are carried out on all questions to obtain the corresponding repeated templates, where all repeated templates in one cluster form a repeated template group; the same sentence pattern simplification and template extraction are applied to the sentence to be repeated to obtain the sentence-to-be-repeated template;
(D) The template of the sentence to be repeated is searched against all template groups for a match; if a template identical to it is found in some template group, every template in that group can potentially be rewritten into a new repeated sentence, and a different repeated generated sentence is produced from each repeated template in the matched group;
(E) All repeated generated sentences are ranked by comprehensive similarity, and the repeated generated sentence with the highest comprehensive similarity is selected as the best one.
Preferably, in the step (A), the Chinese inquiry data set is collected in the form of question-answer pairs; the questions are classified into different categories according to symptoms; the dependency relations of the questions are analyzed; punctuation marks and modification limiting components are removed; the question length is limited to the range of [3, 20] Chinese characters; and the processed data set is retained.
Preferably, in the step (B), text clustering is performed on the question-answer data set with the K-means clustering method; the optimal cluster number is determined using the elbow method and the silhouette coefficient method; clustering is carried out on top of the existing symptom-based classification, so that semantically similar questions are gathered in the same cluster.
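As a minimal sketch of the clustering used in this step, the following pure-Python K-means (toy 2-dimensional points stand in for sentence representations; a real implementation would use a library such as scikit-learn) also computes the within-cluster sum of squares, the quantity the elbow method plots against different k:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns (centroids, labels)."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, labels

def inertia(points, centroids, labels):
    """Within-cluster sum of squares, as plotted by the elbow method."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
               for p, l in zip(points, labels))

# Two well-separated toy "question embedding" clusters.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
cents, labs = kmeans(pts, 2)
```

With well-separated data the inertia drops sharply when k reaches the true cluster count, which is the "elbow" the method looks for.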
Preferably, the specific steps of sentence pattern reduction and template extraction in the step (C) include:
(C-1) word segmentation, part-of-speech tagging and named entity recognition are performed on each sentence with the jieba toolkit; the order of the words in the original sentence is kept unchanged, and the corresponding words in the sentence are then replaced with their part-of-speech tags and named entity tags respectively, forming a preliminary sentence template;
(C-2) replacing the special function word with a special function word label, and updating the preliminary sentence template to obtain a new sentence pattern template;
and (C-3) a syntactic tree is built through syntactic analysis, and the modification-relation parts that do not affect the sentence body are removed, simplifying the sentence pattern and yielding the corresponding repeated template.
Preferably, in the step (D), each matched repeated template in the repeated template group is compared with the template of the sentence to be repeated; the parts identical to that template are treated as word slots to be filled and the differing parts are kept; finally, the words of the sentence to be repeated are filled into the word slots in order, according to the tags corresponding to the slots, to generate the repeated sentence.
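A minimal sketch of the slot filling described above, under the simplifying assumption that sentences and templates are token sequences in which uppercase tags mark word slots (all tags and words here are hypothetical, not from the patent's corpus):

```python
def extract_template(tagged):
    """Replace each (word, tag) pair with its tag to form a sentence template."""
    return [tag for _, tag in tagged]

def generate(tagged_source, paraphrase_template):
    """Fill tag slots in a matched template with the source sentence's words.

    Slots are filled left-to-right with the next source word carrying the
    same tag; literal tokens in the template (the differing parts) are kept.
    """
    pool = {}
    for word, tag in tagged_source:
        pool.setdefault(tag, []).append(word)
    out = []
    for tok in paraphrase_template:
        if tok in pool and pool[tok]:
            out.append(pool[tok].pop(0))   # identical part: fill the word slot
        else:
            out.append(tok)                # differing part: keep the template's word
    return out

# Hypothetical tagged question and a matched template from the same group.
src = [("headache", "SYM"), ("what", "r"), ("do", "v")]
tpl = ["SYM", "should", "how", "v"]
print(generate(src, tpl))  # ['headache', 'should', 'how', 'do']
```

The source template `extract_template(src)` is what would be searched against the template groups; only groups containing an identical template trigger generation.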
Preferably, in the step (E), the calculation of the comprehensive similarity is performed by adopting an RNN-LM language model and a CNN model (convolutional neural network model) based on similar and dissimilar information, and the specific steps include:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) the cosine similarity matrix between the sentence to be repeated and the repeated generated sentence is calculated; for each word of the repeated generated sentence, a semantic matching vector is computed from its most similar words in the sentence to be repeated, and the word vectors of the repeated generated sentence are divided accordingly into vectors similar and dissimilar to the sentence to be repeated; likewise, semantic matching vectors are computed for the words of the sentence to be repeated from their most similar words in the repeated generated sentence, and the word vectors of the sentence to be repeated are divided into vectors similar and dissimilar to the repeated generated sentence;
(E-3) a two-channel CNN model is trained on the similarity matrix and the dissimilarity matrix formed from the similar and dissimilar vectors to obtain the feature vectors of the repeated generated sentence and the sentence to be repeated, and the similarity between the two sentences is calculated from the feature vectors as the CNN model score;
(E-4) the RNN-LM model score and the CNN model score are combined as the final comprehensive similarity score of each repeated generated sentence; all repeated generated sentences are ranked by score from high to low, and the first-ranked sentence is taken as the best repeated generated sentence.
Preferably, the specific steps of step (E-2) include:
(I) The repeated generated sentence is denoted the S sentence and the sentence to be repeated the T sentence; both are word-segmented, and a Word2vec model is used to obtain the corresponding word vectors, so that the S sentence and the T sentence are represented as matrices of word vectors of dimension d, containing m and n words respectively;
(II) The similarity matrix $A_{m\times n}$ of the S sentence and the T sentence is obtained with the cosine similarity algorithm; its entry $a_{ij}$ is the cosine similarity of $S_i$ and $T_j$, calculated as formula (1.1):

$$a_{ij} = \frac{S_i \cdot T_j}{\lVert S_i \rVert \, \lVert T_j \rVert} \tag{1.1}$$

where $S_i$ and $T_j$ are the $i$-th word vector of the S sentence and the $j$-th word vector of the T sentence, respectively;
The word corresponding to the maximum of $a_{ij}$ over $j$ is the word of the T sentence most similar to $S_i$; it is denoted $T_k$, and $S_i$ is represented by the weighted average of the word vectors in the context window of $T_k$ of size $w$, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$, calculated as formula (1.2):

$$\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij}\, T_j}{\sum_{j=k-w}^{k+w} a_{ij}} \tag{1.2}$$

where the semantic matching vector $\hat{S}_i$ represents the semantic coverage of the T sentence over $S_i$; $k = \arg\max_j a_{ij}$ is the index of the word $T_k$ most similar to $S_i$; $w$ is the size of the context window around $T_k$; $T_j$ is the $j$-th word vector of the T sentence; and the weight of each word is its cosine similarity $a_{ij}$ with $S_i$;
(III) Using $\hat{S}_i$, $S_i$ is decomposed into two vectors, one the similar vector of $S_i$ with respect to the T sentence and the other the dissimilar vector; the decomposition is based on the cosine similarity between $S_i$ and $\hat{S}_i$, as in formula (1.3):

$$S_i^+ = \alpha\, S_i, \qquad S_i^- = (1-\alpha)\, S_i, \qquad \alpha = \cos\!\left(S_i, \hat{S}_i\right) \tag{1.3}$$

where $\alpha$ is the cosine similarity of $S_i$ and $\hat{S}_i$, $S_i^+$ is the similar vector of $S_i$, and $S_i^-$ is the dissimilar vector of $S_i$;
(IV) The operations of steps (II) and (III) are carried out for each word in the S sentence, decomposing the S sentence into a similarity matrix $S^+ = [S_1^+, \dots, S_m^+]$ and a dissimilarity matrix $S^- = [S_1^-, \dots, S_m^-]$;
(V) Similarly, following steps (II)-(IV), the semantic matching vector $\hat{T}_j$ is obtained for each $T_j$, representing the semantic coverage of the S sentence over $T_j$, and the T sentence is decomposed into a similarity matrix $T^+$ and a dissimilarity matrix $T^-$.
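As a toy illustration of steps (II)-(IV), the following pure-Python sketch decomposes each word vector of the S sentence into its similar and dissimilar parts with respect to the T sentence (2-dimensional toy vectors stand in for Word2vec embeddings; clipping the window at sentence boundaries is an assumption not specified in the text):

```python
from math import sqrt

def cos(u, v):
    """Cosine similarity of two vectors (Eq. 1.1 for a pair of words)."""
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def decompose(S, T, w=1):
    """Split each word vector S_i into similar/dissimilar parts w.r.t. T.

    For each S_i: find the most similar word T_k, build the semantic matching
    vector S_hat_i as the a_ij-weighted average of T's word vectors in the
    window of size w around T_k (Eq. 1.2), then split S_i into
    S_i+ = alpha * S_i and S_i- = (1 - alpha) * S_i, alpha = cos(S_i, S_hat_i)
    (Eq. 1.3).
    """
    S_plus, S_minus = [], []
    for Si in S:
        sims = [cos(Si, Tj) for Tj in T]                    # row a_i* of A
        k = max(range(len(T)), key=lambda j: sims[j])       # index of T_k
        idx = range(max(0, k - w), min(len(T), k + w + 1))  # clipped window
        norm = sum(sims[j] for j in idx)
        S_hat = [sum(sims[j] * T[j][d] for j in idx) / norm
                 for d in range(len(Si))]
        alpha = cos(Si, S_hat)
        S_plus.append([alpha * x for x in Si])
        S_minus.append([(1 - alpha) * x for x in Si])
    return S_plus, S_minus

# Toy 2-d "word vectors": two words in S, two in T.
S = [[1.0, 0.0], [0.0, 1.0]]
T = [[1.0, 0.0], [1.0, 1.0]]
Sp, Sm = decompose(S, T)
```

By construction each word vector satisfies $S_i = S_i^+ + S_i^-$, so the similar and dissimilar channels together preserve the original sentence matrix.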
Preferably, the specific steps of step (E-3) include:
For the S sentence, a group of filters $\{w_0, w_1\}$ is set on the similar channel and the dissimilar channel of the convolution layer; each filter has size $d \times h$, where $d$ is the word-vector dimension and $h$ is the window size, and generates a group of features as in formula (1.4):

$$c_{o,i} = f\!\left(w_0 \cdot S^+_{[i:i+h-1]} + w_1 \cdot S^-_{[i:i+h-1]} + b_c\right) \tag{1.4}$$

where $S^+_{[i:i+h-1]}$ and $S^-_{[i:i+h-1]}$ are the sub-matrices of $S^+$ and $S^-$ over an $h$-word window, $b_c$ is the bias, and $f$ is a nonlinear function; the operation $w_0 \cdot S^+_{[i:i+h-1]}$ denotes the weighted sum of all elements of $S^+_{[i:i+h-1]}$ with the weights of $w_0$, and $w_1 \cdot S^-_{[i:i+h-1]}$ the weighted sum of all elements of $S^-_{[i:i+h-1]}$ with the weights of $w_1$;
the convolution layer outputs a group of features $[c_{o,1}, c_{o,2}, \dots, c_{o,m-h+1}]$, which is input to a max-pooling layer that selects the maximum of the group as the output result, i.e. $c_o = \max_i c_{o,i}$;
Setting $O$ filter groups yields the feature vector $F_S = [c_0, c_1, \dots, c_{O-1}]$ of the S sentence;
The same operations are performed on the T sentence, likewise with $O$ filter groups, yielding its feature vector $F_T = [d_0, d_1, \dots, d_{O-1}]$;
The two feature vectors are then combined, expanded through a fully connected layer, and the result is normalized to obtain the similarity score of the S sentence and the T sentence as the CNN model score.
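A single filter of the two-channel convolution in formula (1.4), followed by max pooling, can be sketched in pure Python (toy vectors and filter weights; tanh stands in for the unspecified nonlinear function f):

```python
from math import tanh

def conv_feature(S_plus, S_minus, w0, w1, b=0.0):
    """One two-channel convolution filter plus max pooling.

    S_plus/S_minus: lists of d-dim word vectors (similar and dissimilar
    channels); w0/w1: h x d filter weights. Returns the pooled feature c_o.
    """
    h = len(w0)
    feats = []
    for i in range(len(S_plus) - h + 1):
        s = b
        for j in range(h):  # weighted sum over the h-word window, both channels
            s += sum(a * x for a, x in zip(w0[j], S_plus[i + j]))
            s += sum(a * x for a, x in zip(w1[j], S_minus[i + j]))
        feats.append(tanh(s))       # nonlinear function f
    return max(feats)               # max pooling over window positions

# Toy example: d = 2, h = 1, one filter pair.
Sp = [[1.0, 0.0], [0.5, 0.5]]
Sm = [[0.0, 0.0], [0.5, 0.5]]
c = conv_feature(Sp, Sm, w0=[[1.0, 1.0]], w1=[[-1.0, -1.0]])
```

Running O such filter pairs over $S^+/S^-$ and over $T^+/T^-$ would produce the feature vectors $F_S$ and $F_T$ fed to the fully connected layer.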
Preferably, in the step (E-4), the final comprehensive similarity score is calculated as formula (1.5):

$$\text{Score} = 0.0001\, s_1 + s_2 \tag{1.5}$$

where $s_1$ is the RNN-LM model score and $s_2$ is the CNN model score.
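The ranking of step (E-4) then reduces to sorting the candidates by Score = 0.0001 * s1 + s2; a minimal sketch with hypothetical scores:

```python
def rank_candidates(cands):
    """Rank candidate repeated sentences by Score = 0.0001 * s1 + s2 (Eq. 1.5).

    cands: list of (sentence, s1_rnnlm_score, s2_cnn_score). The small weight
    on s1 lets the RNN-LM fluency score break ties among candidates that the
    CNN judges equally similar. Returns the candidates sorted best-first.
    """
    return sorted(cands, key=lambda c: 0.0001 * c[1] + c[2], reverse=True)

# Hypothetical candidates: A and B tie on the CNN score, A is more fluent.
cands = [("cand A", 0.9, 0.70), ("cand B", 0.2, 0.70), ("cand C", 0.9, 0.50)]
best = rank_candidates(cands)[0][0]  # "cand A"
```

The weighting makes $s_2$ dominate: a candidate with a lower CNN similarity cannot outrank one with a higher CNN similarity regardless of its language-model score.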
Compared with the prior art, the main advantages of the invention are:
1) By introducing special function words and simplifying sentence patterns, sentence templates are expressed more reasonably and the generalization capability of the templates is expanded.
2) Candidate repeated generated sentences are screened with a new ranking method, which discriminates among them better and avoids the traditional deep learning methods' dependence on a high-quality, large-scale corpus.
Drawings
FIG. 1 is a schematic diagram of a method for optimizing sentence generation for an intelligent inquiry system according to an embodiment;
FIG. 2 is a schematic drawing of a preliminary template extraction in accordance with an embodiment;
FIG. 3 is a schematic diagram of functional feature words involved in an embodiment;
FIG. 4 is a simplified diagram of a modification relationship according to an embodiment;
FIG. 5 is a schematic diagram of the repeated sentence generation process involved in an embodiment;
Fig. 6 is a schematic diagram of a CNN model based on similar and dissimilar information according to an embodiment.
Detailed Description
The invention is further elucidated below with reference to the drawings and specific embodiments. It should be understood that these examples illustrate the invention and are not intended to limit its scope. Operations for which specific conditions are not noted in the examples below generally follow conventional conditions or the conditions recommended by the manufacturer.
In the repeated sentence generation optimization method for an intelligent inquiry system, as shown in fig. 1, text clustering is carried out on a Chinese medical inquiry corpus to obtain a repeated corpus; the sentence to be repeated and the sentence templates in the repeated corpus are extracted; template matching and sentence generation are carried out between the sentence-to-be-repeated template and the repeated template group to obtain a candidate generated sentence set; finally, the comprehensive similarity score of each candidate generated sentence is calculated with an RNN-LM model and a CNN model based on similar and dissimilar information, and the best repeated generated sentence is selected from the candidate set. The method comprises the following steps:
(A) 79,210 Chinese medical inquiry records were collected in the form of question-answer pairs and classified by disease symptom into 240 categories. The dependency relations of each question were analyzed with the jieba tool and parts of speech were tagged; punctuation marks were deleted from the question resources, modification limiting components were removed, and the question length was limited to [3, 20] Chinese characters. The processed corpus contains 57,974 questions in 140 categories.
(B) Text clustering was performed on the data set with the K-means algorithm. Clustering tests with different cluster numbers k were run using the elbow method and the silhouette coefficient method, and the clustering effect was best at k = 12. With k = 12 set as the optimal cluster number, the data set preprocessed in step (A) was clustered so that each category of the corpus was grouped into 12 semantically distinct clusters of similar sentences, used as the repeated corpus for template extraction; semantically similar questions belong to the same cluster.
(C) Sentence pattern simplification and template extraction are carried out on all questions to obtain the corresponding repeated templates, where all repeated templates in one cluster form a repeated template group; the same sentence pattern simplification and template extraction are applied to the sentence to be repeated to obtain the sentence-to-be-repeated template;
The sentence pattern reduction and template extraction specifically comprises the following steps:
(C-1) word segmentation, part-of-speech tagging and named entity recognition were performed on each sentence in the repeated corpus with the jieba tool; the order of the words in the original sentence was kept unchanged, and the corresponding words were then replaced with their part-of-speech tags and named entity tags respectively, forming a preliminary sentence template, as shown in fig. 2;
(C-2) manually labeling some special function words, assigning a label to each function word, and finally forming a function feature word list, as shown in FIG. 3. Replacing the special function word with a special function word label, and updating the preliminary sentence template to obtain a new sentence pattern template;
and (C-3) a syntactic tree was built through syntactic analysis, and the modification-relation parts that do not affect the sentence body were removed, simplifying the sentence pattern and yielding the corresponding repeated template. In this embodiment, the parts of the sentence containing the 8 kinds of modification relations shown in fig. 4 were removed. These operations were applied in turn to the sentences in the similar-sentence clusters, finally yielding all repeated templates.
(D) The template of the sentence to be repeated is searched against all template groups for a match; if a template identical to it is found in some template group, every template in that group can potentially be rewritten into a new repeated sentence, and a different repeated generated sentence is produced from each repeated template in the matched group;
Each matched repeated template in the repeated template group is compared with the template of the sentence to be repeated: the parts identical to that template are filled as word slots, the differing parts are kept, and finally the words of the sentence to be repeated are filled into the word slots in order, according to the tags corresponding to the slots, to generate a repeated generated sentence; the process is illustrated in fig. 5.
(E) All repeated generated sentences are ranked by comprehensive similarity, and the repeated generated sentence with the highest comprehensive similarity is selected as the best one.
The comprehensive similarity calculation is carried out by adopting an RNN-LM language model and a CNN model (shown in figure 6) based on similar and dissimilar information, and the specific steps comprise:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) the cosine similarity matrix between the sentence to be repeated and the repeated generated sentence is calculated; for each word of the repeated generated sentence, a semantic matching vector is computed from its most similar words in the sentence to be repeated, and the word vectors of the repeated generated sentence are divided accordingly into vectors similar and dissimilar to the sentence to be repeated; likewise, semantic matching vectors are computed for the words of the sentence to be repeated from their most similar words in the repeated generated sentence, and the word vectors of the sentence to be repeated are divided into vectors similar and dissimilar to the repeated generated sentence;
The method comprises the following specific steps:
(I) The repeated generated sentence is denoted the S sentence and the sentence to be repeated the T sentence; both are word-segmented, and a Word2vec model is used to obtain the corresponding word vectors, so that the S sentence and the T sentence are represented as matrices of word vectors of dimension d, containing m and n words respectively;
(II) Whether the words in the S sentence can be replaced by words or phrases in the T sentence is judged through semantic similarity, and the similarity between the S sentence and the T sentence is calculated; specifically, each word $S_i$ of the S sentence can be represented by some of the word vectors of the T sentence, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$.
The similarity matrix $A_{m\times n}$ of the S sentence and the T sentence is obtained with the cosine similarity algorithm; its entry $a_{ij}$ is the cosine similarity of $S_i$ and $T_j$, calculated as formula (1.1):

$$a_{ij} = \frac{S_i \cdot T_j}{\lVert S_i \rVert \, \lVert T_j \rVert} \tag{1.1}$$

where $S_i$ and $T_j$ are the $i$-th word vector of the S sentence and the $j$-th word vector of the T sentence, respectively;
The word corresponding to the maximum of $a_{ij}$ over $j$ is the word of the T sentence most similar to $S_i$; it is denoted $T_k$, and $S_i$ is represented by the weighted average of the word vectors in the context window of $T_k$ of size $w$, giving the semantic matching vector $\hat{S}_i$ corresponding to $S_i$, calculated as formula (1.2):

$$\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij}\, T_j}{\sum_{j=k-w}^{k+w} a_{ij}} \tag{1.2}$$

where the semantic matching vector $\hat{S}_i$ represents the semantic coverage of the T sentence over $S_i$; $k = \arg\max_j a_{ij}$ is the index of the word $T_k$ most similar to $S_i$; $w$ is the size of the context window around $T_k$; $T_j$ is the $j$-th word vector of the T sentence; and the weight of each word is its cosine similarity $a_{ij}$ with $S_i$ (taking $T_k$ as an example, its weight is the element $a_{ik}$ of the similarity matrix $A_{m\times n}$);
(III) Using \hat{S}_i, decompose S_i into two vectors: one serving as the similar vector of S_i with respect to the T sentence, the other as the dissimilar vector of S_i with respect to the T sentence. S_i is decomposed according to the cosine similarity between S_i and \hat{S}_i, giving the expression shown in formula (1.3):

S_i^{+} = \alpha S_i, \quad S_i^{-} = (1-\alpha) S_i, \quad \alpha = \cos(S_i, \hat{S}_i)    (1.3)
where \alpha denotes the cosine similarity of S_i and \hat{S}_i, S_i^{+} denotes the similar vector of S_i, and S_i^{-} denotes the dissimilar vector of S_i. It can be seen that the higher the similarity between S_i and \hat{S}_i, the larger the portion assigned to S_i^{+};
(IV) Perform the operations of steps (II) and (III) on each word in the S sentence, thereby decomposing the S sentence into a similarity matrix S^{+} and a dissimilarity matrix S^{-};
(V) Similarly, judge via semantic similarity whether the words in the T sentence can be replaced by words or phrases in the S sentence, and calculate the similarity between the T sentence and the S sentence. Specifically, each word T_j in the T sentence can be represented by part of the word vectors in the S sentence, yielding the semantic matching vector \hat{T}_j corresponding to T_j. Following steps (II)–(IV), the semantic matching vector \hat{T}_j computed for T_j represents the semantic coverage of T_j by the S sentence, and the T sentence is decomposed into a similarity matrix T^{+} and a dissimilarity matrix T^{-};
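The decomposition of steps (III)–(V) amounts to scaling each word vector by its cosine similarity to its semantic matching vector; a minimal sketch, with random arrays standing in for the real word vectors and matching vectors (by formula (1.3) the two parts always sum back to the original vectors):

```python
import numpy as np

def decompose(X, X_hat):
    """Formula (1.3): split each word vector X_i into a similar part
    alpha * X_i and a dissimilar part (1 - alpha) * X_i, where alpha is the
    cosine similarity between X_i and its semantic matching vector X_hat_i."""
    num = np.sum(X * X_hat, axis=1)
    den = np.linalg.norm(X, axis=1) * np.linalg.norm(X_hat, axis=1)
    alpha = (num / den)[:, None]          # one alpha per word, as a column
    return alpha * X, (1.0 - alpha) * X   # similarity matrix X+, dissimilarity matrix X-

# toy example: a 4-word sentence with d = 3
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 3))        # word vectors of the sentence
X_hat = rng.normal(size=(4, 3))    # stand-in semantic matching vectors
X_plus, X_minus = decompose(X, X_hat)
```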
(E-3) Using the above results, model the two sentences and calculate their comprehensive similarity: train a two-channel CNN model on the similarity matrices and dissimilarity matrices formed from the similar vectors and dissimilar vectors, obtain the feature vectors of the repeated generated sentence and the sentence to be repeated, and calculate the similarity between the repeated generated sentence and the sentence to be repeated from the feature vectors as the CNN model score;
The method comprises the following specific steps:
Taking the S sentence as an example, the core is to apply a set of filters {w_0, w_1} to the similar channel and the dissimilar channel of the convolution layer, where each filter has size d×h, d being the dimension of the word vectors and h the window size; the filters are used to generate a set of features, as shown in formula (1.4):

c_i = f\left(w_0 \cdot S^{+}_{[i:i+h-1]} + w_1 \cdot S^{-}_{[i:i+h-1]} + b_c\right)    (1.4)
where S^{+}_{[i:i+h-1]} and S^{-}_{[i:i+h-1]} denote the sub-matrices of h consecutive word vectors taken from S^{+} and S^{-}, b_c is the bias term, f is a nonlinear function, w_0 \cdot S^{+}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{+}_{[i:i+h-1]} according to the weights in w_0, and w_1 \cdot S^{-}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{-}_{[i:i+h-1]} according to the weights in w_1;
The convolution layer outputs a set of features c = [c_1, c_2, …, c_{m-h+1}], which is input to a max pooling layer; the maximum of the set is selected as the output, i.e. \hat{c} = \max_i c_i, so that sentences of different lengths are handled without great effect on the result. In the max pooling layer, since each filtering process produces a one-dimensional result, the number of filter groups ultimately determines the dimension of the feature vector.
Setting O groups of filters in this way (500 groups in this embodiment) yields the feature vector F_S = [c_0, c_1, …, c_{O-1}] of the S sentence;
The same operations are performed on the T sentence, likewise with O groups of filters (500 groups in this embodiment), to obtain the feature vector F_T = [d_0, d_1, …, d_{O-1}] of the T sentence;
The feature vectors are then processed through a fully connected layer and the corresponding result is normalized, giving the similarity score of the S sentence and the T sentence as the CNN model score.
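A sketch of the two-channel convolution and max pooling of formula (1.4); tanh is assumed for the nonlinear function f, random filters stand in for trained weights, and a small filter count O = 4 replaces the 500 groups of the embodiment:

```python
import numpy as np

def two_channel_features(S_plus, S_minus, filters, b_c=0.0, h=2):
    """Formula (1.4) plus max pooling: each filter group (w0, w1) slides a
    window of h word vectors over the similar and dissimilar channels;
    tanh stands in for the nonlinear function f, and taking the maximum over
    window positions yields one feature per filter group."""
    m, _ = S_plus.shape
    features = []
    for w0, w1 in filters:  # O filter groups in total
        c = [np.tanh(np.sum(w0 * S_plus[i:i + h]) +
                     np.sum(w1 * S_minus[i:i + h]) + b_c)
             for i in range(m - h + 1)]
        features.append(max(c))          # max pooling: c_hat = max_i c_i
    return np.array(features)            # feature vector F_S of length O

# toy example: m = 5 words, d = 3, window h = 2, O = 4 filter groups
rng = np.random.default_rng(2)
m, d, h, O = 5, 3, 2, 4
S_plus, S_minus = rng.normal(size=(m, d)), rng.normal(size=(m, d))
filters = [(rng.normal(size=(h, d)), rng.normal(size=(h, d))) for _ in range(O)]
F_S = two_channel_features(S_plus, S_minus, filters, h=h)
```

Because each filter group contributes exactly one pooled scalar, the length of F_S equals the number of filter groups, matching the remark that the filter count determines the feature-vector dimension.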
(E-4) Combine the scores of the RNN-LM model and the CNN model into the final comprehensive similarity score of each repeated generated sentence, sort all repeated generated sentences by score from high to low, and take the first-ranked repeated generated sentence as the best repeated generated sentence.
Through repeated comparative experiments, the calculation formula for the final comprehensive similarity score was determined as shown in formula (1.5):

Score = 0.0001 s_1 + s_2    (1.5)

where s_1 is the RNN-LM model score and s_2 is the CNN model score.
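The ranking of step (E-4) reduces to a weighted sum and a sort; a minimal sketch with hypothetical candidate sentences and scores:

```python
def final_score(s1, s2):
    """Formula (1.5): combined score from the normalized RNN-LM score s1
    and the CNN model score s2."""
    return 0.0001 * s1 + s2

# rank hypothetical candidate repeated generated sentences by combined score
candidates = [("candidate_a", 0.8, 0.91),   # (sentence, s1, s2)
              ("candidate_b", 0.9, 0.87)]
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]), reverse=True)
best = ranked[0][0]   # the first-ranked sentence is taken as the best repetition
```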
Further, it is to be understood that various changes and modifications of the present application may be made by those skilled in the art after reading the above description, and all such equivalents are intended to fall within the scope defined by the appended claims.

Claims (8)

1. A repeated sentence generation optimization method oriented to an intelligent inquiry system, characterized by comprising the following steps:
(A) Selecting a question-answer data set which exists in question-answer pair form and has limited question length, wherein the questions contain no punctuation marks and no modifying or limiting components;
(B) Performing text clustering on the question-answer data set, classifying semantically similar questions into the same cluster;
(C) Performing sentence pattern simplification and template extraction on all questions to obtain the corresponding repeated templates, wherein all repeated templates in one cluster form a repeated template group; performing the same sentence pattern simplification and template extraction on the sentence to be repeated to obtain the template of the sentence to be repeated;
The sentence pattern reduction and template extraction specifically comprises the following steps:
(C-1) Performing word segmentation, part-of-speech tagging and named entity recognition on each sentence using the jieba component while keeping the word order of the original sentence unchanged, and then replacing the corresponding words in the sentence with their part-of-speech tags and named entity tags respectively to form a preliminary sentence template;
(C-2) Replacing special function words with special function word labels, and updating the preliminary sentence template to obtain a new sentence pattern template;
(C-3) Establishing a syntax tree through syntactic analysis and removing modification relation parts that do not affect the sentence main body, thereby simplifying the sentence pattern and obtaining the corresponding repeated template;
(D) Searching and matching the template of the sentence to be repeated against all template groups; if a template identical to the template of the sentence to be repeated is found in a certain template group, it indicates that all templates in that template group can be rewritten into new repeated sentences, and different repeated generated sentences are generated according to all repeated templates in the matched template group;
(E) Sorting all repeated generated sentences according to comprehensive similarity, and selecting the repeated generated sentence with the highest comprehensive similarity as the best repeated generated sentence.
2. The repeated sentence generation optimization method according to claim 1, wherein in step (A), a Chinese inquiry data set in question-answer pair form is collected, the questions are classified into different categories according to symptoms, the dependency relationships of the questions are analyzed, punctuation marks and modifying or limiting components in the questions are removed, the question length is limited to the range of [3,20] Chinese characters, and the processed data set is retained.
3. The repeated sentence generation optimization method according to claim 2, wherein in step (B), text clustering is performed on the question-answer data set by the K-means clustering method, the optimal number of clusters is determined by the elbow method and the silhouette coefficient method, text clustering is performed on the basis of the existing classification by symptom, and semantically similar questions are gathered into the same cluster.
4. The repeated sentence generation optimization method according to claim 1, wherein in step (D), all matched repeated templates in the repeated template group are respectively compared with the template of the sentence to be repeated; the parts identical in the two templates are used as word slots, the differing parts are retained, and finally the words of the sentence to be repeated are filled into the word slots in order according to the labels corresponding to the word slots, thereby generating the repeated generated sentences.
5. The repeated sentence generation optimization method according to claim 1, wherein in step (E), the comprehensive similarity is calculated by using an RNN-LM language model together with a CNN model based on similarity and dissimilarity information, and the specific steps include:
(E-1) scoring the repeated generated sentences by using an RNN-LM model, and normalizing to obtain scores of the RNN-LM model;
(E-2) calculating the cosine similarity matrix of the sentence to be repeated and the repeated generated sentence; calculating, for the repeated generated sentence, semantic matching vectors by combining the most similar words in the sentence to be repeated, and dividing the word vectors of the repeated generated sentence into similar vectors and dissimilar vectors with respect to the sentence to be repeated according to the semantic matching vectors; similarly, calculating, for the sentence to be repeated, semantic matching vectors by combining the most similar words in the repeated generated sentence, and dividing the word vectors of the sentence to be repeated into similar vectors and dissimilar vectors with respect to the repeated generated sentence according to the semantic matching vectors;
(E-3) training, with a two-channel CNN model, on the similarity matrices and dissimilarity matrices respectively formed from the similar vectors and dissimilar vectors, obtaining the feature vectors of the repeated generated sentence and the sentence to be repeated, and calculating the similarity between the repeated generated sentence and the sentence to be repeated from the feature vectors as the CNN model score;
(E-4) comprehensively calculating the scores of the RNN-LM model and the CNN model as the final comprehensive similarity score of each repeated generated sentence, sorting all repeated generated sentences by score from high to low, and taking the first-ranked repeated generated sentence as the best repeated generated sentence.
6. The repeated sentence generation optimization method according to claim 5, wherein the specific steps of step (E-2) include:
(I) Denoting the repeated generated sentence as the S sentence and the sentence to be repeated as the T sentence, performing word segmentation on the S sentence and the T sentence respectively, and representing each word as a word vector of dimension d using a Word2vec model, so that the S sentence and the T sentence are represented as the vector matrices S and T, containing m and n words respectively;
(II) Obtaining the similarity matrix A_{m×n} of the S sentence and the T sentence through the cosine similarity algorithm, wherein the element a_{ij} of A_{m×n} is the cosine similarity of S_i and T_j, calculated as shown in formula (1.1):

a_{ij} = \cos(S_i, T_j) = \frac{S_i \cdot T_j}{\|S_i\|\,\|T_j\|}    (1.1)
where S_i and T_j are the i-th word vector in the S sentence and the j-th word vector in the T sentence, respectively;
The word corresponding to the maximum value of a_{ij} over j is the word in the T sentence most similar to S_i; denote it T_k. S_i is then represented by the weighted average of the word vectors in the context of T_k within a window of size w, from which the semantic matching vector \hat{S}_i corresponding to S_i is derived, as shown in formula (1.2):

\hat{S}_i = \frac{\sum_{j=k-w}^{k+w} a_{ij} T_j}{\sum_{j=k-w}^{k+w} a_{ij}}    (1.2)
where the semantic matching vector \hat{S}_i represents the semantic coverage of S_i by the T sentence, k = \arg\max_j a_{ij} is the index of the word T_k in the T sentence most similar to S_i, w is the window size of the context range of T_k, T_j is the j-th word vector in the T sentence, and the weight of each word is its cosine similarity a_{ij} with S_i;
(III) Using \hat{S}_i, decomposing S_i into two vectors, one serving as the similar vector of S_i with respect to the T sentence and the other as the dissimilar vector of S_i with respect to the T sentence, where S_i is decomposed according to the cosine similarity between S_i and \hat{S}_i, as shown in formula (1.3):

S_i^{+} = \alpha S_i, \quad S_i^{-} = (1-\alpha) S_i, \quad \alpha = \cos(S_i, \hat{S}_i)    (1.3)
where \alpha denotes the cosine similarity of S_i and \hat{S}_i, S_i^{+} denotes the similar vector of S_i, and S_i^{-} denotes the dissimilar vector of S_i;
(IV) Performing the operations of steps (II) and (III) on each word in the S sentence, thereby decomposing the S sentence into a similarity matrix S^{+} and a dissimilarity matrix S^{-};
(V) Similarly, obtaining the semantic matching vector \hat{T}_j for each word T_j through calculation according to steps (II)–(IV), \hat{T}_j representing the semantic coverage of T_j by the S sentence, and decomposing the T sentence into a similarity matrix T^{+} and a dissimilarity matrix T^{-}.
7. The repeated sentence generation optimization method according to claim 6, wherein the specific steps of step (E-3) include:
For the S sentence, a set of filters {w_0, w_1} is applied to the similar channel and the dissimilar channel of the convolution layer, where each filter has size d×h, d being the dimension of the word vectors and h the window size, the filters being used to generate a set of features, as shown in formula (1.4):

c_i = f\left(w_0 \cdot S^{+}_{[i:i+h-1]} + w_1 \cdot S^{-}_{[i:i+h-1]} + b_c\right)    (1.4)
where S^{+}_{[i:i+h-1]} and S^{-}_{[i:i+h-1]} denote the sub-matrices of h consecutive word vectors taken from S^{+} and S^{-}, b_c is the bias term, f is a nonlinear function, w_0 \cdot S^{+}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{+}_{[i:i+h-1]} according to the weights in w_0, and w_1 \cdot S^{-}_{[i:i+h-1]} denotes the weighted sum of all elements of S^{-}_{[i:i+h-1]} according to the weights in w_1;
The convolution layer outputs a set of features c = [c_1, c_2, …, c_{m-h+1}]; the set of features is input to a max pooling layer, and the maximum of the set is selected as the output, i.e. \hat{c} = \max_i c_i;
Setting O groups of filters yields the feature vector F_S = [c_0, c_1, …, c_{O-1}] of the S sentence;
Performing the same operations on the T sentence, likewise with O groups of filters, yields the feature vector F_T = [d_0, d_1, …, d_{O-1}] of the T sentence;
The feature vectors are then processed through a fully connected layer and the corresponding result is normalized, giving the similarity score of the S sentence and the T sentence as the CNN model score.
8. The repeated sentence generation optimization method according to claim 5 or 7, wherein in step (E-4), the calculation formula of the final comprehensive similarity score is shown in formula (1.5):

Score = 0.0001 s_1 + s_2    (1.5)

where s_1 is the RNN-LM model score and s_2 is the CNN model score.
CN202011457520.7A 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method Active CN112397201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011457520.7A CN112397201B (en) 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method


Publications (2)

Publication Number Publication Date
CN112397201A CN112397201A (en) 2021-02-23
CN112397201B true CN112397201B (en) 2024-05-28

Family

ID=74625192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011457520.7A Active CN112397201B (en) 2020-12-10 2020-12-10 Intelligent inquiry system-oriented repeated sentence generation optimization method

Country Status (1)

Country Link
CN (1) CN112397201B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822034B (en) * 2021-06-07 2024-04-19 腾讯科技(深圳)有限公司 Method, device, computer equipment and storage medium for replying text
CN113971394A (en) * 2021-10-26 2022-01-25 上海交通大学 Text repeat rewriting system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008203717A (en) * 2007-02-22 2008-09-04 Oki Electric Ind Co Ltd Text sentence selecting method for corpus-based speech synthesis, and program thereof and device thereof
JP2016057815A (en) * 2014-09-09 2016-04-21 日本電信電話株式会社 Sentence rewrite processing device, learning device, method, and program
JP2016184210A (en) * 2015-03-25 2016-10-20 日本電信電話株式会社 Clause function part rewriting device, and learning device, method and program
JP2018073411A (en) * 2016-11-04 2018-05-10 株式会社リコー Natural language generation method, natural language generation device, and electronic apparatus
CN109710915A (en) * 2017-10-26 2019-05-03 华为技术有限公司 Repeat sentence generation method and device
CN110309289A (en) * 2019-08-23 2019-10-08 深圳市优必选科技股份有限公司 Sentence generation method, sentence generation device and intelligent equipment
CN111814451A (en) * 2020-05-21 2020-10-23 北京嘀嘀无限科技发展有限公司 Text processing method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN112397201A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112214610B (en) Entity relationship joint extraction method based on span and knowledge enhancement
CN111639171B (en) Knowledge graph question-answering method and device
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN109344250B (en) Rapid structuring method of single disease diagnosis information based on medical insurance data
CN113011533A (en) Text classification method and device, computer equipment and storage medium
CN112101028B (en) Multi-feature bidirectional gating field expert entity extraction method and system
CN109271506A (en) A kind of construction method of the field of power communication knowledge mapping question answering system based on deep learning
CN102262634B (en) Automatic questioning and answering method and system
CN112542223A (en) Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record
CN111858896B (en) Knowledge base question-answering method based on deep learning
CN111061882A (en) Knowledge graph construction method
CN112397201B (en) Intelligent inquiry system-oriented repeated sentence generation optimization method
CN112256939A (en) Text entity relation extraction method for chemical field
CN112101014B (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN111858842A (en) Judicial case screening method based on LDA topic model
CN117236338B (en) Named entity recognition model of dense entity text and training method thereof
CN114662495A (en) English literature pollutant information extraction method based on deep learning
CN110162651B (en) News content image-text disagreement identification system and identification method based on semantic content abstract
CN111444720A (en) Named entity recognition method for English text
CN113065352B (en) Method for identifying operation content of power grid dispatching work text
CN112651241A (en) Chinese parallel structure automatic identification method based on semi-supervised learning
CN116187323A (en) Knowledge graph in field of numerical control machine tool and construction method thereof
CN113569004B (en) Intelligent prompting method for modeling of restrictive natural language use case
CN114756617A (en) Method, system, equipment and storage medium for extracting structured data of engineering archives
Maheswari et al. Rule based morphological variation removable stemming algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant