CN109145292B - Paraphrase text depth matching model construction method and paraphrase text depth matching method


Info

Publication number
CN109145292B
Authority
CN
China
Prior art keywords: training sentence, sentence, training, semantic, test
Prior art date: 2018-07-26
Legal status: Active (assumed; not a legal conclusion)
Application number
CN201810836453.6A
Other languages: Chinese (zh)
Other versions: CN109145292A (en)
Inventor
孔蕾蕾
韩中元
齐浩亮
韩咏
孙栩
于海浩
Current Assignee: Heilongjiang Institute of Technology
Original Assignee
Heilongjiang Institute of Technology
Priority date: 2018-07-26
Filing date: 2018-07-26
Publication date: 2022-05-27
Application filed by Heilongjiang Institute of Technology
Priority claimed from CN201810836453.6A
Publication of CN109145292A (2019-01-04)
Application granted
Publication of CN109145292B (2022-05-27)
Legal status: Active
Anticipated expiration

Classifications

    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F18/24: Classification techniques (pattern recognition)
    • G06F40/30: Semantic analysis
    • G06N3/045: Combinations of networks (neural network architectures)
    • G06N3/084: Backpropagation, e.g. using gradient descent


Abstract

The invention provides a method for constructing a deep paraphrase text matching model and a deep paraphrase text matching method. The model construction method comprises the following steps: obtaining a training sample set, wherein the training sample set comprises a plurality of training sentence pairs, and each training sentence pair comprises a first training sentence, a second training sentence and a semantic matching score between the two; obtaining a semantic feature vector and a syntactic feature vector for each word in each training sentence, calculating the tensor product of the two vectors for each word, and determining the matrix sum of the tensor products corresponding to the words of the sentence as the syntactic and semantic interaction feature quantity of the training sentence; taking the matrix difference between the syntactic and semantic interaction feature quantities of the first and second training sentences of each pair as the syntactic and semantic interaction feature quantity of that pair; and training a convolutional neural network model on the training sentence pairs to obtain the deep paraphrase text matching model. The technology can improve the accuracy of text paraphrase recognition.

Description

Paraphrase text depth matching model construction method and paraphrase text depth matching method
Technical Field
The invention relates to text processing technology, and in particular to a method for constructing a deep paraphrase text matching model and a deep paraphrase text matching method.
Background
The goal of text paraphrase recognition is to determine whether two text segments have the same meaning, i.e., whether the two texts are semantically in a paraphrase relationship. Its essence is semantic matching between texts, involving problems such as the understanding of word senses and the calculation of text semantic similarity. Paraphrase recognition is therefore one of the most fundamental problems in natural language processing. In recent years, it has been widely applied in natural language processing tasks such as the recognition and generation of paraphrased sentences, machine translation, question answering, and text similarity calculation, and has attracted increasing attention from researchers.
Early paraphrase recognition methods generally recognized paraphrases based on exact matching of words (or word N-grams) or on the similarity of vectors in a word space: texts were represented by bag-of-words, N-gram, or TF-IDF models, and a text similarity measure (e.g., edit distance, longest common substring, Jaccard coefficient, or cosine distance) then decided whether two texts stand in a paraphrase relationship. However, paraphrased sentences usually change the surface form of the original text while preserving the semantics of the source sentence, by means such as synonym or antonym substitution of words and phrases, syntactic modification, sentence reduction, combination and recombination, word reordering, and concept generalization or specialization. Because word-matching-based paraphrase recognition does not consider the semantics of words, it cannot judge whether texts stand in a paraphrase relationship by computing their semantics.
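As a concrete illustration of these traditional baselines, the following sketch computes the Jaccard coefficient and the bag-of-words cosine similarity named above; the naive whitespace tokenization is an assumption for brevity, not part of the patent.

```python
from collections import Counter
import math

def jaccard(s1, s2):
    """Jaccard coefficient over word sets of two text segments."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(s1, s2):
    """Cosine similarity over bag-of-words count vectors."""
    va, vb = Counter(s1.split()), Counter(s2.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```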
Another approach, based on syntactic features and likewise ignoring semantics, has also been widely applied in paraphrase recognition, especially cross-language paraphrase recognition. These studies rest on the assumption that similar documents have similar syntactic structures: if two sentences describe the same event, they are likely to share the same dependency structures syntactically. However, paraphrase recognition that relies solely on the similarity of syntactic structures, without considering semantics, cannot handle the fact that two sentences whose syntactic structures differ do not necessarily differ in meaning.
The syntactic semantics of the Prague School holds that the sentence is the basic unit of the grammatical system and possesses a syntactic structure; a written natural-language sentence is not a simple set of words but text with a syntactic structure under grammatical constraints. Daneš argues that semantics and syntax are correlated, symbiotic companions: when we need to convey information and express it in a suitable way, the semantics and the syntax of the sentence work together. Researchers have applied this property of syntactic structure in many areas of natural language processing, with particular success in statistical machine translation based on syntactic and neural language models. This motivates us to combine syntax and semantics in text paraphrase recognition, which may contribute to performance improvements.
In recent years, with the rise of deep learning, text paraphrase recognition has shown a tendency to shift from traditional paraphrase recognition models to deep paraphrase recognition models. Various deep text matching models based on continuous representations have significantly improved the performance of paraphrase recognition tasks over traditional text matching models. These models focus on representing documents with deep learning methods and on identifying paraphrases by learning the degree and the structure of matching. In deep paraphrase recognition models, besides the widely accepted semantic information, researchers have also paid attention to the role of syntactic structure in constructing text representations and computing text semantic matching, and have proposed multi-semantic deep text matching models that integrate or exploit syntactic structure. These studies validate the role of syntactic structure in deep paraphrase recognition models for multilingual text.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of the above, the invention provides a paraphrase text depth matching model construction method and a paraphrase text depth matching method, so as to at least solve the problem of inaccurate recognition result of the existing text paraphrase recognition technology.
According to an aspect of the present invention, there is provided a paraphrase text depth matching model construction method including: obtaining a training sample set, wherein the training sample set comprises a plurality of training sentence pairs, each training sentence pair comprises a first training sentence, a second training sentence and a semantic matching score between the first training sentence and the second training sentence, and the semantic matching score represents the semantic similarity degree between two corresponding sentences; for each training sentence in a first training sentence and a second training sentence in each training sentence pair, obtaining a semantic feature vector and a syntactic feature vector of each word in the training sentence, and for each word in the training sentence, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the training sentence as a syntactic and semantic interaction feature quantity of the training sentence; aiming at each training sentence pair, taking the matrix difference between the syntax and the semantic interactive characteristic quantity of a first training sentence and a second training sentence in the training sentence pair as the syntax and the semantic interactive characteristic quantity of the training sentence pair; and training each training sentence pair by utilizing a convolutional neural network model, wherein the syntax and semantic interactive characteristic quantity of each training sentence pair are used as model input, the semantic matching scores of the first training sentence and the second training sentence are used as model output, and the trained convolutional neural network model is used as the paraphrase text deep matching model.
Further, for each training sentence in the first training sentence and the second training sentence in each training sentence pair, obtaining a syntactic feature vector of each word in the training sentence by: for each training sentence in the first training sentence and the second training sentence in each training sentence pair, obtaining a syntactic analysis result of the training sentence, so as to obtain a syntactic feature vector of each word in the training sentence according to the syntactic analysis result of the training sentence.
Further, the training sample set comprises positive example training sentence pairs and negative example training sentence pairs; the semantic matching score of the first training sentence and the second training sentence in the positive example training sentence pair is 1, and the semantic matching score of the first training sentence and the second training sentence in the negative example training sentence pair is 0.
Further, the training sample set comprises positive example training sentence pairs and negative example training sentence pairs; the semantic matching scores of the first training sentence and the second training sentence in the positive example training sentence pair are larger than a first threshold, and the semantic matching scores of the first training sentence and the second training sentence in the negative example training sentence pair are smaller than or equal to a second threshold.
Further, the convolutional neural network model comprises a convolutional layer, a max-pooling layer and a multi-layer perceptron.
The invention also provides a paraphrase text depth matching method, which comprises the following steps: adopting the paraphrase text depth matching model construction method to construct a paraphrase text depth matching model in advance; obtaining a test sentence pair, wherein the test sentence pair comprises a first test sentence and a second test sentence; for each test statement in a first test statement and a second test statement in the test statement pair, acquiring a semantic feature vector and a syntactic feature vector of each word in the test statement, and for each word in the test statement, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the test statement as a syntactic and semantic interaction feature quantity of the test statement; taking the matrix difference between the syntax and the semantic interactive characteristic quantity of the first test statement and the second test statement in the test statement pair as the syntax and the semantic interactive characteristic quantity of the test statement pair; and inputting the syntax and semantic interactive characteristic quantity of the test sentence pair into the paraphrase text deep matching model to obtain semantic matching scores of a first test sentence and a second test sentence in the test sentence pair.
Further, for each test sentence in the first test sentence and the second test sentence in the test sentence pair, obtaining a syntactic feature vector of each word in the test sentence by: and aiming at each test statement in the first test statement and the second test statement in the test statement pair, obtaining a syntactic analysis result of the test statement so as to obtain a syntactic feature vector of each word in the test statement according to the syntactic analysis result of the test statement.
The model applies tensors to realize the interaction of syntax and semantics, uses syntactic information to enhance the semantic representation of sentences, and performs pattern extraction and paraphrase recognition in the paraphrase text matching space through a convolutional neural network, so that recognition accuracy is improved over the prior art. The model is evaluated on the classical public paraphrase data set MSRP (Microsoft Research Paraphrase) and on the PAN 2010 and PAN 2012 paraphrase corpora constructed from plagiarism detection data sets; comparison with a traditional text paraphrase recognition model and several deep text paraphrase recognition models verifies the effectiveness of the method.
These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings.
Drawings
The invention may be better understood by referring to the following description in conjunction with the accompanying drawings, in which like reference numerals are used throughout the figures to indicate like or similar parts. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate preferred embodiments of the present invention and, together with the detailed description, serve to further illustrate the principles and advantages of the invention. In the drawings:
FIG. 1 is a flow chart schematically illustrating one example of a paraphrase text depth matching model construction method of the present invention;
FIG. 2 is a flow chart that schematically illustrates one example of a paraphrase text depth matching method of the present invention;
FIG. 3 is a diagram illustrating the interaction of syntactic and semantic features;
FIG. 4 is a schematic diagram illustrating the building process of a deep paraphrase text matching model of interactive syntax and semantics.
Skilled artisans appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve the understanding of the embodiments of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which will vary from one implementation to another. Moreover, it will be appreciated that such a development effort might be complex and time-consuming, but would nevertheless be a routine undertaking for those of ordinary skill in the art having the benefit of this disclosure.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
The embodiment of the invention provides a construction method of a paraphrase text depth matching model, which comprises the following steps: obtaining a training sample set, wherein the training sample set comprises a plurality of training sentence pairs, each training sentence pair comprises a first training sentence, a second training sentence and a semantic matching score between the first training sentence and the second training sentence, and the semantic matching score represents the semantic similarity degree between two corresponding sentences; for each training sentence in a first training sentence and a second training sentence in each training sentence pair, obtaining a semantic feature vector and a syntactic feature vector of each word in the training sentence, and for each word in the training sentence, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the training sentence as a syntactic and semantic interaction feature quantity of the training sentence; aiming at each training sentence pair, taking the matrix difference between the syntax and the semantic interactive characteristic quantity of a first training sentence and a second training sentence in the training sentence pair as the syntax and the semantic interactive characteristic quantity of the training sentence pair; and training each training sentence pair by utilizing a convolutional neural network model, wherein the syntax and semantic interactive characteristic quantity of each training sentence pair are used as model input, the semantic matching scores of the first training sentence and the second training sentence are used as model output, and the trained convolutional neural network model is used as the paraphrase text deep matching model.
FIG. 1 is a flow chart of the construction method of paraphrase text depth matching model of the present invention.
As shown in fig. 1, after the construction method of the paraphrase text deep matching model starts, in step S110, a training sample set is obtained, where the training sample set includes a plurality of training sentence pairs, each training sentence pair includes a first training sentence, a second training sentence, and a semantic matching score between the first training sentence and the second training sentence, where the semantic matching score represents a semantic similarity degree between two corresponding sentences. Then, step S120 is performed.
The semantic matching score between the first training sentence and the second training sentence in each training sentence pair is, for example, labeled in advance.
For example, the training sample set may include a positive example training sentence pair and a negative example training sentence pair; the semantic matching score of the first training sentence and the second training sentence in the positive example training sentence pair is, for example, 1, and the semantic matching score of the first training sentence and the second training sentence in the negative example training sentence pair is, for example, 0.
As another example, the training sample set includes a positive example training sentence pair and a negative example training sentence pair; the semantic matching score of the first training sentence and the second training sentence in the positive training sentence pair is larger than a first threshold value, and the semantic matching score of the first training sentence and the second training sentence in the negative training sentence pair is smaller than or equal to a second threshold value. The first threshold may be greater than the second threshold, or the first threshold may be equal to the second threshold, for example. Further, the first threshold value and the second threshold value may be set to empirical values, for example.
In step S120, for each of the first training sentence and the second training sentence in each training sentence pair, a semantic feature vector and a syntactic feature vector of each word in the training sentence are obtained, and for each word in the training sentence, a tensor product of the semantic feature vector and the syntactic feature vector of the word is calculated, so as to determine a matrix sum of tensor products corresponding to each word in the training sentence as a syntactic and semantic interaction feature quantity of the training sentence.
The "tensor product of the semantic feature vector and the syntactic feature vector of the word" is a tensor product of two vectors, namely the semantic feature vector of the word and the syntactic feature vector of the word, wherein the tensor product is a matrix.
The "matrix sum of tensor products corresponding to each word in the training sentence" is to perform matrix addition on the tensor products corresponding to each word in the training sentence, and the obtained sum is the matrix sum.
For each of the first training sentence and the second training sentence in each training sentence pair, a syntactic feature vector of each word in the training sentence may be obtained by: for each training sentence in the first training sentence and the second training sentence in each training sentence pair, obtaining a syntactic analysis result of the training sentence, so as to obtain a syntactic feature vector of each word in the training sentence according to the syntactic analysis result of the training sentence.
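The computation of step S120, together with the pair-level matrix difference of step S130 below, can be sketched in a few lines. This is a minimal illustration, assuming the semantic and syntactic vectors of each word have already been obtained; it is not a definitive implementation of the patented method.

```python
import numpy as np

def interaction_feature(sem_vecs, syn_vecs):
    """Syntactic and semantic interaction feature of one sentence:
    the matrix sum, over all words, of the tensor (outer) product of
    each word's semantic feature vector and syntactic feature vector."""
    T = np.zeros((len(sem_vecs[0]), len(syn_vecs[0])))
    for e, g in zip(sem_vecs, syn_vecs):
        T += np.outer(e, g)  # tensor product of two vectors is a matrix
    return T

def pair_feature(T1, T2):
    """Interaction feature of a sentence pair: the matrix difference."""
    return T1 - T2
```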
In step S130, for each training sentence pair, a matrix difference between the syntax and semantic interaction feature quantities of the first training sentence and the second training sentence in the training sentence pair is used as the syntax and semantic interaction feature quantity of the training sentence pair.
The matrix difference between the syntactic and semantic interaction feature quantities of the first training sentence and the second training sentence in a training sentence pair is obtained by matrix subtraction: the interaction feature quantity of the second training sentence is subtracted from that of the first training sentence, and the resulting difference is the matrix difference.
Then, in step S140, each training sentence pair is trained by using a convolutional neural network model, where the syntax and semantic interaction feature quantity of each training sentence pair are used as model inputs, the semantic matching scores of the first training sentence and the second training sentence included in the training sentence pair are used as model outputs, and the trained convolutional neural network model is used as a paraphrase text deep matching model.
Convolutional neural network models include, for example, convolutional layers, max-pooling layers, and multi-layer perceptrons.
The process of training the convolutional neural network model can be implemented by the prior art, and is not described herein again.
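As one possible realization of the convolutional layer, max-pooling layer and multi-layer perceptron named above, the following PyTorch sketch may help; the channel count, kernel size, pooling output size and hidden width are illustrative assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class ParaphraseMatcher(nn.Module):
    """Minimal sketch: convolution -> max pooling -> MLP classifier."""
    def __init__(self, hidden=128):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)  # convolution layer
        self.pool = nn.AdaptiveMaxPool2d((4, 4))               # max-pooling layer
        self.mlp = nn.Sequential(                              # multi-layer perceptron
            nn.Linear(8 * 4 * 4, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, A):            # A: (batch, sem_dim, syn_dim) pair feature
        x = torch.relu(self.conv(A.unsqueeze(1)))  # add channel axis
        x = self.pool(x).flatten(1)
        return self.mlp(x).squeeze(-1)             # matching score in [0, 1]
```

Training would then minimize a cross-entropy loss between the predicted score and the labeled semantic matching score, in line with the loss function given in the detailed description below.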
The invention also provides a paraphrase text depth matching method, which comprises the following steps: adopting the construction method to construct a paraphrase text depth matching model in advance; obtaining a test sentence pair, wherein the test sentence pair comprises a first test sentence and a second test sentence; for each test statement in a first test statement and a second test statement in the test statement pair, acquiring a semantic feature vector and a syntactic feature vector of each word in the test statement, and for each word in the test statement, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the test statement as a syntactic and semantic interaction feature quantity of the test statement; taking the matrix difference between the syntax and the semantic interactive characteristic quantity of the first test statement and the second test statement in the test statement pair as the syntax and the semantic interactive characteristic quantity of the test statement pair; and inputting the syntax and semantic interactive characteristic quantity of the test sentence pair into the paraphrase text deep matching model to obtain semantic matching scores of a first test sentence and a second test sentence in the test sentence pair.
FIG. 2 is a flow chart of the paraphrase text depth matching method described above.
As shown in fig. 2, after the paraphrase text depth matching method starts, in step S210, a paraphrase text depth matching model is constructed in advance using the paraphrase text depth matching model construction method shown in fig. 1. Then, step S220 is performed.
In step S220, a test sentence pair is obtained, the test sentence pair including a first test sentence and a second test sentence. Then, step S230 is performed.
In step S230, for each test sentence in the first test sentence and the second test sentence in the test sentence pair, the semantic feature vector and the syntactic feature vector of each word in the test sentence are obtained, and for each word in the test sentence, the tensor product of the semantic feature vector and the syntactic feature vector of the word is calculated, so as to determine the matrix sum of the tensor products corresponding to each word in the test sentence as the syntactic and semantic interaction feature quantity of the test sentence. Then, step S240 is performed.
For each test statement in the first test statement and the second test statement in the test statement pair, the syntactic feature vector of each word in the test statement can be obtained through the following processes: for each test statement in a first test statement and a second test statement in a test statement pair, obtaining a syntactic analysis result of the test statement, so as to obtain a syntactic feature vector of each word in the test statement according to the syntactic analysis result of the test statement.
The processes of calculating the tensor product, the matrix sum and the matrix difference in step S230 and step S240 may be respectively similar to the processes of calculating the tensor product, the matrix sum and the matrix difference for the training sentence, and are not described herein again.
For each training sentence or test sentence, the semantic feature vector of each word in the sentence can be obtained using the prior art (see, for example, Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013: 1-13; and Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013: 3111-3119).
For each training sentence or each test sentence, the syntactic analysis result of the sentence can be obtained by the prior art, and the syntactic feature vector of each word in the sentence can also be obtained by the prior art.
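For the syntactic side, one hedged sketch of the binary syntactic feature vector described in the preferred example below (indicators such as "is the subject", "is a noun") uses an off-the-shelf dependency parser; the particular feature inventory and the spaCy model name are illustrative assumptions, not prescribed by the patent.

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")  # any syntactic parser would do

# Illustrative inventory of syntactic characteristics g_1..g_m; the
# patent only requires each g_m to be a 0/1 indicator of a fixed
# syntactic property of the word.
SYN_FEATURES = [
    lambda t: t.dep_ == "nsubj",  # word is the subject
    lambda t: t.pos_ == "NOUN",   # word is a noun
    lambda t: t.pos_ == "VERB",   # word is a verb
    lambda t: t.dep_ == "dobj",   # word is the direct object
    lambda t: t.dep_ == "ROOT",   # word is the syntactic root
]

def syntactic_vectors(sentence):
    """One 0/1 indicator vector per word, from the parse of the sentence."""
    return [np.array([float(f(tok)) for f in SYN_FEATURES])
            for tok in nlp(sentence)]
```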
In step S240, a matrix difference between the syntax and semantic interactive feature quantities of the first test sentence and the second test sentence in the test sentence pair is used as the syntax and semantic interactive feature quantity of the test sentence pair. Then, step S250 is performed.
In step S250, the syntax and semantic interaction feature quantities of the test sentence pair are input into the paraphrase text deep matching model to obtain semantic matching scores of the first test sentence and the second test sentence in the test sentence pair.
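Putting the pieces together, scoring one test pair (steps S220 to S250) might look as follows. This reuses the helper sketches above and assumes a `semantic_vectors` function that looks up word embeddings and a trained `ParaphraseMatcher` instance; both names are hypothetical.

```python
import torch

def match_score(model, semantic_vectors, sentence1, sentence2):
    """Semantic matching score of a test sentence pair (hypothetical glue code)."""
    T1 = interaction_feature(semantic_vectors(sentence1),
                             syntactic_vectors(sentence1))
    T2 = interaction_feature(semantic_vectors(sentence2),
                             syntactic_vectors(sentence2))
    A = torch.tensor(pair_feature(T1, T2), dtype=torch.float32)
    with torch.no_grad():
        return model(A.unsqueeze(0)).item()  # add batch dimension
```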
One preferred example of the present invention is described below.
As shown in FIG. 3, let sentence s_k be expressed as

s_k = ((w_1^{s_k}, e_1^{s_k}, g_1^{s_k}), \ldots, (w_i^{s_k}, e_i^{s_k}, g_i^{s_k}), \ldots)

where w_i^{s_k} denotes the i-th word of sentence s_k, e_i^{s_k} is the semantic feature vector of word w_i^{s_k}, and g_i^{s_k} is the syntactic feature vector of word w_i^{s_k} in sentence s_k. TDPIM-ISS computes the tensor product of the vectors e_i^{s_k} and g_i^{s_k} to obtain the interaction structure of the semantics and syntax of word w_i^{s_k}, namely:

T_i^{s_k} = e_i^{s_k} \otimes g_i^{s_k}    (1)

where T_i^{s_k} denotes the tensor product of the semantic feature vector e_i^{s_k} and the syntactic feature vector g_i^{s_k} corresponding to the i-th word w_i^{s_k} of sentence s_k.

Here e_i^{s_k}, the semantic feature vector of the word, is expressed using a word embedding. g_i^{s_k} is the syntactic feature vector of w_i^{s_k}; it captures the syntactic role the word plays in the sentence and is obtained by parsing sentence s_k. Let g_i^{s_k} = (g_1, \ldots, g_m, \ldots), where each g_m represents a fixed syntactic characteristic, e.g. whether the word is a subject or whether it is a noun; g_m = 1 means the word has the m-th syntactic characteristic, and g_m = 0 means it does not.

For each word in a sentence, equation (1) thus yields the word embedding representation of its interactive semantics and syntax (i.e., the tensor product), which is a two-dimensional matrix, as shown on the left side of FIG. 4. Representing sentence s_k by the interactive word embeddings of all its words then gives a three-dimensional matrix representation of the sentence, as shown on the right side of FIG. 4. This representation has three dimensions: one ranging over the words w of the sentence, one over the semantic vector e of each word, and one over the syntactic vector g of each word. In this tensor-based way, TDPIM-ISS captures the interaction between semantic features and syntactic features, describes the role the semantics of each word plays in the syntax, and constitutes a syntactic decomposition of the input sentence.
To obtain a sentence representation, the interactive semantic and syntactic word embeddings of all words are summed, mapping the three-dimensional representation into the two-dimensional space of the semantic and syntactic dimensions. The result, which this application calls the sentence representation of interactive semantics and syntax (i.e., the matrix sum described above), is denoted T^{(k)} for sentence s_k:

T^{(k)} = \sum_i e_i^{s_k} \otimes g_i^{s_k}    (2)

Further, for two given sentences s_k and s_p, their interaction is expressed as a matrix A_{k,p} (i.e., the matrix difference described above) using equation (3):

A_{k,p} = T^{(k)} - T^{(p)}    (3)
Paraphrase sentence matching based on the convolutional neural network performs the matching computation on this structure. The convolutional neural network model consists of three parts: a convolution layer (Convolution Layer), max pooling (Max Pooling), and classification based on a multi-layer perceptron (MLP).
The model trains its parameters by minimizing a cross-entropy loss, using two-norm regularization to mitigate overfitting. The loss function is defined as follows:

loss = -\sum_i [ y_i \log p_i + (1 - y_i) \log (1 - p_i) ] + \lambda ( \|W_1\|_2^2 + \|W_2\|_2^2 )    (4)

where y_i is the label of the i-th training example, p_i is the predicted matching score, \lambda is the regularization term coefficient, and W_1, W_2 are the parameters of the first and second layers of the MLP. In training, a standard back-propagation algorithm is adopted and the Adam optimizer is used to optimize the model; the parameters are updated as follows:

W_t = W_{t-1} - \eta \, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)

where t denotes the current time step, W_t is the weight parameter at the current time, W_{t-1} is the weight from the previous training round, \eta is the learning rate, \epsilon is a smoothing parameter, and \hat{m}_t and \hat{v}_t are variables that control the descent direction. The gradient is corrected via \beta as follows:

\hat{m}_t = m_t / (1 - \beta_1^t)
\hat{v}_t = v_t / (1 - \beta_2^t)
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2

where g_t is the gradient calculated at time t, defined as follows:

g_t = \nabla_W \, loss(W_{t-1})
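For concreteness, the Adam update above can be written out directly. This numpy sketch uses the common default hyperparameter values, which the patent does not specify.

```python
import numpy as np

def adam_step(W, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of weights W given gradient g at time step t."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    W = W - eta * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```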
as can be seen from the analysis, the matching of words is important information between paraphrase sentences. In order to utilize the information in the matching model, the TDPIM-ISS is fused with vocabulary characteristics to further improve the performance of the model.
The Jaccard coefficient and Precision, Recall, F1, Fmean, Penalty and METEOR Score based on the machine translation evaluation index METEOR were selected as two sentences s, r lexical features. In the stage of obtaining a matching result based on a multilayer perceptron by the TDPIM-ISS, the seven characteristics and effective characteristics learned by a convolutional neural network from expressions of two sentences are spliced together to learn a paraphrase recognition model. The definition of these features is shown in table 1:
TABLE 1. Lexical features fused by TDPIM-ISS: the Jaccard coefficient and the METEOR-based Precision, Recall, F1, Fmean, Penalty and METEOR Score.
Among these features, the Jaccard coefficient is widely applied in text similarity calculation to capture exactly matching vocabulary. The METEOR metric has also been used by researchers for paraphrase recognition, to capture information such as exact matches, stem matches, and WordNet-based synonyms. With these features, the comparison of text segments s and r is not limited to literal word comparison but can also resolve morphological variation of words and handle synonym substitution.
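Of the seven lexical features, the exact-match ones can be sketched directly; Fmean, Penalty and the full METEOR Score additionally depend on stem, synonym and alignment information that the table does not reproduce here, so this illustration, which matches word types after a naive whitespace split, covers only the recoverable part.

```python
def lexical_features(s, r):
    """Jaccard coefficient and METEOR-style unigram precision/recall/F1
    between text segments s and r (exact type-level matches only)."""
    ts, tr = set(s.split()), set(r.split())
    matches = len(ts & tr)
    precision = matches / len(ts) if ts else 0.0
    recall = matches / len(tr) if tr else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    jaccard = matches / len(ts | tr) if ts | tr else 0.0
    return [jaccard, precision, recall, f1]
```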
For paraphrase text matching, the invention proposes TDPIM-ISS, a deep paraphrase text matching model with interacting syntax and semantics. The model applies tensors to realize the interaction of syntax and semantics, obtains an enhanced sentence representation that fuses syntactic and semantic features, extracts effective features through a convolutional neural network, and finally performs pattern extraction in the paraphrase text matching space to complete the paraphrase matching task. Experiments on Microsoft's MSRP paraphrase corpus and on the plagiarism detection paraphrase corpora PAN 2010 and PAN 2012 show that the TDPIM-ISS model achieves better performance, with an especially large improvement on the PAN 2012 paraphrase data set, whose data contains heavier sentence pattern transformation and vocabulary substitution.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (7)

1. The paraphrase text depth matching model construction method is characterized by comprising the following steps of:
obtaining a training sample set, wherein the training sample set comprises a plurality of training sentence pairs, each training sentence pair comprises a first training sentence, a second training sentence and a semantic matching score between the first training sentence and the second training sentence, and the semantic matching score represents the semantic similarity degree between two corresponding sentences;
for each training sentence in a first training sentence and a second training sentence in each training sentence pair, obtaining a semantic feature vector and a syntactic feature vector of each word in the training sentence, and for each word in the training sentence, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the training sentence as a syntactic and semantic interaction feature quantity of the training sentence;
aiming at each training sentence pair, taking the matrix difference between the syntax and the semantic interaction characteristic quantity of a first training sentence and a second training sentence in the training sentence pair as the syntax and the semantic interaction characteristic quantity of the training sentence pair; and
and training each training sentence pair by utilizing a convolutional neural network model, wherein the syntax and semantic interaction characteristic quantity of each training sentence pair are used as model input, the semantic matching scores of the first training sentence and the second training sentence are used as model output, and the trained convolutional neural network model is used as the paraphrase text deep matching model.
2. The method of constructing an paraphrase text deep matching model of claim 1, wherein for each of the first training sentence and the second training sentence in each training sentence pair, a syntactic feature vector for each word in the training sentence is obtained by:
for each training sentence in the first training sentence and the second training sentence in each training sentence pair, obtaining a syntactic analysis result of the training sentence, so as to obtain a syntactic feature vector of each word in the training sentence according to the syntactic analysis result of the training sentence.
3. The paraphrase text depth matching model construction method of claim 1 or 2, wherein the training sample set comprises pairs of positive and negative examples of training sentences;
the semantic matching score of the first training sentence and the second training sentence in the positive example training sentence pair is 1, and the semantic matching score of the first training sentence and the second training sentence in the negative example training sentence pair is 0.
4. The paraphrase text depth matching model construction method of claim 1 or 2, wherein the training sample set comprises pairs of positive and negative examples of training sentences;
the semantic matching scores of the first training sentence and the second training sentence in the positive example training sentence pair are larger than a first threshold, and the semantic matching scores of the first training sentence and the second training sentence in the negative example training sentence pair are smaller than or equal to a second threshold.
5. The method of constructing paraphrase text deep matching model of claim 1 or 2, wherein the convolutional neural network model comprises convolutional layers, max pooling layers and multi-layer perceptrons.
6. A paraphrase text depth matching method, the paraphrase text depth matching method comprising:
pre-constructing an paraphrase text depth matching model by using the paraphrase text depth matching model construction method of any one of claims 1 to 4;
obtaining a test sentence pair, wherein the test sentence pair comprises a first test sentence and a second test sentence;
for each test statement in a first test statement and a second test statement in the test statement pair, acquiring a semantic feature vector and a syntactic feature vector of each word in the test statement, and for each word in the test statement, calculating a tensor product of the semantic feature vector and the syntactic feature vector of the word so as to determine a matrix sum of tensor products corresponding to each word in the test statement as a syntactic and semantic interaction feature quantity of the test statement;
taking the matrix difference between the syntax and the semantic interactive characteristic quantity of the first test statement and the second test statement in the test statement pair as the syntax and the semantic interactive characteristic quantity of the test statement pair;
and inputting the syntax and semantic interactive characteristic quantity of the test sentence pair into the paraphrase text deep matching model to obtain semantic matching scores of a first test sentence and a second test sentence in the test sentence pair.
7. The paraphrase text depth matching method of claim 6, wherein for each of the first and second test sentences in the pair of test sentences, a syntactic feature vector for each word in the test sentence is obtained by:
and aiming at each test statement in the first test statement and the second test statement in the test statement pair, obtaining a syntactic analysis result of the test statement so as to obtain a syntactic feature vector of each word in the test statement according to the syntactic analysis result of the test statement.
CN201810836453.6A 2018-07-26 2018-07-26 Paraphrase text depth matching model construction method and paraphrase text depth matching method Active CN109145292B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810836453.6A CN109145292B (en) 2018-07-26 2018-07-26 Paraphrase text depth matching model construction method and paraphrase text depth matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810836453.6A CN109145292B (en) 2018-07-26 2018-07-26 Paraphrase text depth matching model construction method and paraphrase text depth matching method

Publications (2)

Publication Number Publication Date
CN109145292A CN109145292A (en) 2019-01-04
CN109145292B (en) 2022-05-27

Family

ID=64797970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810836453.6A Active CN109145292B (en) 2018-07-26 2018-07-26 Paraphrase text depth matching model construction method and paraphrase text depth matching method

Country Status (1)

Country Link
CN (1) CN109145292B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019685B (en) * 2019-04-10 2021-08-20 鼎富智能科技有限公司 Deep text matching method and device based on sequencing learning
CN111241845B (en) * 2019-12-31 2024-01-16 上海犀语科技有限公司 Automatic financial subject identification method and device based on semantic matching method
CN111859090A (en) * 2020-03-18 2020-10-30 齐浩亮 Method for obtaining plagiarism source document based on local matching convolutional neural network model facing source retrieval
CN112328762B (en) * 2020-11-04 2023-12-19 平安科技(深圳)有限公司 Question-answer corpus generation method and device based on text generation model
CN112395426B (en) * 2020-11-16 2023-04-21 四川大学 Semantic matching model training method and system, retrieval system, device and medium
CN112633012B (en) * 2020-12-31 2024-02-02 浙大城市学院 Login word replacement method based on entity type matching
CN114417879B (en) * 2021-12-29 2022-12-27 北京百度网讯科技有限公司 Method and device for generating cross-language text semantic model and electronic equipment


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10509860B2 (en) * 2016-02-10 2019-12-17 Weber State University Research Foundation Electronic message information retrieval system
CN106649786B (en) * 2016-12-28 2020-04-07 北京百度网讯科技有限公司 Answer retrieval method and device based on deep question answering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997376A (en) * 2017-02-28 2017-08-01 浙江大学 The problem of one kind is based on multi-stage characteristics and answer sentence similarity calculating method
CN107330130A (en) * 2017-08-29 2017-11-07 北京易掌云峰科技有限公司 A kind of implementation method of dialogue robot to artificial customer service recommendation reply content

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences; Ming Che Lee et al.; Scientific World Journal; 2014-04-10; pp. 1-17 *
A syntactic and semantic similarity algorithm for Chinese short texts; Liao Zhifang et al.; Journal of Hunan University (Natural Sciences); February 2016; Vol. 43, No. 2; pp. 135-140 *
Research on plagiarism detection modeling based on statistical machine learning; Kong Leilei; China Doctoral Dissertations Full-text Database; 2019-01-15; No. 01; pp. 1-164 *
A survey on deep text matching; Pang Liang et al.; Chinese Journal of Computers; April 2017; Vol. 40, No. 4; pp. 985-1003 *

Also Published As

Publication number Publication date
CN109145292A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109145292B (en) Paraphrase text depth matching model construction method and paraphrase text depth matching method
CN112001185B (en) Emotion classification method combining Chinese syntax and graph convolution neural network
Arora et al. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis
CN112001187B (en) Emotion classification system based on Chinese syntax and graph convolution neural network
Dos Santos et al. Deep convolutional neural networks for sentiment analysis of short texts
CN112001186A (en) Emotion classification method using graph convolution neural network and Chinese syntax
Liu et al. Social network sentiment classification method combined Chinese text syntax with graph convolutional neural network
Othman et al. Learning english and arabic question similarity with siamese neural networks in community question answering services
CN110765769A (en) Entity attribute dependency emotion analysis method based on clause characteristics
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Ma et al. DM_NLP at SemEval-2018 Task 8: Neural sequence labeling with linguistic features
Li et al. Phrase embedding learning from internal and external information based on autoencoder
Zhu Machine reading comprehension: Algorithms and practice
Al-Thanyyan et al. Simplification of Arabic text: A hybrid approach integrating machine translation and transformer-based lexical model
Hughes Automatic inference of causal reasoning chains from student essays
Wang et al. Classification-based RNN machine translation using GRUs
CN111914084A (en) Deep learning-based emotion label text generation and evaluation system
She et al. Leveraging hierarchical deep semantics to classify implicit discourse relations via a mutual learning method
Kalita Detecting and extracting events from text documents
Alhijawi et al. Novel textual entailment technique for the Arabic language using genetic algorithm
Su et al. Automatic ontology population using deep learning for triple extraction
Hatanpää A generative pre-trained transformer model for Finnish
Weiss et al. Sense classification of shallow discourse relations with focused RNNs
Adewumi Vector representations of idioms in data-driven chatbots for robust assistance
Li et al. Improving word vector with prior knowledge in semantic dictionary

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant