WO2021164199A1

WO2021164199A1 - Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device

Info

Publication number: WO2021164199A1
Application number: PCT/CN2020/104723
Authority: WO
Inventors: 鹿文鹏; 王荣耀; 张旭; 贾瑞祥; 郭韦钰; 张维玉
Original assignee: 齐鲁工业大学
Priority date: 2020-02-20
Filing date: 2020-07-27
Publication date: 2021-08-26
Also published as: CN111310438B; CN111310438A

Abstract

Disclosed are a multi-granularity fusion model-based intelligent semantic Chinese sentence matching method and a device, pertaining to the field of artificial intelligence and the field of natural language processing. The present invention addresses the technical problems of non-comprehensive semantic analysis and inaccurate sentence matching of single granularity-based models. The method is specifically as follows: S1, constructing a text matching knowledge database; S2, constructing a training data set of a text matching model; S3, constructing a multi-granularity fusion model, which is specifically as follows: S301, constructing a character word mapping conversion table; S302, constructing an input layer; S303, constructing a multi-granularity embedding layer; S304, constructing a multi-granularity fusion encoding layer; S305, constructing an interaction matching layer, and S306, constructing a prediction layer; and S4, training the multi-granularity fusion model. The device comprises a text matching knowledge database construction unit, a training data set construction unit for a text matching model, a multi-granularity fusion model construction unit, and a multi-granularity fusion model training unit.

Description

基于多粒度融合模型的中文句子语义智能匹配方法及装置Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model

技术领域Technical field

本发明涉及人工智能领域和自然语言处理领域，具体地说是一种基于多粒度融合模型的中文句子语义智能匹配方法及装置。The invention relates to the field of artificial intelligence and natural language processing, in particular to a method and device for intelligently matching Chinese sentence semantics based on a multi-granularity fusion model.

背景技术Background technique

句子语义匹配在许多自然语言处理任务中扮演着关键角色，例如问答(QA)、自然语言推理(NLI)、机器翻译(MT)等。句子语义匹配的关键是计算给定句子对的语义之间的匹配程度。句子可以从不同的粒度上进行分割，例如字符、词语和短语等。当前，常用的文本分割粒度是词语，特别是在中文领域中更为普遍。Sentence semantic matching plays a key role in many natural language processing tasks, such as question answering (QA), natural language inference (NLI), machine translation (MT), and so on. The key to sentence semantic matching is to calculate the degree of matching between the semantics of a given sentence pair. Sentences can be segmented at different granularities, such as characters, words, and phrases. Currently, the commonly used text segmentation granularity is words, especially in the Chinese field.

目前，中文句子语义匹配模型多数是面向词语粒度的，而忽略了其它分割粒度。这些模型无法完全捕获嵌入在句子中的语义特征，有时甚至会产生噪音，这会影响句子匹配的准确性。目前，该领域的研究人员逐渐倾向于从句子的多种不同角度或粒度考虑语义匹配，比较成功的模型方法有MultiGranCNN、MV-LSTM、MPCM、BiMPM、DIIN等。尽管这些模型在一定程度上缓解了词语粒度上建模的局限性，但仍无法彻底解决句子语义的精准匹配问题，这在具有丰富语义特征的中文上表现更为突出。At present, most Chinese sentence semantic matching models are oriented to word granularity, while ignoring other segmentation granularities. These models cannot fully capture the semantic features embedded in the sentence, and sometimes even generate noise, which will affect the accuracy of sentence matching. At present, researchers in this field gradually tend to consider semantic matching from a variety of different perspectives or granularities of sentences. The more successful model methods include MultiGranCNN, MV-LSTM, MPCM, BiMPM, DIIN, etc. Although these models alleviate the limitations of word granularity modeling to a certain extent, they still cannot completely solve the problem of precise matching of sentence semantics, which is more prominent in Chinese with rich semantic features.

专利号为CN106569999A的专利文献公开了一种多粒度短文本语义相似度比较方法，其包括如下步骤：S1、对短文本进行预处理；所述预处理包括中文分词以及词性标注；S2、对经过预处理的短文本进行特征选择；S3、对经过特征选择的向量集进行距离测量以确定短文本的相似度。但是该技术方案无法彻底解决句子语义的精准匹配问题。The patent document with the patent number CN106569999A discloses a multi-granularity short text semantic similarity comparison method, which includes the following steps: S1, short text preprocessing; the preprocessing includes Chinese word segmentation and part-of-speech tagging; S2. Feature selection is performed on the preprocessed short text; S3, distance measurement is performed on the vector set after feature selection to determine the similarity of the short text. However, this technical solution cannot completely solve the problem of precise matching of sentence semantics.

发明内容Summary of the invention

本发明的技术任务是提供一种基于多粒度融合模型的中文句子语义智能匹配方法及装置，来解决单粒度模型语义分析不全面和句子匹配不精确的问题。The technical task of the present invention is to provide a Chinese sentence semantic intelligent matching method and device based on a multi-granularity fusion model to solve the problems of incomplete semantic analysis of a single-granularity model and imprecise sentence matching.

本发明的技术任务是按以下方式实现的，基于多粒度融合模型的中文句子语义智能匹配方法，该方法具体如下：The technical task of the present invention is achieved in the following manner. The intelligent matching method for Chinese sentence semantics based on a multi-granularity fusion model is specifically as follows:

S1、构建文本匹配知识库；S1. Build a text matching knowledge base;

S2、构建文本匹配模型的训练数据集：对于每一个句子，在文本匹配知识库中都会有一个与之对应的标准的语义匹配的句子，此句子可与其组合用来构建训练正例；其他不匹配的句子可自由组合用来构建训练负例；用户可根据文本匹配知识库大小来设定负例的数量，从而构建训练数据集；S2. Constructing the training data set of the text matching model: For each sentence, there will be a corresponding standard semantic matching sentence in the text matching knowledge base. This sentence can be combined with it to construct training examples; others are not Matched sentences can be freely combined to construct training negative examples; users can set the number of negative examples according to the size of the text matching knowledge base to construct a training data set;

S3、构建多粒度融合模型；具体如下：S3. Build a multi-granularity fusion model; the details are as follows:

S301、构建字符词语映射转换表；S301. Construct a character word mapping conversion table;

S302、构建输入层；S302. Construct an input layer;

S303、构建多粒度嵌入层：对句子中的词语和字符进行向量映射，得到词语级句子向量和字符级句子向量；S303. Construct a multi-granularity embedding layer: perform vector mapping on words and characters in the sentence to obtain word-level sentence vectors and character-level sentence vectors;

S304、构建多粒度融合编码层：对词语级句子向量和字符级句子向量进行编码处理，得到句子语义特征向量；S304. Construct a multi-granularity fusion coding layer: perform coding processing on word-level sentence vectors and character-level sentence vectors to obtain sentence semantic feature vectors;

S305、构建交互匹配层：对句子语义特征向量进行分层比较，得到句子对的匹配表征向量；S305. Construct an interactive matching layer: perform hierarchical comparison of sentence semantic feature vectors to obtain matching representation vectors of sentence pairs;

S306、构建预测层：经预测层的Sigmoid函数处理，判断句子对的语义匹配程度；S306. Construct a prediction layer: After the Sigmoid function of the prediction layer is processed, the degree of semantic matching of sentence pairs is judged;

S4、训练多粒度融合模型。S4. Train a multi-granularity fusion model.

作为优选，所述步骤S1中构建文本匹配知识库具体如下：Preferably, the construction of the text matching knowledge base in step S1 is specifically as follows:

S101、使用爬虫获取原始数据：在互联网公共问答平台爬取问题集，得到原始相似句子知识库；或者使用网上公开的句子匹配数据集，作为原始相似句子知识库；S101. Use crawlers to obtain original data: Crawl the question set on the Internet public question and answer platform to obtain the original similar sentence knowledge base; or use the sentence matching data set disclosed on the Internet as the original similar sentence knowledge base;

S102、预处理原始数据：预处理原始相似句子知识库中的相似文本，对每个句子进行分词和断字处理，得到文本匹配知识库；其中，分词处理是以中文里的每个词语作为基本单位，对每条数据进行分词操作；断字处理是以中文里的每个字作为基本单位，对每条数据进行断字操作；每个汉字和词语之间用空格进行切分，并保留每条数据中包括的数字、标点以及特殊字符在内的所有内容；S102. Preprocess the original data: preprocess the similar text in the original similar sentence knowledge base, and perform word segmentation and hyphenation processing on each sentence to obtain a text matching knowledge base; wherein, the word segmentation processing is based on each word in Chinese Unit, perform word segmentation operations on each piece of data; hyphenation processing is based on each character in Chinese as the basic unit, and each piece of data is hyphenated; each Chinese character and word is divided by a space, and each word is reserved. All contents including numbers, punctuation and special characters in the data;

所述步骤S2中构建文本匹配模型的训练数据集具体如下：The training data set for constructing the text matching model in the step S2 is specifically as follows:

S201、构建训练正例：将句子与其对应的语义匹配的句子进行组合，构建训练正例，形式化为：(Q1-char,Q1-word,Q2-char,Q2-word,1)；S201. Construct a training positive example: Combine a sentence with its corresponding semantically matched sentence to construct a training positive example, which is formalized as: (Q1-char, Q1-word, Q2-char, Q2-word, 1);

其中，Q1-char表示字符级粒度的句子1；Q1-word表示词语级粒度的句子1；Q2-char表示字符级粒度的句子2；Q2-word表示词语级粒度的句子2；1表示句子1和句子2这两个文本相匹配，是正例；Among them, Q1-char represents sentence 1 at character level granularity; Q1-word represents sentence 1 at word level granularity; Q2-char represents sentence 2 at character level granularity; Q2-word represents sentence 2 at word level granularity; 1 represents sentence 1 It matches the two texts of sentence 2, which is a positive example;

S202、构建训练负例：选中一个句子Q1，再从文本匹配知识库中随机选择一个与句子Q1不匹配的句子Q2，将Q1与Q2进行组合，构建负例，形式化为：(Q1-char,Q1-word,Q2-char,Q2-word,0)；S202. Construct a training negative example: select a sentence Q1, then randomly select a sentence Q2 that does not match the sentence Q1 from the text matching knowledge base, combine Q1 and Q2 to construct a negative example, which is formalized as: (Q1-char ,Q1-word,Q2-char,Q2-word,0);

其中，Q1-char表示字符级粒度的句子1；Q1-word表示词语级粒度的句子1；Q2-char表示字符级粒度的句子2；Q2-word表示词语级粒度的句子2；0表示句子Q1和句子Q2这两个文本不匹配，是负例；Among them, Q1-char represents sentence 1 at character level granularity; Q1-word represents sentence 1 at word level granularity; Q2-char represents sentence 2 at character level granularity; Q2-word represents sentence 2 at word level granularity; 0 represents sentence Q1 It does not match the two texts of sentence Q2, which is a negative example;

S203、构建训练数据集：将经过步骤S201和步骤S202操作后所获得的全部的正例样本和负例样本进行组合，并打乱其顺序，构建最终的训练数据集；其中，无论是正例数据还是负例数据均包含五个维度，即Q1-char、Q1-word、Q2-char、Q2-word、0或1。S203. Construct a training data set: Combine all the positive samples and negative samples obtained after the operations of step S201 and step S202, and disrupt their order to construct a final training data set; among them, whether it is positive data Still negative data includes five dimensions, namely Q1-char, Q1-word, Q2-char, Q2-word, 0 or 1.

更优地，所述步骤S301中构建字符词语映射转换表具体如下：More preferably, the construction of the character word mapping conversion table in step S301 is specifically as follows:

S30101、字符词语表通过预处理后得到的文本匹配知识库来构建；S30101. The character vocabulary table is constructed by the text matching knowledge base obtained after preprocessing;

S30102、字符词语表构建完成后，表中每个字符和词语均被映射为唯一的数字标识，映射规则为：以数字1为起始，随后按照每个字符、词语被录入字符词语表的顺序依次递增排序，从而形成字符词语映射转换表；S30102. After the construction of the character vocabulary table is completed, each character and word in the table is mapped to a unique digital identifier. The mapping rule is: start with the number 1, and then follow the order in which each character and word is entered into the character vocabulary table Sort in ascending order to form a character-word mapping conversion table;

S30103、使用Word2Vec训练字符词语向量模型，得到字符词语向量矩阵权重embedding_matrix；S30103. Use Word2Vec to train the character word vector model to obtain the weight embedding_matrix of the character word vector matrix;

所述步骤S302中构建输入层具体如下：The construction of the input layer in the step S302 is specifically as follows:

S30201、输入层包括四个输入，对两个待匹配的句子进行预处理分别获取Q1-char、Q1-word、Q2-char、Q2-word，将其形式化为：(Q1-char,Q1-word,Q2-char,Q2-word)；S30201. The input layer includes four inputs. The two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word respectively, which are formalized as: (Q1-char, Q1- word,Q2-char,Q2-word);

S30202、对于输入句子中的每个字符和词语均按照步骤S301中构建完成的字符词语映射转换表将其转化为相应的数字标识。S30202: For each character and word in the input sentence, it is converted into a corresponding digital identifier according to the character-word mapping conversion table constructed in step S301.

更优地，所述步骤S303中构建多粒度嵌入层具体如下：More preferably, the construction of the multi-granularity embedding layer in step S303 is specifically as follows:

S30301、通过加载步骤S301中训练所得的字符词语向量矩阵权重来初始化当前层的权重参数；S30301: Initialize the weight parameters of the current layer by loading the weights of the character word vector matrix obtained through training in step S301;

S30302、针对输入句子Q1和Q2，经过多粒度嵌入层处理后得到其词语级句子向量和字符级句子向量Q1-word Emd、Q1-char Emd、Q2-word Emd、Q2-char Emd；其中，文本匹配知识库中每一个句子均能通过字符词语向量映射的方式，将文本信息转化为向量形式；S30302. For the input sentences Q1 and Q2, the word-level sentence vector and character-level sentence vector Q1-word Emd, Q1-char Emd, Q2-word Emd, Q2-char Emd are obtained after multi-granular embedding layer processing; among them, text Each sentence in the matching knowledge base can transform text information into vector form through character word vector mapping;

所述步骤S304中构建多粒度融合编码层是将步骤S303中多粒度嵌入层输出的词语级句子向量和字符级句子向量作为输入，从两个角度获取文本语义特征，即字符级别语义特征提取和词语级别语义特征提取；再通过按位相加的形式，对两个角度的文本语义特征进行整合，得到最终的句子语义特征向量；对于句子Q1求取最终的句子语义特征向量具体如下：The construction of the multi-granularity fusion coding layer in step S304 takes the word-level sentence vector and character-level sentence vector output by the multi-granularity embedding layer in step S303 as input, and obtains text semantic features from two perspectives, namely, character-level semantic feature extraction and Word-level semantic feature extraction; then through the form of bitwise addition, the text semantic features from the two perspectives are integrated to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:

S30401、针对字符级别语义特征提取，具体如下：S30401. For character-level semantic feature extraction, the details are as follows:

S3040101、使用LSTM进行特征提取，得到特征向量

公式如下： S3040101, use LSTM for feature extraction to obtain feature vectors

The formula is as follows:

S3040102、针对

进一步采用两种不同的方法进行编码，具体如下： S3040102, for

Further use two different methods for encoding, as follows:

①、对

继续使用LSTM进行二次特征提取，得到相应特征向量

公式如下： ①, yes

Continue to use LSTM for secondary feature extraction to obtain the corresponding feature vector

The formula is as follows:

②、对

使用注意力机制Attention提取特征，得到相应特征向量

公式如下： ②, yes

Use the attention mechanism Attention to extract features and get the corresponding feature vector

The formula is as follows:

S3040103、针对

使用Attention再次进行编码提取关键特征，得到特征向量

公式如下： S3040103, for

Use Attention to code again to extract key features to obtain feature vectors

The formula is as follows:

S3040104、将

与

按位相加得到字符级别的语义特征

公式如下： S3040104, will

and

Bitwise addition to obtain character-level semantic features

The formula is as follows:

其中，i表示相应字符向量在句子中的相对位置，Q _i为句子Q1中每个字符的相应向量表示；Q′ _i为经过初次LSTM编码后每个字符的相应向量表示；Q″ _i为经过第二次LSTM编码后每个字符的相应向量表示； Among them, i represents the relative position of the corresponding character vector in the sentence, and Q _i is the corresponding vector representation of each character in the sentence Q1; Q′ _i is the corresponding vector representation of each character after the initial LSTM encoding; Q″ _i is the corresponding vector representation of each character after the initial LSTM encoding. The corresponding vector representation of each character after the second LSTM encoding;

S30402、针对词语级别语义特征提取，具体如下：S30402. For word-level semantic feature extraction, the details are as follows:

S3040201、使用LSTM进行特征提取，得到特征向量

公式如下： S3040201 Use LSTM for feature extraction to obtain feature vectors

The formula is as follows:

S3040202、针对

进一步采用LSTM进行二次特征提取，得到相应特征向量

公式如下： S3040202, for

Further use LSTM for secondary feature extraction to obtain the corresponding feature vector

The formula is as follows:

S3040203、针对

使用Attention再次进行编码提取关键特征，得到词语级别特征向量

公式如下： S3040203, for

Use Attention to code again to extract key features to obtain word-level feature vectors

The formula is as follows:

其中，i'表示相应词语向量在句子中的相对位置；Q _i′为句子Q1中每个词语的相应向量表示；Q′ _i′为经过初次LSTM编码后每个词语的相应向量表示；Q″ _i′为经过第二次LSTM编码后每个词语的相应向量表示； Wherein, i 'represents the relative position of the corresponding word in the sentence vector; Q _i' expressed as Q1 respective sentence vectors of each word; Q _'i' expressed as a respective vector for each of words after the initial coding LSTM; Q " _i′ is the corresponding vector representation of each word after the second LSTM encoding;

S30403、经过步骤S30401和步骤S30402得到相应字符级别的特征向量

以及词语级别的特征向量

将

和

按位相加，得到针对文本Q1的最终句子语义特征向量

公式如下： S30403: After step S30401 and step S30402, the feature vector of the corresponding character level is obtained

And word-level feature vectors

will

with

Add bitwise to get the final sentence semantic feature vector for text Q1

The formula is as follows:

对于句子Q2求取最终的句子语义特征向量

的方法，同步骤S30401到步骤S30403。 For sentence Q2, obtain the final sentence semantic feature vector

The method is the same as step S30401 to step S30403.

更优地，所述步骤S305构建交互匹配层具体如下：More preferably, the construction of the interactive matching layer in step S305 is specifically as follows:

S30501、经过步骤S304处理得到Q1、Q2的句子语义特征向量

和

针对

和

进行减法、叉乘以及点乘三种操作，得到

公式如下： S30501: After processing in step S304, the sentence semantic feature vectors of Q1 and Q2 are obtained

with

against

with

Perform three operations of subtraction, cross product, and dot product to get

The formula is as follows:

其中，点乘：也叫数量积，结果是一个向量在另一个向量方向上投影的长度，是一个标量；叉乘：也叫向量积，结果是一个和已有两个向量都垂直的向量；Among them, dot product: also called the quantified product, the result is the length of a vector projected in the direction of another vector, which is a scalar; cross product: also called the vector product, the result is a vector that is perpendicular to the two existing vectors;

同时，使用一个全连接层Dense进一步编码得到

和

公式如下： At the same time, use a fully connected layer Dense to further encode

with

The formula is as follows:

其中，i表示相应语义特征在句子中的相对位置；Q1 _i为文本Q1经过步骤S304特征提取得到的

中每个语义特征的相应向量表示；Q2 _i为文本Q2经过步骤S304特征提取得到的

中每个语义特征的相应向量表示；

为针对句子语义特征向量

和

使用Dense进一步提取，得到的特征向量；

表示编码维度为300； Among them, i represents the relative position of the corresponding semantic feature in the sentence; Q1 _i is the text Q1 obtained by feature extraction in step S304

Each respective feature vector is the semantic representation; _I Q2 Q2 through the text feature extraction step S304 obtained

The corresponding vector representation of each semantic feature in;

For sentence semantic feature vector

with

Use Dense to further extract the feature vector obtained;

Indicates that the coding dimension is 300;

S30502、将

和

联接起来得到

公式如下： S30502, will

with

Connect to get

The formula is as follows:

同时，

和

同样进行减法、叉乘操作，公式如下： at the same time,

with

The same subtraction and cross multiplication operations are performed, and the formula is as follows:

再将二者结果联接得到

公式如下： Then connect the two results to get

The formula is as follows:

S30503、将

使用两层全连接层进行特征提取得到

并将

与

进行求和，得到

公式如下： S30503, will

Use two fully connected layers for feature extraction to get

And will

and

To sum and get

The formula is as follows:

S30504、将

经过一层全连接层编码后的结果与步骤S30501中

求和，得到句子对的匹配表征向量

公式如下： S30504, will

The result after a layer of fully connected layer encoding is the same as in step S30501

Sum, get the matching representation vector of the sentence pair

The formula is as follows:

所述步骤S306中构建预测层具体如下：The construction of the prediction layer in step S306 is specifically as follows:

S30601、预测层接收步骤S305输出的匹配表征向量，使用Sigmoid函数进行计算，得到处于[0,1]之间的匹配度表示y _pred； S30601: The prediction layer receives the matching representation vector output in step S305, and uses the Sigmoid function for calculation to obtain a matching degree representation y _pred between [0, 1];

S30602、将y _pred与设立的阈值进行比较来判别句子对的匹配程度，具体如下： S30602. Compare y _pred with the established threshold to determine the matching degree of the sentence pair, which is specifically as follows:

①、当y _pred≥0.5时，表示句子Q1以及句子Q2相匹配； ①. When y _pred ≥0.5, it means that sentence Q1 and sentence Q2 match;

②、当y _pred<0.5时，表示句子Q1以及句子Q2不匹配。 ②. When y _pred <0.5, it means that sentence Q1 and sentence Q2 do not match.

作为优选，所述步骤S4中训练多粒度融合模型具体如下：Preferably, the training of the multi-granularity fusion model in step S4 is specifically as follows:

S401、构建损失函数：通过将均方误差(MSE)设置为交叉熵的平衡因子，设计出平衡交叉熵，其中，均方误差的公式如下：S401. Construct a loss function: Design a balanced cross-entropy by setting the mean square error (MSE) as the cross-entropy balance factor, where the formula of the mean square error is as follows:

其中，y _true表示真实标签，即每条训练样例中表示匹配与否的0、1标志；y _pred表示预测结果； Among them, y _true represents the true label, that is, the 0 and 1 flags that indicate match or not in each training example; y _pred represents the prediction result;

当分类边界模糊时，平衡交叉熵的使用能够自动平衡正负样本，并提高分类的准确性；其将交叉熵与均方误差融合，公式如下：When the classification boundary is blurred, the use of balanced cross entropy can automatically balance the positive and negative samples and improve the accuracy of classification; it fuses the cross entropy with the mean square error, and the formula is as follows:

S402、优化训练模型：选择使用RMSprop优化函数作为本模型的优化函数，超参数均选择Keras中的默认值设置。S402. Optimize the training model: choose to use the RMSprop optimization function as the optimization function of this model, and the hyperparameters are all set to default values in Keras.

一种基于多粒度融合模型的中文句子语义智能匹配装置，该装置包括，A Chinese sentence semantic intelligent matching device based on a multi-granularity fusion model, the device comprising:

文本匹配知识库构建单元，用于使用爬虫程序，在互联网公共问答平台爬取问题集，或者使用网上公开的文本匹配数据集，作为原始相似句子知识库，再对原始相似句子知识库进行预处理，主要操作为对原始相似句子知识库中的每个句子进行断字处理和分词处理，从而构建用于模型训练的文本匹配知识库；The text matching knowledge base building unit is used to use crawlers to crawl question sets on the Internet public question and answer platform, or use text matching data sets published on the Internet as the original similar sentence knowledge base, and then preprocess the original similar sentence knowledge base , The main operation is to perform hyphenation and word segmentation on each sentence in the original similar sentence knowledge base, so as to construct a text matching knowledge base for model training;

训练数据集生成单元，用于根据文本匹配知识库中的句子来构建训练正例数据和训练负例数据，并且基于正例数据与负例数据来构建最终的训练数据集；The training data set generation unit is used to construct training positive example data and training negative example data according to the sentences in the text matching knowledge base, and construct the final training data set based on the positive example data and the negative example data;

多粒度融合模型构建单元，用于构建字符词语映射转换表，并同时构建输入层、多粒度嵌入层、多粒度融合编码层、交互匹配层、预测层；其中，多粒度融合模型构建单元包括，The multi-granularity fusion model building unit is used to construct the character word mapping conversion table, and to construct the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer at the same time; among them, the multi-granularity fusion model building unit includes:

字符词语映射转换表构建子单元，用于对文本匹配知识库中的每个句子按字符和词语进行切分，并将每个字符和词语依次存入一个列表中，从而得到一个字符词语表，随后以数字1为起始，按照每个字符和词语被录入字符词语表的顺序依次递增排序，从而形成本发明所需的字符词语映射转换表；字符词语映射转换表构建完成后，表中每个字符和词语均被映射为唯一的数字标识；其后，使用Word2Vec训练字符词语向量模型，得到字符词语向量矩阵权重；The character word mapping conversion table constructs a subunit, which is used to segment each sentence in the text matching knowledge base according to characters and words, and store each character and word in a list in turn to obtain a character word table. Subsequently, starting with the number 1, the characters and words are entered in the character word table in ascending order, thereby forming the character word mapping conversion table required by the present invention; after the character word mapping conversion table is constructed, each character in the table Characters and words are mapped to unique digital identifiers; after that, use Word2Vec to train the character word vector model to obtain the weight of the character word vector matrix;

输入层构建子单元，用于根据字符词语映射转换表，将输入句子中的每个字符和词语转化为相应的数字标识，从而完成数据的输入，具体来说就是分别获取q1与q2，将其形式化为：(q1-char,q1-word,q2-char,q2-word)；The input layer constructs sub-units, which are used to convert each character and word in the input sentence into a corresponding digital identifier according to the character-word mapping conversion table, thereby completing the data input, specifically to obtain q1 and q2 respectively, and convert them Formalized as: (q1-char,q1-word,q2-char,q2-word);

多粒度嵌入层构建子单元，用于加载预训练好的字符词语向量权重，将输入句子中的字符词语转换为字符词语向量形式，进而构成完整的句子向量表示；该操作根据字符词语的数字标识查找字符词语向量矩阵而完成；The multi-granular embedding layer constructs sub-units, which are used to load pre-trained character word vector weights, convert the character words in the input sentence into the character word vector form, and then form a complete sentence vector representation; this operation is based on the digital identification of the character words Completed by looking up the character word vector matrix;

多粒度融合编码层构建子单元，用于将多粒度嵌入层输出的词语级句子向量和字符级句子向量作为输入；先从两个角度来获取文本语义特征，即字符级别语义特征提取和词语级别语义特征提取；再通过按位相加的形式，对两个角度的文本语义特征进行整合，得到最终的句子语义特征向量；The multi-granularity fusion coding layer constructs sub-units, which are used to take the word-level sentence vector and character-level sentence vector output by the multi-granularity embedding layer as input; first obtain text semantic features from two perspectives, namely, character-level semantic feature extraction and word-level Semantic feature extraction; then through the form of bitwise addition, the semantic features of the text from the two perspectives are integrated to obtain the final sentence semantic feature vector;

交互匹配层构建子单元，用于将输入的两个句子语义特征向量，经过分层匹配计算，得到句子对的匹配表征向量；The interactive matching layer constructs sub-units, which are used to perform hierarchical matching calculations on the semantic feature vectors of the two input sentences to obtain the matching representation vectors of the sentence pairs;

预测层构建子单元，用于接收交互匹配层输出的匹配表征向量，使用Sigmoid函数进行计算，得到处于[0,1]之间的匹配度，最终通过与设立的阈值进行比较来判别句子对的匹配程度；The prediction layer constructs a sub-unit for receiving the matching representation vector output by the interactive matching layer, using the Sigmoid function to calculate, and obtaining the matching degree between [0,1], and finally judging the sentence pair by comparing with the established threshold Matching degree;

多粒度融合模型训练单元，用于构建模型训练过程中所需要的损失函数，并完成模型的优化训练。The multi-granularity fusion model training unit is used to construct the loss function needed in the model training process and complete the optimization training of the model.

作为优选，所述文本匹配知识库构建单元包括，Preferably, the text matching knowledge base building unit includes:

爬取原始数据子单元，用于在互联网公共问答平台爬取问题集，或者使用网上公开的文本匹配数据集，构建原始相似句子知识库；Crawling the original data subunit, used to crawl the question set on the Internet public question and answer platform, or use the text matching data set published on the Internet to build the original similar sentence knowledge base;

原始数据处理子单元，用于将原始相似句子知识库中的句子进行断字处理和分词处理，从而构建用于模型训练的文本匹配知识库；The original data processing subunit is used to hyphenate and segment the sentences in the original similar sentence knowledge base, thereby constructing the text matching knowledge base for model training;

所述训练数据集生成单元包括，The training data set generating unit includes:

训练正例数据构建子单元，用于将文本匹配知识库中语义匹配的句子进行组合，并对其添加匹配标签1，构建为训练正例数据；The training positive example data construction subunit is used to combine semantically matched sentences in the text matching knowledge base, and add matching label 1 to it to construct the training positive example data;

训练负例数据构建子单元，用于先从文本匹配知识库中选取一个句子q ₁，再从文本匹配知识库中随机选择一个与句子q ₁语义不匹配的句子q ₂，将q ₁与q ₂进行组合，并对其添加匹配标签0，构建为训练负例数据； The training negative example data constructs a subunit, which is used to first select a sentence q ₁ from the text matching knowledge base, and then randomly select a _{sentence q 2} that does not match the sentence q ₁ semantically from the text matching knowledge base, and compare q ₁ with q ₂ Combine and add a matching label 0 to it, and construct it as training negative example data;

训练数据集构建子单元，用于将所有的训练正例数据与训练负例数据组合在一起，并打乱其顺序，从而构建最终的训练数据集；The training data set construction subunit is used to combine all the training positive example data and the training negative example data, and disrupt the order to construct the final training data set;

所述多粒度融合模型训练单元包括，The multi-granularity fusion model training unit includes:

损失函数构建子单元，用于构建损失函数，计算句子1和句子2间文本匹配度的误差；The loss function construction subunit is used to construct the loss function and calculate the error of the text matching degree between sentence 1 and sentence 2;

模型优化训练子单元，用于训练并调整模型训练中的参数，从而减小模型训练过程中预测的句子1与句子2间匹配度与真实匹配度之间的误差。The model optimization training subunit is used to train and adjust the parameters in the model training, thereby reducing the error between the predicted matching degree between sentence 1 and sentence 2 and the true matching degree in the model training process.

一种存储介质，其中存储有多条指令，所述指令由处理器加载，执行上述的基于多粒度融合模型的中文句子语义智能匹配方法的步骤。A storage medium stores a plurality of instructions, and the instructions are loaded by a processor to execute the steps of the above-mentioned intelligent matching method for Chinese sentence semantics based on a multi-granularity fusion model.

一种电子设备，所述电子设备包括：An electronic device, which includes:

上述的存储介质；以及The aforementioned storage medium; and

处理器，用于执行所述存储介质中的指令。The processor is configured to execute instructions in the storage medium.

本发明的基于多粒度融合模型的中文句子语义智能匹配方法及装置具有以下优点：The Chinese sentence semantic intelligent matching method and device based on the multi-granularity fusion model of the present invention have the following advantages:

(一)本发明将词语向量和字符向量整合在一起，从字符和词语两个粒度上，有效地提取中文句子的语义信息，进而提升中文句子编码的准确性；(1) The present invention integrates word vectors and character vectors, and effectively extracts the semantic information of Chinese sentences from the two granularities of characters and words, thereby improving the accuracy of Chinese sentence coding;

(二)对于中文句子从字符和词语两个粒度建模，句子的语义特征分别从字符和词语的粒度获得，句子中关键的语义信息可以从两个粒度上分别提取并强化，可极大地改善句子关键语义信息的表征；(2) For the modeling of Chinese sentences from the two granularities of characters and words, the semantic features of sentences are obtained from the granularities of characters and words respectively. The key semantic information in the sentence can be extracted and strengthened from the two granularities, which can be greatly improved. Representation of key semantic information of sentences;

(三)在工程实践任务中，本发明能够精确地实现中文语句匹配的任务；(3) In engineering practice tasks, the present invention can accurately realize the task of matching Chinese sentences;

(四)本发明使用均方误差(MSE)作为平衡因子来改善交叉熵损失函数，从而设计出平衡交叉熵损失函数；该损失函数可解决过度拟合问题，并且在训练过程中将分类边界进行模糊化处理；同时，它能够缓解正负样本之间的类别不平衡问题；(4) The present invention uses mean square error (MSE) as a balance factor to improve the cross-entropy loss function, thereby designing a balanced cross-entropy loss function; the loss function can solve the problem of overfitting, and the classification boundary is processed during the training process. Fuzzy processing; at the same time, it can alleviate the problem of category imbalance between positive and negative samples;

(五)对于输入句子，多粒度融合模型使用不同的编码方法来生成字符级句子向量和词语级句子向量；针对词语级句子向量，用两个LSTM网络进行顺序编码，然后使用注意力机制进行深度特征提取；对于字符级句子向量，除了使用与词语级句子向量相同的处理方法以外，补充了一层LSTM网络和注意力机制进行编码；词语级句子向量和字符级句子向量的编码最终均被叠加在一起，作为句子的多粒度融合编码表示，可以使句子的编码表示更加精确和全面；(5) For the input sentence, the multi-granularity fusion model uses different encoding methods to generate character-level sentence vectors and word-level sentence vectors; for word-level sentence vectors, two LSTM networks are used for sequential encoding, and then the attention mechanism is used for depth Feature extraction; for character-level sentence vectors, in addition to using the same processing method as word-level sentence vectors, a layer of LSTM network and attention mechanism are added for encoding; the encoding of word-level sentence vectors and character-level sentence vectors are finally superimposed Together, as a multi-granularity fusion coding representation of a sentence, it can make the coding representation of a sentence more accurate and comprehensive;

(六)本发明使用均方误差(MSE)作为平衡因子来改善交叉熵损失函数，在公开数据集(LCQMC)上所做的大量实验，可以证明本发明优于现有方法；(6) The present invention uses mean square error (MSE) as a balance factor to improve the cross-entropy loss function. A large number of experiments done on the public data set (LCQMC) can prove that the present invention is superior to existing methods;

(七)本发明实现了多粒度融合模型，该模型同时考虑中文词语级粒度和字符级粒度，通过集成多粒度编码以更好地捕获语义特征。(7) The present invention realizes a multi-granularity fusion model, which considers both Chinese word-level granularity and character-level granularity, and integrates multi-granularity coding to better capture semantic features.

附图说明Description of the drawings

下面结合附图对本发明进一步说明。The present invention will be further described below in conjunction with the accompanying drawings.

附图1为基于多粒度融合模型的中文句子语义智能匹配方法的流程框图；Figure 1 is a flow chart of a Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model;

附图2为构建文本匹配知识库的流程框图；Figure 2 is a block diagram of the process of constructing a text matching knowledge base;

附图3为构建文本匹配模型的训练数据集的流程框图；Figure 3 is a block diagram of the process of constructing the training data set of the text matching model;

附图4为构建多粒度融合模型的流程框图；Figure 4 is a block diagram of the process of constructing a multi-granularity fusion model;

附图5为训练多粒度融合模型的流程框图；Figure 5 is a block diagram of the process of training a multi-granularity fusion model;

附图6为多粒度融合模型的示意图；Figure 6 is a schematic diagram of a multi-granularity fusion model;

附图7为多粒度嵌入层的示意图；Figure 7 is a schematic diagram of a multi-granularity embedding layer;

附图8为多粒度融合编码层的示意图；Fig. 8 is a schematic diagram of a multi-granularity fusion coding layer;

附图9为交互匹配层的示意图；Fig. 9 is a schematic diagram of an interactive matching layer;

附图10为基于多粒度融合模型的中文句子语义智能匹配的装置的结构框图。Fig. 10 is a block diagram of a device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model.

具体实施方式Detailed ways

参照说明书附图和具体实施例对本发明的基于多粒度融合模型的中文句子语义智能匹配方法及装置作以下详细地说明。The method and device for intelligent matching of Chinese sentence semantics based on the multi-granularity fusion model of the present invention will be described in detail below with reference to the drawings and specific embodiments of the specification.

实施例1：Example 1:

如附图1所示，本发明的基于多粒度融合模型的中文句子语义智能匹配方法,该方法具体如下：As shown in FIG. 1, the intelligent matching method for Chinese sentence semantics based on the multi-granularity fusion model of the present invention is specifically as follows:

S1、构建文本匹配知识库；如附图2所示，具体如下：S1. Construct a text matching knowledge base; as shown in Figure 2, the details are as follows:

互联网上的公共问答平台中有着大量的问答数据及相似问题的推荐，这些都是面向大众开放的。因此，可以根据问答平台的特点，设计相应的爬虫程序，以此来获取语义相似的文本句子集合，从而构建原始相似句子知识库。Public Q&A platforms on the Internet have a large amount of Q&A data and recommendations for similar questions, which are open to the public. Therefore, according to the characteristics of the question and answer platform, a corresponding crawler can be designed to obtain a collection of semantically similar text sentences, thereby constructing a knowledge base of original similar sentences.

举例：银行问答平台中的相似文本示例，如下表所示：Example: An example of similar text in the bank's Q&A platform, as shown in the following table:

句子1 Sentence 1	还款期限可以延后一天吗？Can the repayment period be extended by one day?
句子2 Sentence 2	是否可以申请延期一天还款？Can I apply for a one-day extension of repayment?

或者，使用网上公开的文本匹配数据集，作为原始知识库。比如LCQMC数据集【Liu,X.,Chen,Q.,Deng,C.,Zeng,H.,Chen,J.,Li,D.,Tang,B.:LCQMC:A large-scale Chinese question matching corpus.In:Proceedings of the 27th International Conference on Computational Linguistics.pp.1952-1962(2018)】，该数据集一共有260068对标注结果，分为三部分：238766训练集、8802验证集和12500测试集，是一种专门用于文本匹配任务的中文数据集。Or, use the text matching data set publicly available on the Internet as the original knowledge base. For example, the LCQMC data set [Liu,X.,Chen,Q.,Deng,C.,Zeng,H.,Chen,J.,Li,D.,Tang,B.:LCQMC:A large-scale Chinese question matching corpus .In:Proceedings of the 27th International Conference on Computational Linguistics.pp.1952-1962(2018)], this data set has a total of 260,068 pairs of annotation results, divided into three parts: 238766 training set, 8802 verification set and 12500 test set. It is a Chinese data set specially used for text matching tasks.

S102、预处理原始数据：预处理原始相似句子知识库中的相似文本，对每个句子进行分词和断字处理，得到文本匹配知识库；S102. Preprocess the original data: preprocess the similar texts in the original similar sentence knowledge base, and perform word segmentation and hyphenation processing on each sentence to obtain a text matching knowledge base;

对步骤S101中获得的相似文本进行预处理，得到文本匹配知识库。在步骤S102中，为了避免语义信息的丢失，本发明保留了句子中的所有停用词。The similar text obtained in step S101 is preprocessed to obtain a text matching knowledge base. In step S102, in order to avoid the loss of semantic information, the present invention reserves all stop words in the sentence.

其中，分词处理是以中文里的每个词语作为基本单位，对每条数据进行分词操作；举例，以步骤S101中展示的句子2“是否可以申请延期一天还款？”为例，对其进行分词处理后得到“是否可以申请延期一天还款？”。本发明将分词处理后的句子，记为词语级粒度的句子。Among them, the word segmentation processing takes each word in Chinese as the basic unit, and performs word segmentation operations on each piece of data; for example, take the sentence 2 "Can you apply for a one-day extension of repayment?" shown in step S101 as an example. After the word segmentation is processed, you get "Can you apply for a one-day extension of repayment?". The present invention records the sentence after word segmentation processing as a sentence with word level granularity.

断字处理是以中文里的每个字作为基本单位，对每条数据进行断字操作；每个汉字之间用空格进行切分，并保留每条数据中包括的数字、标点以及特殊字符在内的所有内容；举例：以步骤S101中展示的句子2“是否可以申请延期一天还款？”为例，对其进行断字处理后得到“是否可以申请延期一天还款？”。本发明将断字处理后的句子，记为字符级粒度的句子。Hyphenation processing takes each Chinese character as the basic unit, and performs hyphenation operations on each piece of data; each Chinese character is divided by spaces, and the numbers, punctuation and special characters included in each piece of data are kept in For example, take the sentence 2 "Can you apply for a one-day repayment extension?" shown in step S101 as an example, after hyphenating it, you get "Can you apply for a one-day extension of repayment?". The present invention records the hyphenated sentence as a sentence with character level granularity.

S2、构建文本匹配模型的训练数据集：对于每一个句子，在文本匹配知识库中都会有一个与之对应的标准的语义匹配的句子，此句子可与其组合用来构建训练正例；其他不匹配的句子可自由组合用来构建训练负例；用户可根据文本匹配知识库大小来设定负例的数量，从而构建训练数据集；如附图3所示，具体如下：S2. Constructing the training data set of the text matching model: For each sentence, there will be a corresponding standard semantic matching sentence in the text matching knowledge base. This sentence can be combined with it to construct training examples; others are not Matched sentences can be freely combined to construct training negative examples; users can set the number of negative examples according to the size of the text matching knowledge base to construct a training data set; as shown in Figure 3, the details are as follows:

其中，Q1-char表示字符级粒度的句子1；Q1-word表示词语级粒度的句子 1；Q2-char表示字符级粒度的句子2；Q2-word表示词语级粒度的句子2；1表示句子1和句子2这两个文本相匹配，是正例；Among them, Q1-char represents sentence 1 at character level granularity; Q1-word represents sentence 1 at word level granularity; Q2-char represents sentence 2 at character level granularity; Q2-word represents sentence 2 at word level granularity; 1 represents sentence 1 It matches the two texts of sentence 2, which is a positive example;

举例：对步骤S101中展示的句子1和句子2，经过步骤S102的预处理后，构建的正例为：Example: For Sentence 1 and Sentence 2 shown in step S101, after preprocessing in step S102, the constructed positive example is:

(“还款期限可以延后一天吗？”,“还款期限可以延后一天吗？”,“是否可以申请延期一天还款？”,“是否可以申请延期一天还款？”,1)。("Can the repayment period be extended by one day?", "Can the repayment period be extended by one day?", "Can I apply for an extension of one day's repayment?", "Can I apply for an extension of one day's repayment?", 1).

举例：根据步骤S201中的所展示的示例数据，本发明仍然使用原问句作为Q1，再从文本匹配知识库中随机选择一个与句子Q1语义不匹配的句子Q2，将Q1与Q2进行组合，经过步骤S102的预处理后，构建的负例为：Example: According to the example data shown in step S201, the present invention still uses the original question as Q1, and then randomly selects a sentence Q2 that does not match the sentence Q1 semantically from the text matching knowledge base, and combines Q1 and Q2, After the preprocessing in step S102, the constructed negative example is:

(“还款期限可以延后一天吗？”,“还款期限可以延后一天吗？”,“为什么银行客户端登陆出现网络错误？”,“为什么银行客户端登陆出现网络错误？”,0)。("Can the repayment period be extended by one day?","Can the repayment period be extended by one day?","Why is there a network error in the bank client login?", "Why is there a network error in the bank client login?",0 ).

S3、构建多粒度融合模型：如附图6所示，本发明的核心为多粒度融合模型，主要可分为四个部分：多粒度嵌入层、多粒度融合编码层、交互匹配层、预测层；首先构建多粒度嵌入层，对句子中的词语和字符进行向量映射，得到词语级句子向量和字符级句子向量；再构建多粒度融合编码层，对词语级句子向量和字符级句子向量进行编码处理，得到句子语义特征向量；再构建交互匹配层，对句子语义特征向量进行分层比较，得到句子对的匹配表征向量；最后，经预测层的Sigmoid函数处理，判断句子对的语义匹配程度。如附图4所示，具体如下：S3. Construct a multi-granularity fusion model: as shown in Figure 6, the core of the present invention is a multi-granularity fusion model, which can be divided into four parts: multi-granularity embedding layer, multi-granularity fusion coding layer, interactive matching layer, prediction layer ; First, build a multi-granularity embedding layer to perform vector mapping on words and characters in the sentence to obtain word-level sentence vectors and character-level sentence vectors; then build a multi-granularity fusion coding layer to encode word-level sentence vectors and character-level sentence vectors After processing, the semantic feature vector of the sentence is obtained; then the interactive matching layer is constructed, and the semantic feature vector of the sentence is hierarchically compared to obtain the matching representation vector of the sentence pair; finally, the Sigmoid function of the prediction layer is processed to determine the semantic matching degree of the sentence pair. As shown in Figure 4, the details are as follows:

S301、构建字符词语映射转换表；具体如下：S301. Construct a character word mapping conversion table; the details are as follows:

S30102、字符词语表构建完成后，表中每个字符、词语均被映射为唯一的数字标识，映射规则为：以数字1为起始，随后按照每个字符、词语被录入字符词语表的顺序依次递增排序，从而形成字符词语映射转换表；S30102. After the construction of the character vocabulary table is completed, each character and word in the table is mapped to a unique digital identifier. The mapping rule is: start with the number 1, and then follow the order in which each character or word is entered into the character vocabulary table Sort in ascending order to form a character-word mapping conversion table;

举例：以步骤S102处理后的内容，“还款期限可以延后一天吗？”、“还款期限可以延后一天吗？”为例，对其构建字符词语表及字符词语映射转换表如下：Example: Taking the content processed in step S102, "Can the repayment period be extended by one day?" and "Can the repayment period be extended by one day?" as an example, construct a character word table and a character word mapping conversion table for it as follows:

S30103、其后，使用Word2Vec训练字符词语向量模型，得到字符词语向量矩阵权重embedding_matrix；S30103. Then, use Word2Vec to train the character word vector model to obtain the character word vector matrix weight embedding_matrix;

举例说明：在Keras中，对于上面描述的代码实现如下所示：For example: In Keras, the implementation of the code described above is as follows:

w2v_model＝genism.models.Word2Vec(w2v_corpus,size＝embedding_dim,w2v_model=genism.models.Word2Vec(w2v_corpus,size=embedding_dim,

window＝5,min_count＝1,sg＝1,window=5, min_count=1, sg=1,

workers＝4,seed＝1234,iter＝25)workers = 4, seed = 1234, iter = 25)

embedding_matrix＝numpy.zeros([len(tokenizer.word_index)+1,embedding_dim])embedding_matrix=numpy.zeros([len(tokenizer.word_index)+1,embedding_dim])

tokenizer＝keras.preprocessing.text.Tokenizer(num_words＝len(word_set))tokenizer=keras.preprocessing.text.Tokenizer(num_words=len(word_set))

for word,idx in tokenizer.word_index.items():for word,idx in tokenizer.word_index.items():

embedding_matrix[idx,:]＝w2v_model.wv[word]embedding_matrix[idx,:]=w2v_model.wv[word]

其中，w2v_corpus为训练语料，即文本匹配知识库中的所有数据；embedding_dim为字符词语向量维度，在本发明中embedding_dim设置为300，word_set为字词表。Among them, w2v_corpus is the training corpus, that is, all data in the text matching knowledge base; embedding_dim is the dimension of the character word vector, embedding_dim is set to 300 in the present invention, and word_set is the word list.

S302、构建输入层；具体如下：S302. Construct an input layer; the details are as follows:

举例说明：本发明使用步骤S201中展示的正例文本作为样例，以此组成一条输入数据。其结果如下所示：Illustrative example: The present invention uses the positive example text displayed in step S201 as an example to form a piece of input data. The result is as follows:

(“还款期限可以延后一天吗？”，“还款期限可以延后一天吗？”，“是否可以申请延期一天还款？”，“是否可以申请延期一天还款？”)("Can the repayment period be extended by one day?", "Can the repayment period be extended by one day?", "Can I apply for an extension of one day's repayment?", "Can I apply for an extension of one day's repayment?")

根据字符词语表中的映射将上述的输入数据转换为数值表示(假定出现在句子2中但没有出现在句子1中的字符和词语的映射分别为“是”：18，“否”：19，“申”：20，“请”：21，“是否”：22，“申请”：23，“延期”：24)，结果如下：According to the mapping in the character vocabulary table, the above input data is converted into a numerical representation (assuming that the mappings of characters and words that appear in sentence 2 but not in sentence 1 are "Yes": 18, "No": 19, "Apply": 20, "Please": 21, "Whether": 22, "Apply": 23, "Extension": 24), the results are as follows:

(“1，2，3，4，5，6，7，8，9，10，11，12”,“13，14，15，16，17，11，12”,“18，19，5，6，20，21，7，3，9，10，1，2，12”,“22，15，23，24，17，13，12”)；("1,2,3,4,5,6,7,8,9,10,11,12","13,14,15,16,17,11,12","18,19,5, 6,20,21,7,3,9,10,1,2,12","22,15,23,24,17,13,12");

S303、构建多粒度嵌入层：对句子中的词语和字符进行向量映射，得到词语级句子向量和字符级句子向量；如附图7所示，具体如下：S303. Construct a multi-granular embedding layer: perform vector mapping on words and characters in the sentence to obtain word-level sentence vectors and character-level sentence vectors; as shown in FIG. 7, the details are as follows:

S30302、针对输入句子Q1和Q2，经过多粒度嵌入层处理后得到其词语级句子向量和字符级句子向量Q1-word Emd、Q1-char Emd、Q2-word Emd、Q2-char Emd；其中，文本匹配知识库中每一个句子均能通过字符词语向量映射的方式，将文本信息转化为向量形式；本发明中设置embedding_dim为300。S30302. For the input sentences Q1 and Q2, the word-level sentence vector and character-level sentence vector Q1-word Emd, Q1-char Emd, Q2-word Emd, Q2-char Emd are obtained after multi-granular embedding layer processing; among them, text Each sentence in the matching knowledge base can transform text information into a vector form by means of character word vector mapping; embedding_dim is set to 300 in the present invention.

其中，embedding_matrix是步骤S301中训练所得的字符词语向量矩阵权重，embedding_matrix.shape[0]是字符词语向量矩阵的字词表(词典)的大小，embedding_dim是输出的字符词语向量的维度，input_length是输入序列的长度。Among them, embedding_matrix is the weight of the character word vector matrix trained in step S301, embedding_matrix.shape[0] is the size of the word table (dictionary) of the character word vector matrix, embedding_dim is the dimension of the output character word vector, and input_length is the input The length of the sequence.

相应的文本Q1和Q2，经过多粒度嵌入层处理后得到词语级句子向量和字符级句子向量Q1-word Emd、Q1-char Emd、Q2-word Emd、Q2-char Emd。The corresponding texts Q1 and Q2 are processed by the multi-granular embedding layer to obtain word-level sentence vectors and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, Q2-char Emd.

S304、构建多粒度融合编码层：如附图8所示，对词语级句子向量和字符级句子向量进行编码处理，得到句子语义特征向量；步骤S304中构建多粒度融合编码层是将步骤S303中多粒度嵌入层输出的词语级句子向量和字符级句子向量作为输入，从两个角度获取文本语义特征，即字符级别语义特征提取和词语级别语义特征提取；再通过按位相加的形式，对两个角度的文本语义特征进行整合，得到最终的句子语义特征向量；对于句子Q1求取最终的句子语义特征向量具体如下：S304. Construct a multi-granularity fusion coding layer: as shown in FIG. 8, the word-level sentence vector and character-level sentence vector are coded to obtain the sentence semantic feature vector; the construction of the multi-granularity fusion coding layer in step S304 is the step S303 The word-level sentence vector and character-level sentence vector output by the multi-granular embedding layer are used as input to obtain text semantic features from two perspectives, namely, character-level semantic feature extraction and word-level semantic feature extraction; The text semantic features from the two perspectives are integrated to obtain the final sentence semantic feature vector; for the sentence Q1, the final sentence semantic feature vector is obtained as follows:

S3040101、使用LSTM进行特征提取，得到特征向量

The formula is as follows:

S3040102、针对

Further use two different methods for encoding, as follows:

①、对

继续使用LSTM进行二次特征提取，得到相应特征向量

公式如下： ①, yes

The formula is as follows:

②、对

使用注意力机制Attention提取特征，得到相应特征向量

公式如下： ②, yes

The formula is as follows:

S3040103、针对

使用Attention再次进行编码提取关键特征，得到特征向量

公式如下： S3040103, for

Use Attention to code again to extract key features to obtain feature vectors

The formula is as follows:

S3040104、将

与

按位相加得到字符级别的语义特征

公式如下： S3040104, will

and

Bitwise addition to obtain character-level semantic features

The formula is as follows:

S3040201、使用LSTM进行特征提取，得到特征向量

The formula is as follows:

S3040202、针对

进一步采用LSTM进行二次特征提取，得到相应特征向量

公式如下： S3040202, for

The formula is as follows:

S3040203、针对

公式如下： S3040203, for

The formula is as follows:

以及词语级别的特征向量

在多粒度融合编码层中，本发明的编码维度统一设置为300，本发明将

和

按位相加，得到针对文本Q1的最终句子语义特征向量

And word-level feature vectors

In the multi-granularity fusion coding layer, the coding dimension of the present invention is uniformly set to 300, and the present invention sets

with

Add bitwise to get the final sentence semantic feature vector for text Q1

The formula is as follows:

对于句子Q2求取最终的句子语义特征向量

The method is the same as step S30401 to step S30403.

S305、构建交互匹配层：对句子语义特征向量进行分层比较，得到句子对的匹配表征向量；如附图9所示，具体如下：S305. Construct an interactive matching layer: perform hierarchical comparison of sentence semantic feature vectors to obtain matching characterization vectors of sentence pairs; as shown in FIG. 9, the details are as follows:

S30501、经过步骤S304处理得到Q1、Q2的句子语义特征向量

和

针对

和

进行减法、叉乘以及点乘三种操作，得到

with

against

with

Perform three operations of subtraction, cross product, and dot product to get

The formula is as follows:

同时，使用一个全连接层Dense进一步编码得到

和

with

The formula is as follows:

中每个语义特征的相应向量表示；

为针对句子语义特征向量

和

使用Dense进一步提取，得到的特征向量；

The corresponding vector representation of each semantic feature in;

For sentence semantic feature vector

with

Use Dense to further extract the feature vector obtained;

Indicates that the coding dimension is 300;

S30502、将

和

联接起来得到

公式如下： S30502, will

with

Connect to get

The formula is as follows:

同时，

和

同样进行减法、叉乘操作，公式如下： at the same time,

with

再将二者结果联接得到

公式如下： Then connect the two results to get

The formula is as follows:

S30503、将

使用两层全连接层进行特征提取得到

并将

与

进行求和，得到

公式如下： S30503, will

Use two fully connected layers for feature extraction to get

And will

and

To sum and get

The formula is as follows:

S30504、将

经过一层全连接层编码后的结果与步骤S30501中

求和，得到句子对的匹配表征向量

公式如下： S30504, will

Sum, get the matching representation vector of the sentence pair

The formula is as follows:

S306、构建预测层：经预测层的Sigmoid函数处理，判断句子对的语义匹配程度；具体如下：S306. Construct a prediction layer: After processing the Sigmoid function of the prediction layer, determine the degree of semantic matching of sentence pairs; the details are as follows:

S4、训练多粒度融合模型；如附图5所示，具体如下：S4. Train a multi-granularity fusion model; as shown in Figure 5, the details are as follows:

本发明设计了交叉熵损失函数来防止过拟合问题。在大多数现有的深度学***衡参数，以平衡正样本和负样本，从而大大提高了模型的性能。The present invention designs a cross-entropy loss function to prevent over-fitting problems. In most existing deep learning applications, cross entropy is a common loss function for training models. However, the method based on maximum likelihood estimation will generate input noise. This method may divide the training sample into 0 or 1, leading to the problem of overfitting. Moreover, according to the investigation, relatively little work has been done in designing new loss functions. The present invention proposes to use mean square error (MSE) as a balance parameter to balance positive samples and negative samples, thereby greatly improving the performance of the model.

在大多数分类任务中，交叉熵损失函数如下所示，并且这种形式通常是第一选择。In most classification tasks, the cross-entropy loss function is shown below, and this form is usually the first choice.

S402、优化训练模型：选择使用RMSprop优化函数作为本模型的优化函数，超参数均选择Keras中的默认值设置。本模型在训练数据集上进行优化训练。S402. Optimize the training model: choose to use the RMSprop optimization function as the optimization function of this model, and the hyperparameters are all set to default values in Keras. This model is optimized and trained on the training data set.

举例说明：上面描述的优化函数及其设置在Keras中使用代码表示为：For example: the optimization function and its settings described above are expressed in Keras as:

optim＝keras.optimizers.RMSprop()optim=keras.optimizers.RMSprop()

model＝keras.models.Model([Q1-char,Q1-word,Q2-char,Q2-word],[y _pred]) model=keras.models.Model([Q1-char,Q1-word,Q2-char,Q2-word],[y _pred ])

model.compile(loss＝L _loss,optimizer＝optim,metrics＝['accuracy',precision,recall,f1_score])； model.compile(loss=L _loss ,optimizer=optim,metrics=['accuracy',precision,recall,f1_score]);

其中，损失函数loss选择本步骤S401中自定义Loss；优化算法optimizer选择前文定义好的optim；Q1-char,Q1-word,Q2-char,Q2-word作为模型输入，y _pred为模型输出；评价指标metrics，本发明选取准确率accuracy,精确率precision,召回率recall,基于召回率和精确率计算的F ₁-score。 Among them, the loss function loss selects the custom Loss in this step S401; the optimization algorithm optimizer selects the previously defined optim; Q1-char, Q1-word, Q2-char, and Q2-word are the model inputs, and y _pred is the model output; evaluation Indicator metrics, the present invention selects accuracy, precision, recall, F ₁ -score calculated based on recall and precision.

本发明的模型在LCQMC公开数据集上取得了优于当前模型的结果，实验结果的对比具体见下表：The model of the present invention has achieved better results than the current model on the LCQMC public data set. The comparison of the experimental results is shown in the following table:

其中，前十四行是现有技术的模型的实验结果【Liu,X.,Chen,Q.,Deng,C.,Zeng,H.,Chen,J.,Li,D.,Tang,B.,2018.Lcqmc:A large-scale chinese question matching corpus,in:Proceedings of the 27th International Conference on Computational Linguistics,pp.1952–1962】。本发明模型和现有模型进行了比较，可见本发明方法较其他方法其性能最优。Among them, the first fourteen lines are the experimental results of the prior art model [Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B. ,2018.Lcqmc:A large-scale chinese question matching corpus,in:Proceedings of the 27th International Conference on Computational Linguistics,pp.1952–1962]. Comparing the model of the present invention with the existing model, it can be seen that the method of the present invention has the best performance compared with other methods.

实施例2：Example 2:

如附图10所示，本发明的基于多粒度融合模型的中文句子语义智能匹配装置，该装置包括，As shown in FIG. 10, the intelligent matching device for Chinese sentence semantics based on the multi-granularity fusion model of the present invention includes:

文本匹配知识库构建单元，用于使用爬虫程序，在互联网公共问答平台爬取问题集，或者使用网上公开的文本匹配数据集，作为原始相似句子知识库，再对原始相似句子知识库进行预处理，主要操作为对原始相似句子知识库中的每个句子进行断字处理和分词处理，从而构建用于模型训练的文本匹配知识库；文本匹配知识库构建单元包括，The text matching knowledge base building unit is used to use crawlers to crawl question sets on the Internet public question and answer platform, or use the text matching data set published on the Internet as the original similar sentence knowledge base, and then preprocess the original similar sentence knowledge base , The main operation is to perform hyphenation and word segmentation on each sentence in the original similar sentence knowledge base, thereby constructing a text matching knowledge base for model training; the text matching knowledge base building unit includes:

训练数据集生成单元，用于根据文本匹配知识库中的句子来构建训练正例数据和训练负例数据，并且基于正例数据与负例数据来构建最终的训练数据集；训练数据集生成单元包括，The training data set generation unit is used to construct training positive example data and training negative example data according to the sentences in the text matching knowledge base, and build the final training data set based on the positive and negative example data; training data set generation unit include,

多粒度融合模型训练单元，用于构建模型训练过程中所需要的损失函数，并完成模型的优化训练；多粒度融合模型训练单元包括，The multi-granular fusion model training unit is used to construct the loss function needed in the model training process and complete the optimization training of the model; the multi-granular fusion model training unit includes:

可以将附图10所示的基于多粒度融合模型的中文句子语义智能匹配的装置集成部署到各种硬件设备中，例如：个人电脑、工作站、智能移动设备等。The device for intelligent semantic matching of Chinese sentences based on the multi-granularity fusion model shown in FIG. 10 can be integrated and deployed in various hardware devices, such as personal computers, workstations, and smart mobile devices.

实施例3：Example 3:

基于实施例1的存储介质，其中存储有多条指令，指令由处理器加载，执行实施例1的基于多粒度融合模型的中文句子语义智能匹配方法的步骤。Based on the storage medium of Embodiment 1, a plurality of instructions are stored therein, and the instructions are loaded by the processor to execute the steps of the method for intelligent semantic matching of Chinese sentences based on the multi-granularity fusion model of Embodiment 1.

实施例4：Example 4:

基于实施例3的电子设备，电子设备包括：Based on the electronic device of Embodiment 3, the electronic device includes:

实施例3的存储介质；以及The storage medium of Embodiment 3; and

处理器，用于执行存储介质中的指令。The processor is used to execute instructions in the storage medium.

最后应说明的是：以上各实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述各实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述各实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明各实施例技术方案的范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions recorded in the foregoing embodiments can still be modified, or some or all of the technical features can be equivalently replaced; and these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the technical solutions of the embodiments of the present invention. Scope.

Claims

基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，该方法具体如下：A Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model is characterized in that the method is specifically as follows:

S1、构建文本匹配知识库；S1. Build a text matching knowledge base;

S2、构建文本匹配模型的训练数据集；S2, construct a training data set of the text matching model;

S3、构建多粒度融合模型；具体如下：S3. Build a multi-granularity fusion model; the details are as follows:

S301、构建字符词语映射转换表；S301. Construct a character word mapping conversion table;

S302、构建输入层；S302. Construct an input layer;

S303、构建多粒度嵌入层：对句子中的词语和字符进行向量映射，得到词语级句子向量和字符级句子向量；S303. Construct a multi-granularity embedding layer: perform vector mapping on words and characters in the sentence to obtain word-level sentence vectors and character-level sentence vectors;

S304、构建多粒度融合编码层：对词语级句子向量和字符级句子向量进行编码处理，得到句子语义特征向量；S304. Construct a multi-granularity fusion coding layer: perform coding processing on word-level sentence vectors and character-level sentence vectors to obtain sentence semantic feature vectors;

S305、构建交互匹配层：对句子语义特征向量进行分层比较，得到句子对的匹配表征向量；S305. Construct an interactive matching layer: perform hierarchical comparison of sentence semantic feature vectors to obtain matching representation vectors of sentence pairs;

S306、构建预测层：经预测层的Sigmoid函数处理，判断句子对的语义匹配程度；S306. Construct a prediction layer: After the Sigmoid function of the prediction layer is processed, the degree of semantic matching of sentence pairs is judged;

S4、训练多粒度融合模型。S4. Train a multi-granularity fusion model.
根据权利要求1所述的基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，所述步骤S1中构建文本匹配知识库具体如下：The Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model according to claim 1, wherein the construction of a text matching knowledge base in step S1 is specifically as follows:

S101、使用爬虫获取原始数据：在互联网公共问答平台爬取问题集，得到原始相似句子知识库；或者使用网上公开的句子匹配数据集，作为原始相似句子知识库；S101. Use crawlers to obtain original data: Crawl the question set on the Internet public question and answer platform to obtain the original similar sentence knowledge base; or use the sentence matching data set disclosed on the Internet as the original similar sentence knowledge base;

S102、预处理原始数据：预处理原始相似句子知识库中的相似文本，对每个句子进行分词和断字处理，得到文本匹配知识库；其中，分词处理是以中文里的每个词语作为基本单位，对每条数据进行分词操作；断字处理是以中文里的每个字作为基本单位，对每条数据进行断字操作；每个汉字或词语之间用空格进行切分，并保留每条数据中包括的数字、标点以及特殊字符在内的所有内容；S102. Preprocess the original data: preprocess the similar text in the original similar sentence knowledge base, and perform word segmentation and hyphenation processing on each sentence to obtain a text matching knowledge base; wherein, the word segmentation processing is based on each word in Chinese Unit, perform word segmentation operations on each piece of data; hyphenation processing uses each character in Chinese as the basic unit to perform hyphenation operations on each piece of data; each Chinese character or word is segmented with a space, and each word is reserved. All contents including numbers, punctuation and special characters in the data;

所述步骤S2中构建文本匹配模型的训练数据集具体如下：The training data set for constructing the text matching model in the step S2 is specifically as follows:

S201、构建训练正例：将句子与其对应的语义匹配的句子进行组合，构建训练正例，形式化为：(Q1-char,Q1-word,Q2-char,Q2-word,1)；S201. Construct a training positive example: Combine a sentence with its corresponding semantically matched sentence to construct a training positive example, which is formalized as: (Q1-char, Q1-word, Q2-char, Q2-word, 1);

其中，Q1-char表示字符级粒度的句子1；Q1-word表示词语级粒度的句子1；Q2-char表示字符级粒度的句子2；Q2-word表示词语级粒度的句子2；1表示句子1和句子2这两个文本相匹配，是正例；Among them, Q1-char represents sentence 1 at character level granularity; Q1-word represents sentence 1 at word level granularity; Q2-char represents sentence 2 at character level granularity; Q2-word represents sentence 2 at word level granularity; 1 represents sentence 1 It matches the two texts of sentence 2, which is a positive example;

S202、构建训练负例：选中一个句子Q1，再从文本匹配知识库中随机选择一个与句子Q1不匹配的句子Q2，将Q1与Q2进行组合，构建负例，形式化为：(Q1-char,Q1-word,Q2-char,Q2-word,0)；S202. Construct a training negative example: select a sentence Q1, then randomly select a sentence Q2 that does not match the sentence Q1 from the text matching knowledge base, combine Q1 and Q2 to construct a negative example, which is formalized as: (Q1-char ,Q1-word,Q2-char,Q2-word,0);

其中，Q1-char表示字符级粒度的句子1；Q1-word表示词语级粒度的句子1；Q2-char表示字符级粒度的句子2；Q2-word表示词语级粒度的句子2；0表示句子Q1和句子Q2这两个文本不匹配，是负例；Among them, Q1-char represents sentence 1 at character level granularity; Q1-word represents sentence 1 at word level granularity; Q2-char represents sentence 2 at character level granularity; Q2-word represents sentence 2 at word level granularity; 0 represents sentence Q1 It does not match the two texts of sentence Q2, which is a negative example;

S203、构建训练数据集：将经过步骤S201和步骤S202操作后所获得的全部的正例样本和负例样本进行组合，并打乱其顺序，构建最终的训练数据集；其中，无论是正例数据还是负例数据均包含五个维度，即Q1-char、Q1-word、Q2-char、Q2-word、0或1。S203. Construct a training data set: Combine all the positive samples and negative samples obtained after the operations of step S201 and step S202, and disrupt their order to construct a final training data set; among them, whether it is positive data Still negative data includes five dimensions, namely Q1-char, Q1-word, Q2-char, Q2-word, 0 or 1.
根据权利要求1或2所述的基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，所述步骤S301中构建字符词语映射转换表具体如下：The Chinese sentence semantic intelligent matching method based on the multi-granularity fusion model according to claim 1 or 2, wherein the construction of the character word mapping conversion table in step S301 is specifically as follows:

S30101、字符词语表通过预处理后得到的文本匹配知识库来构建；S30101. The character vocabulary table is constructed by the text matching knowledge base obtained after preprocessing;

S30102、字符词语表构建完成后，表中每个字符、词语均被映射为唯一的数字标识，映射规则为：以数字1为起始，随后按照每个字符、词语被录入字符词语表的顺序依次递增排序，从而形成字符词语映射转换表；S30102. After the construction of the character vocabulary table is completed, each character and word in the table is mapped to a unique digital identifier. The mapping rule is: start with the number 1, and then follow the order in which each character or word is entered into the character vocabulary table Sort in ascending order to form a character-word mapping conversion table;

S30103、使用Word2Vec训练字符词语向量模型，得到字符词语向量矩阵权重embedding_matrix；S30103. Use Word2Vec to train the character word vector model to obtain the weight embedding_matrix of the character word vector matrix;

所述步骤S302中构建输入层具体如下：The construction of the input layer in the step S302 is specifically as follows:

S30201、输入层包括四个输入，对两个待匹配的句子进行预处理分别获取Q1-char、Q1-word、Q2-char、Q2-word，将其形式化为：(Q1-char,Q1-word,Q2-char,Q2-word)；S30201. The input layer includes four inputs. The two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word respectively, which are formalized as: (Q1-char, Q1- word,Q2-char,Q2-word);

S30202、对于输入句子中的每个字符和词语均按照步骤S301中构建完成的字符词语映射转换表将其转化为相应的数字标识。S30202: For each character and word in the input sentence, it is converted into a corresponding digital identifier according to the character-word mapping conversion table constructed in step S301.
根据权利要求3所述的基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，所述步骤S303中构建多粒度嵌入层具体如下：The Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model according to claim 3, wherein the construction of the multi-granularity embedding layer in step S303 is specifically as follows:

S30301、通过加载步骤S301中训练所得的字符词语向量矩阵权重来初始化当前层的权重参数；S30301: Initialize the weight parameters of the current layer by loading the weights of the character word vector matrix obtained through training in step S301;

S30302、针对输入句子Q1和Q2，经过多粒度嵌入层处理后得到其词语级句子向量和字符级句子向量Q1-word Emd、Q1-char Emd、Q2-word Emd、Q2-char Emd；其中，文本匹配知识库中每一个句子均能通过字符词语向量映射的方式，将文本信息转化为向量形式；S30302. For the input sentences Q1 and Q2, the word-level sentence vector and character-level sentence vector Q1-word Emd, Q1-char Emd, Q2-word Emd, Q2-char Emd are obtained after multi-granular embedding layer processing; among them, text Each sentence in the matching knowledge base can transform text information into vector form through character word vector mapping;

所述步骤S304中构建多粒度融合编码层是将步骤S303中多粒度嵌入层输出的词语级句子向量和字符级句子向量作为输入，从两个角度获取文本语义特征，即字符级别语义特征提取和词语级别语义特征提取；再通过按位相加的形式，对两个角度的文本语义特征进行整合，得到最终的句子语义特征向量；对于句子Q1求取最终的句子语义特征向量具体如下：The construction of the multi-granularity fusion coding layer in step S304 takes the word-level sentence vector and character-level sentence vector output by the multi-granularity embedding layer in step S303 as input, and obtains text semantic features from two perspectives, namely, character-level semantic feature extraction and Word-level semantic feature extraction; then through the form of bitwise addition, the text semantic features from the two perspectives are integrated to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:

S30401、针对字符级别语义特征提取，具体如下：S30401. For character-level semantic feature extraction, the details are as follows:

S3040101、使用LSTM进行特征提取，得到特征向量
公式如下： S3040101, use LSTM for feature extraction to obtain feature vectors
The formula is as follows:

S3040102、针对
进一步采用两种不同的方法进行编码，具体如下： S3040102, for
Further use two different methods for encoding, as follows:

①、对
继续使用LSTM进行二次特征提取，得到相应特征向量
公式如下： ①, yes
Continue to use LSTM for secondary feature extraction to obtain the corresponding feature vector
The formula is as follows:

②、对
使用注意力机制Attention提取特征，得到相应特征向量
公式如下： ②, yes
Use the attention mechanism Attention to extract features and get the corresponding feature vector
The formula is as follows:

S3040103、针对
使用Attention再次进行编码提取关键特征，得到特征向量
公式如下： S3040103, for
Use Attention to code again to extract key features to obtain feature vectors
The formula is as follows:

S3040104、将
与
按位相加得到字符级别的语义特征
公式如下： S3040104, will
and
Bitwise addition to obtain character-level semantic features
The formula is as follows:

其中，i表示相应字符向量在句子中的相对位置，Q _i为句子Q1中每个字符的相应向量表示；Q′ _i为经过初次LSTM编码后每个字符的相应向量表示；Q″ _i为经过第二次LSTM编码后每个字符的相应向量表示； Among them, i represents the relative position of the corresponding character vector in the sentence, and Q _i is the corresponding vector representation of each character in the sentence Q1; Q′ _i is the corresponding vector representation of each character after the initial LSTM encoding; Q″ _i is the corresponding vector representation of each character after the initial LSTM encoding. The corresponding vector representation of each character after the second LSTM encoding;

S30402、针对词语级别语义特征提取，具体如下：S30402. For word-level semantic feature extraction, the details are as follows:

S3040201、使用LSTM进行特征提取，得到特征向量
公式如下： S3040201 Use LSTM for feature extraction to obtain feature vectors
The formula is as follows:

S3040202、针对
进一步采用LSTM进行二次特征提取，得到相应特征向量
公式如下： S3040202, for
Further use LSTM for secondary feature extraction to obtain the corresponding feature vector
The formula is as follows:

S3040203、针对
使用Attention再次进行编码提取关键特征，得到词语级别特征向量
公式如下： S3040203, for
Use Attention to code again to extract key features to obtain word-level feature vectors
The formula is as follows:

其中，i′表示相应词语向量在句子中的相对位置；Q _i′为句子Q1中每个词语的相应向量表示；Q′ _i′为经过初次LSTM编码后每个词语的相应向量表示；Q″ _i′为经过第二次LSTM编码后每个词语的相应向量表示； Among them, i′ represents the relative position of the corresponding word vector in the sentence; Q _i′ is the corresponding vector representation of each word in the sentence Q1; Q′ _i′ is the corresponding vector representation of each word after the initial LSTM encoding; Q” _i′ is the corresponding vector representation of each word after the second LSTM encoding;

S30403、经过步骤S30401和步骤S30402得到相应字符级别的特征向量
以及词语级别的特征向量
将
和
按位相加，得到针对文本Q1的最终句子语义特征向量
公式如下： S30403: After step S30401 and step S30402, the feature vector of the corresponding character level is obtained
And word-level feature vectors
will
with
Add bitwise to get the final sentence semantic feature vector for text Q1
The formula is as follows:

对于句子Q2求取最终的句子语义特征向量
的方法，同步骤S30401到步骤S30403。 For sentence Q2, obtain the final sentence semantic feature vector
The method is the same as step S30401 to step S30403.
根据权利要求4所述的基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，所述步骤S305构建交互匹配层具体如下：The Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model according to claim 4, wherein the step S305 constructs an interactive matching layer specifically as follows:

S30501、经过步骤S304处理得到Q1、Q2的句子语义特征向量
和
针对
和
进行减法、叉乘以及点乘三种操作，得到
公式如下： S30501: After processing in step S304, the sentence semantic feature vectors of Q1 and Q2 are obtained
with
against
with
Perform three operations of subtraction, cross product, and dot product to get
The formula is as follows:

同时，使用一个全连接层Dense进一步编码得到
和
公式如下： At the same time, use a fully connected layer Dense to further encode
with
The formula is as follows:

其中，i表示相应语义特征在句子中的相对位置；Q1 _i为文本Q1经过步骤S304特征提取得到的
中每个语义特征的相应向量表示；Q2 _i为文本Q2经过步骤S304特征提取得到的
中每个语义特征的相应向量表示；
为针对句子语义特征向量
和
使用Dense进一步提取，得到的特征向量；
表示编码维度为300； Among them, i represents the relative position of the corresponding semantic feature in the sentence; Q1 _i is the text Q1 obtained by feature extraction in step S304
Each respective feature vector is the semantic representation; _I Q2 Q2 through the text feature extraction step S304 obtained
The corresponding vector representation of each semantic feature in;
For sentence semantic feature vector
with
Use Dense to further extract the feature vector obtained;
Indicates that the coding dimension is 300;

S30502、将
和
联接起来得到
公式如下： S30502, will
with
Connect to get
The formula is as follows:

同时，
和
同样进行减法、叉乘操作，公式如下： at the same time,
with
The same subtraction and cross multiplication operations are performed, and the formula is as follows:

再将二者结果联接得到
公式如下： Then connect the two results to get
The formula is as follows:

S30503、将
使用两层全连接层进行特征提取得到
并将
与
进行求和，得到
公式如下： S30503, will
Use two fully connected layers for feature extraction to get
And will
and
To sum and get
The formula is as follows:

S30504、将
经过一层全连接层编码后的结果与步骤S30501中
求和，得到句子对的匹配表征向量
公式如下： S30504, will
The result after a layer of fully connected layer encoding is the same as in step S30501
Sum, get the matching representation vector of the sentence pair
The formula is as follows:

所述步骤S306中构建预测层具体如下：The construction of the prediction layer in step S306 is specifically as follows:

S30601、预测层接收步骤S305输出的匹配表征向量，使用Sigmoid函数进行计算，得到处于[0,1]之间的匹配度表示y _pred； S30601: The prediction layer receives the matching representation vector output in step S305, and uses the Sigmoid function for calculation to obtain a matching degree representation y _pred between [0, 1];

S30602、将y _pred与设立的阈值进行比较来判别句子对的匹配程度，具体如下： S30602. Compare y _pred with the established threshold to determine the matching degree of the sentence pair, which is specifically as follows:

①、当y _pred≥0.5时，表示句子Q1以及句子Q2相匹配； ①. When y _pred ≥0.5, it means that sentence Q1 and sentence Q2 match;

②、当y _pred<0.5时，表示句子Q1以及句子Q2不匹配。 ②. When y _pred <0.5, it means that sentence Q1 and sentence Q2 do not match.
根据权利要求1所述的基于多粒度融合模型的中文句子语义智能匹配方法，其特征在于，所述步骤S4中训练多粒度融合模型具体如下：The Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model according to claim 1, wherein the training of the multi-granularity fusion model in step S4 is specifically as follows:

S401、构建损失函数：通过将均方误差设置为交叉熵的平衡因子，设计出平衡交叉熵，其中均方误差的公式如下：S401. Construct a loss function: design a balanced cross entropy by setting the mean square error as the balance factor of the cross entropy, and the formula for the mean square error is as follows:

其中，y _true表示真实标签，即每条训练样例中表示匹配与否的0、1标志；y _pred表示预测结果； Among them, y _true represents the true label, that is, the 0 and 1 flags that indicate match or not in each training example; y _pred represents the prediction result;

当分类边界模糊时，平衡交叉熵的使用能够自动平衡正负样本，并提高分类的准确性；其将交叉熵与均方误差融合，公式如下：When the classification boundary is blurred, the use of balanced cross entropy can automatically balance the positive and negative samples and improve the accuracy of classification; it fuses the cross entropy with the mean square error, and the formula is as follows:

S402、优化训练模型：选择使用RMSprop优化函数作为本模型的优化函数，超参数均选择Keras中的默认值设置。S402. Optimize the training model: choose to use the RMSprop optimization function as the optimization function of this model, and the hyperparameters are all set to default values in Keras.
一种基于多粒度融合模型的中文句子语义智能匹配装置，其特征在于，该装置包括，A Chinese sentence semantic intelligent matching device based on a multi-granularity fusion model is characterized in that the device includes:

文本匹配知识库构建单元，用于使用爬虫程序，在互联网公共问答平台爬取问题集，或者使用网上公开的文本匹配数据集，作为原始相似句子知识库，再对原始相似句子知识库进行预处理，主要操作为对原始相似句子知识库中的每个句子进行断字处理和分词处理，从而构建用于模型训练的文本匹配知识库；The text matching knowledge base building unit is used to use crawlers to crawl question sets on the Internet public question and answer platform, or use text matching data sets published on the Internet as the original similar sentence knowledge base, and then preprocess the original similar sentence knowledge base , The main operation is to perform hyphenation and word segmentation on each sentence in the original similar sentence knowledge base, so as to construct a text matching knowledge base for model training;

训练数据集生成单元，用于根据文本匹配知识库中的句子来构建训练正例数据和训练负例数据，并且基于正例数据与负例数据来构建最终的训练数据集；The training data set generation unit is used to construct training positive example data and training negative example data according to the sentences in the text matching knowledge base, and construct the final training data set based on the positive example data and the negative example data;

多粒度融合模型构建单元，用于构建字符词语映射转换表，并同时构建输入层、多粒度嵌入层、多粒度融合编码层、交互匹配层、预测层；其中，多粒度融合模型构建单元包括，The multi-granularity fusion model building unit is used to construct the character word mapping conversion table, and to construct the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer at the same time; among them, the multi-granularity fusion model building unit includes:

字符词语映射转换表构建子单元，用于对文本匹配知识库中的每个句子按字符和词语进行切分，并将每个字符和词语依次存入一个列表中，从而得到一个字符词语表，随后以数字1为起始，按照每个字符和词语被录入字符词语表的顺序依次递增排序，从而形成本发明所需的字符词语映射转换表；字符词语映射转换表构建完成后，表中每个字符和词语均被映射为唯一的数字标识；其后，使用Word2Vec训练字符词语向量模型，得到字符词语向量矩阵权重；The character word mapping conversion table constructs a subunit, which is used to segment each sentence in the text matching knowledge base by character and word, and store each character and word in a list in turn to obtain a character word table. Then, starting with the number 1, the characters and words are entered into the character word table in ascending order, thereby forming the character word mapping conversion table required by the present invention; after the character word mapping conversion table is constructed, each character in the table Characters and words are mapped to unique digital identifiers; after that, use Word2Vec to train the character word vector model to obtain the weight of the character word vector matrix;

输入层构建子单元，用于根据字符词语映射转换表，将输入句子中的每个字符和词语转化为相应的数字标识，从而完成数据的输入，具体来说就是分别获取q1与q2，将其形式化为：(q1-char,q1-word,q2-char,q2-word)；The input layer constructs sub-units, which are used to convert each character and word in the input sentence into a corresponding digital identifier according to the character-word mapping conversion table, thereby completing the data input, specifically to obtain q1 and q2 respectively, and convert them Formalized as: (q1-char,q1-word,q2-char,q2-word);

多粒度嵌入层构建子单元，用于加载预训练好的字符词语向量权重，将输入句子中的字符词语转换为字符词语向量形式，进而构成完整的句子向量表示；该操作根据字符词语的数字标识查找字符词语向量矩阵而完成；The multi-granular embedding layer constructs sub-units, which are used to load pre-trained character word vector weights, convert the character words in the input sentence into the character word vector form, and then form a complete sentence vector representation; this operation is based on the digital identification of the character words Completed by looking up the character word vector matrix;

多粒度融合编码层构建子单元，用于将多粒度嵌入层输出的词语级句子向量和字符级句子向量作为输入；先从两个角度来获取文本语义特征，即字符级别语义特征提取和词语级别语义特征提取；再通过按位相加的形式，对两个角度的文本语义特征进行整合，得到最终的句子语义特征向量；The multi-granularity fusion coding layer constructs sub-units, which are used to take the word-level sentence vector and character-level sentence vector output by the multi-granularity embedding layer as input; first obtain text semantic features from two perspectives, namely, character-level semantic feature extraction and word-level Semantic feature extraction; then through the form of bitwise addition, the semantic features of the text from the two perspectives are integrated to obtain the final sentence semantic feature vector;

交互匹配层构建子单元，用于将输入的两个句子语义特征向量，经过分层匹配计算，得到句子对的匹配表征向量；The interactive matching layer constructs sub-units, which are used to perform hierarchical matching calculations on the semantic feature vectors of the two input sentences to obtain the matching representation vectors of the sentence pairs;

预测层构建子单元，用于接收交互匹配层输出的匹配表征向量，使用Sigmoid函数进行计算，得到处于[0,1]之间的匹配度，最终通过与设立的阈值进行比较来判别句子对的匹配程度；The prediction layer constructs a sub-unit for receiving the matching representation vector output by the interactive matching layer, using the Sigmoid function to calculate, and obtaining the matching degree between [0,1], and finally judging the sentence pair by comparing with the established threshold Matching degree;

多粒度融合模型训练单元，用于构建模型训练过程中所需要的损失函数，并完成模型的优化训练。The multi-granularity fusion model training unit is used to construct the loss function needed in the model training process and complete the optimization training of the model.
根据权利要求7所述的基于多粒度融合模型的中文句子语义智能匹配装置，其特征在于，所述文本匹配知识库构建单元包括，The Chinese sentence semantic intelligent matching device based on a multi-granularity fusion model according to claim 7, wherein the text matching knowledge base building unit comprises:

爬取原始数据子单元，用于在互联网公共问答平台爬取问题集，或者使用网上公开的文本匹配数据集，构建原始相似句子知识库；Crawling the original data subunit, used to crawl the question set on the Internet public question and answer platform, or use the text matching data set published on the Internet to build the original similar sentence knowledge base;

原始数据处理子单元，用于将原始相似句子知识库中的句子进行断字处理和分词处理，从而构建用于模型训练的文本匹配知识库；The original data processing subunit is used to hyphenate and segment the sentences in the original similar sentence knowledge base, thereby constructing the text matching knowledge base for model training;

所述训练数据集生成单元包括，The training data set generating unit includes:

训练正例数据构建子单元，用于将文本匹配知识库中语义匹配的句子进行组合，并对其添加匹配标签1，构建为训练正例数据；The training positive example data construction subunit is used to combine semantically matched sentences in the text matching knowledge base, and add matching label 1 to it to construct the training positive example data;

训练负例数据构建子单元，用于先从文本匹配知识库中选取一个句子q ₁，再从文本匹配知识库中随机选择一个与句子q ₁语义不匹配的句子q ₂，将q ₁与q ₂进行组合，并对其添加匹配标签0，构建为训练负例数据； The training negative example data constructs a subunit, which is used to first select a sentence q ₁ from the text matching knowledge base, and then randomly select a _{sentence q 2} that does not match the sentence q ₁ semantically from the text matching knowledge base, and compare q ₁ with q ₂ Combine and add a matching label 0 to it, and construct it as training negative example data;

训练数据集构建子单元，用于将所有的训练正例数据与训练负例数据组合在一起，并打乱其顺序，从而构建最终的训练数据集；The training data set construction subunit is used to combine all the training positive example data and the training negative example data, and disrupt the order to construct the final training data set;

所述多粒度融合模型训练单元包括，The multi-granularity fusion model training unit includes:

损失函数构建子单元，用于构建损失函数，计算句子1和句子2间文本匹配度的误差；The loss function construction subunit is used to construct the loss function and calculate the error of the text matching degree between sentence 1 and sentence 2;

模型优化训练子单元，用于训练并调整模型训练中的参数，从而减小模型训练过程中预测的句子1与句子2间匹配度与真实匹配度之间的误差。The model optimization training subunit is used to train and adjust the parameters in the model training, thereby reducing the error between the predicted matching degree between sentence 1 and sentence 2 and the true matching degree in the model training process.
一种存储介质，其中存储有多条指令，其特征在于，所述指令由处理器加载，执行权利要求1-6中所述的基于多粒度融合模型的中文句子语义智能匹配方法的步骤。A storage medium storing a plurality of instructions, wherein the instructions are loaded by a processor to execute the steps of the Chinese sentence semantic intelligent matching method based on the multi-granularity fusion model described in claims 1-6.
一种电子设备，其特征在于，所述电子设备包括：An electronic device, characterized in that, the electronic device includes:

权利要求9所述的存储介质；以及The storage medium of claim 9; and

处理器，用于执行所述存储介质中的指令。The processor is configured to execute instructions in the storage medium.