WO2021164199A1 - Multi-granularity fusion model-based intelligent semantic Chinese sentence matching method, and device - Google Patents


Info

Publication number
WO2021164199A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
word
character
matching
vector
Prior art date
Application number
PCT/CN2020/104723
Other languages
French (fr)
Chinese (zh)
Inventor
鹿文鹏
王荣耀
张旭
贾瑞祥
郭韦钰
张维玉
Original Assignee
Qilu University of Technology (齐鲁工业大学)
Priority date
Filing date
Publication date
Application filed by Qilu University of Technology (齐鲁工业大学)
Publication of WO2021164199A1 publication Critical patent/WO2021164199A1/en


Classifications

    • G06F 16/3344 — Information retrieval of unstructured textual data; querying; query processing; query execution using natural language analysis
    • G06F 16/35 — Information retrieval of unstructured textual data; clustering; classification
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; architecture; combinations of networks
    • G06N 3/048 — Computing arrangements based on biological models; neural networks; architecture; activation functions
    • G06N 3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06F 18/241 — Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • The invention relates to the fields of artificial intelligence and natural language processing, and in particular to a method and device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model.
  • Sentence semantic matching plays a key role in many natural language processing tasks, such as question answering (QA), natural language inference (NLI), machine translation (MT), and so on.
  • The key to sentence semantic matching is to calculate the degree of matching between the semantics of a given sentence pair.
  • Sentences can be segmented at different granularities, such as characters, words, and phrases.
  • The most commonly used text segmentation granularity is the word, especially in the Chinese field.
  • The patent document CN106569999A discloses a multi-granularity short-text semantic similarity comparison method, which includes the following steps: S1, preprocess the short text, the preprocessing including Chinese word segmentation and part-of-speech tagging; S2, perform feature selection on the preprocessed short text; S3, measure distances over the feature-selected vector set to determine the similarity of the short texts. This solution, however, cannot completely solve the problem of precise semantic matching of sentences.
  • The technical task of the present invention is to provide a method and device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model, so as to solve the problems of incomplete semantic analysis and imprecise sentence matching in single-granularity models.
  • The intelligent matching method for Chinese sentence semantics based on a multi-granularity fusion model is specifically as follows:
  • S303 Construct a multi-granularity embedding layer: perform vector mapping on words and characters in the sentence to obtain word-level sentence vectors and character-level sentence vectors;
  • S304 Construct a multi-granularity fusion coding layer: perform coding processing on word-level sentence vectors and character-level sentence vectors to obtain sentence semantic feature vectors;
  • S305 Construct an interactive matching layer: perform hierarchical comparison of sentence semantic feature vectors to obtain matching representation vectors of sentence pairs;
  • The construction of the text matching knowledge base in step S1 is specifically as follows:
  • Preprocess the original data: preprocess the similar texts in the original similar-sentence knowledge base, performing word segmentation and hyphenation on each sentence to obtain the text matching knowledge base. Word segmentation takes each Chinese word as the basic unit and segments each piece of data into words; hyphenation takes each Chinese character as the basic unit and splits each piece of data into characters. Characters and words are separated by single spaces, and all content in each piece of data, including numbers, punctuation, and special characters, is retained (a sketch of this preprocessing is given below);
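  • As an illustration of the preprocessing just described, the following minimal sketch produces both granularities of a sentence; the patent does not name a segmentation tool, so the use of the jieba library (and the example sentence) is an assumption:

```python
# Sketch of the preprocessing step; jieba is assumed, as the patent
# does not name a specific word segmenter.
import jieba

def to_word_level(sentence):
    # Word-level granularity: words separated by single spaces.
    return " ".join(jieba.cut(sentence))

def to_char_level(sentence):
    # Character-level granularity: every character (digits, punctuation
    # and special characters included) separated by single spaces.
    return " ".join(sentence.replace(" ", ""))

sentence = "还款可以申请延期一天吗"      # illustrative sentence only
print(to_word_level(sentence))         # e.g. 还款 可以 申请 延期 一天 吗
print(to_char_level(sentence))         # 还 款 可 以 申 请 延 期 一 天 吗
```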
  • The construction of the training data set of the text matching model in step S2 is specifically as follows:
  • S201, construct training positive examples: combine each sentence with its semantically matching sentence and add the matching label 1, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 1); here Q1-char denotes sentence 1 at character-level granularity, Q1-word sentence 1 at word-level granularity, Q2-char sentence 2 at character-level granularity, and Q2-word sentence 2 at word-level granularity; the label 1 indicates that the two texts match, i.e., a positive example;
  • S202, construct training negative examples: select a sentence Q1, randomly select from the text matching knowledge base a sentence Q2 that does not match Q1, combine them and add the matching label 0, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 0); the label 0 indicates that the two texts do not match, i.e., a negative example;
  • S203, construct the training data set: combine all positive and negative samples obtained in steps S201 and S202 and shuffle their order to construct the final training data set; every piece of data, positive or negative, contains five dimensions: Q1-char, Q1-word, Q2-char, Q2-word, and the label 0 or 1.
  • The construction of the character-word mapping conversion table in step S301 is specifically as follows:
  • the character-word table is built from the text matching knowledge base obtained after preprocessing;
  • after the table is built, each character and word in it is mapped to a unique numeric identifier; the mapping rule starts from the number 1 and increments in the order in which each character or word is entered into the table, thereby forming the character-word mapping conversion table (see the sketch below);
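  • A minimal sketch of this mapping rule follows; the function and variable names are illustrative, not taken from the patent:

```python
def build_mapping_table(knowledge_base):
    """Map every character and word to a unique integer identifier,
    starting at 1 and increasing in the order tokens first appear."""
    mapping = {}
    next_id = 1
    for sentence in knowledge_base:   # sentences are space-separated tokens
        for token in sentence.split():
            if token not in mapping:
                mapping[token] = next_id
                next_id += 1
    return mapping
```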
  • The construction of the input layer in step S302 is specifically as follows:
  • the input layer includes four inputs: the two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word respectively, formalized as (Q1-char, Q1-word, Q2-char, Q2-word);
  • The construction of the multi-granularity embedding layer in step S303 is specifically as follows:
  • after processing by the multi-granularity embedding layer, the word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, and Q2-char Emd are obtained; every sentence in the text matching knowledge base can be converted from text into vector form through the character-word vector mapping;
  • The construction of the multi-granularity fusion coding layer in step S304 takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer in step S303 as input and obtains text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction; the features from the two perspectives are then integrated by bitwise addition to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:
  • i denotes the relative position of the corresponding character vector in the sentence; Q_i is the vector representation of each character in sentence Q1; Q′_i is the vector representation of each character after the first LSTM encoding; Q″_i is the vector representation of each character after the second LSTM encoding;
  • i′ denotes the relative position of the corresponding word vector in the sentence; Q_i′ is the vector representation of each word in sentence Q1; Q′_i′ is the vector representation of each word after the first LSTM encoding; Q″_i′ is the vector representation of each word after the second LSTM encoding;
  • S30403, steps S30401 and S30402 yield the character-level feature vector and the word-level feature vector; these two vectors are added bitwise, element by element, to obtain the final sentence semantic feature vector for text Q1;
  • the final sentence semantic feature vector for sentence Q2 is obtained by the same method, following steps S30401 to S30403;
  • The construction of the interactive matching layer in step S305 is specifically as follows:
  • S30501, step S304 yields the sentence semantic feature vectors of Q1 and Q2; subtraction, cross product, and dot product are applied to these two vectors to obtain the interaction results;
  • the dot product, also called the scalar product, yields the length of the projection of one vector onto the direction of the other, which is a scalar;
  • the cross product, also called the vector product, yields a vector perpendicular to the two given vectors;
  • i denotes the relative position of the corresponding semantic feature in the sentence; Q1_i is the vector representation of each semantic feature of text Q1 obtained by the feature extraction in step S304; Q2_i is the vector representation of each semantic feature of text Q2 obtained by the feature extraction in step S304;
  • for the sentence semantic feature vectors of Q1 and Q2, a Dense (fully connected) layer is used to further extract features; the coding dimension is 300;
  • S30504, the result of the fully connected layer encoding is summed with the results of step S30501 to obtain the matching representation vector of the sentence pair;
  • The construction of the prediction layer in step S306 is specifically as follows:
  • the prediction layer receives the matching representation vector output by step S305 and applies the Sigmoid function to obtain a matching degree y_pred in [0, 1];
  • The training of the multi-granularity fusion model in step S4 is specifically as follows:
  • y_true represents the true label, i.e., the 0/1 flag in each training example indicating match or mismatch;
  • y_pred represents the prediction result;
  • using balanced cross entropy automatically balances the positive and negative samples and improves classification accuracy; it fuses cross entropy with the mean squared error;
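  • A minimal Keras sketch of one plausible such fusion follows, with the MSE term acting as a per-sample balance factor; the patent's exact formula may differ from this reading:

```python
from tensorflow.keras import backend as K

def balanced_cross_entropy(y_true, y_pred):
    # One plausible fusion of cross entropy and mean squared error:
    # the squared error weights each sample's cross-entropy term, so
    # confidently wrong samples contribute more to the loss.
    eps = K.epsilon()
    y_pred = K.clip(y_pred, eps, 1.0 - eps)
    ce = -(y_true * K.log(y_pred) + (1.0 - y_true) * K.log(1.0 - y_pred))
    mse = K.square(y_true - y_pred)
    return K.mean(mse * ce)
```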
  • A device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model, comprising:
  • the text matching knowledge base construction unit, which uses a crawler to crawl question sets from public Internet question-and-answer platforms, or uses a text matching data set published on the Internet, as the original similar-sentence knowledge base, and then preprocesses that knowledge base;
  • the main operation is to hyphenate and word-segment each sentence in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base for model training;
  • the training data set generation unit is used to construct training positive example data and training negative example data according to the sentences in the text matching knowledge base, and construct the final training data set based on the positive example data and the negative example data;
  • the multi-granularity fusion model building unit is used to construct the character word mapping conversion table, and to construct the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer at the same time; among them, the multi-granularity fusion model building unit includes:
  • the character-word mapping conversion table construction subunit, which segments each sentence in the text matching knowledge base into characters and words and stores each character and word in a list in turn to obtain the character-word table; subsequently, starting from the number 1, identifiers are assigned in ascending order in the order in which the characters and words were entered into the table, forming the character-word mapping conversion table required by the present invention, in which each character and word is mapped to a unique numeric identifier; Word2Vec is then used to train the character-word vector model, yielding the character-word vector matrix weights;
  • the input layer construction subunit, which converts each character and word in the input sentences into the corresponding numeric identifier according to the character-word mapping conversion table, thereby completing the data input; specifically, q1 and q2 are obtained and formalized as (q1-char, q1-word, q2-char, q2-word);
  • the multi-granularity embedding layer construction subunit, which loads the pre-trained character-word vector weights and converts the characters and words in the input sentences into character-word vector form, thereby forming complete sentence vector representations; this operation is completed by looking up the character-word vector matrix using the numeric identifiers of the characters and words;
  • the multi-granularity fusion coding layer construction subunit, which takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer as input, first obtains text semantic features from the two perspectives of character-level and word-level semantic feature extraction, and then integrates the features from the two perspectives by bitwise addition to obtain the final sentence semantic feature vector;
  • the interactive matching layer construction subunit, which performs hierarchical matching calculations on the semantic feature vectors of the two input sentences to obtain the matching representation vector of the sentence pair;
  • the prediction layer construction subunit, which receives the matching representation vector output by the interactive matching layer, applies the Sigmoid function to obtain a matching degree in [0, 1], and finally judges the matching degree of the sentence pair by comparison with an established threshold;
  • the multi-granularity fusion model training unit is used to construct the loss function needed in the model training process and complete the optimization training of the model.
  • the text matching knowledge base building unit includes:
  • the original data processing subunit is used to hyphenate and segment the sentences in the original similar sentence knowledge base, thereby constructing the text matching knowledge base for model training;
  • the training data set generating unit includes:
  • the training positive example data construction subunit, which combines semantically matching sentences in the text matching knowledge base and adds the matching label 1, thereby constructing the training positive example data;
  • the training negative example data construction subunit, which first selects a sentence q1 from the text matching knowledge base, then randomly selects a sentence q2 that does not semantically match q1, combines q1 with q2 and adds the matching label 0, thereby constructing the training negative example data;
  • the training data set construction subunit, which combines all training positive and negative example data and shuffles their order to construct the final training data set;
  • the multi-granularity fusion model training unit includes:
  • the loss function construction subunit, which constructs the loss function and calculates the error in the text matching degree between sentence 1 and sentence 2;
  • the model optimization training subunit, which trains and adjusts the model parameters, thereby reducing the error between the predicted matching degree of sentence 1 and sentence 2 and their true matching degree during model training.
  • A storage medium, in which a plurality of instructions are stored, the instructions being loaded by a processor to execute the steps of the above method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model.
  • An electronic device, which includes:
  • the storage medium; and a processor configured to execute the instructions in the storage medium.
  • the present invention integrates word vectors and character vectors, and effectively extracts the semantic information of Chinese sentences from the two granularities of characters and words, thereby improving the accuracy of Chinese sentence coding;
  • the present invention can accurately realize the task of matching Chinese sentences
  • the present invention uses the mean squared error (MSE) as a balance factor to improve the cross-entropy loss function, thereby designing a balanced cross-entropy loss function; this loss function alleviates overfitting by blurring the classification boundary during training, and at the same time mitigates the class imbalance between positive and negative samples;
  • the multi-granularity fusion model uses different encoding methods to generate character-level and word-level sentence vectors: for word-level sentence vectors, two LSTM networks encode the sequence and an attention mechanism then performs deep feature extraction; for character-level sentence vectors, in addition to the same processing as for word-level vectors, a further LSTM layer and attention mechanism are added for encoding; the word-level and character-level encodings are finally superimposed as the multi-granularity fusion coding representation of the sentence, making the sentence representation more accurate and comprehensive (see the sketch below);
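  • A condensed tf.keras sketch of this encoding scheme, following steps S30401 to S30403, is given below; the built-in Attention layer stands in for the unspecified attention mechanism, and equal padded lengths for the two granularities are assumed so that the final bitwise addition is well-defined:

```python
from tensorflow.keras import layers

DIM = 300  # coding dimension, uniformly set to 300 in the patent

def encode_word_level(x):
    # Word level: two stacked LSTMs, then attention over the second output.
    h1 = layers.LSTM(DIM, return_sequences=True)(x)
    h2 = layers.LSTM(DIM, return_sequences=True)(h1)
    return layers.Attention()([h2, h2])   # self-attention, an assumption

def encode_char_level(x):
    # Character level: same two-LSTM pipeline, plus an extra attention
    # branch over the first LSTM output; the two branches are added bitwise.
    h1 = layers.LSTM(DIM, return_sequences=True)(x)
    h2 = layers.LSTM(DIM, return_sequences=True)(h1)
    a1 = layers.Attention()([h1, h1])
    a2 = layers.Attention()([h2, h2])
    return layers.Add()([a1, a2])

def fuse(char_emb, word_emb):
    # Superimpose the two encodings into the multi-granularity representation;
    # assumes both granularities are padded to the same sequence length.
    return layers.Add()([encode_char_level(char_emb),
                         encode_word_level(word_emb)])
```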
  • the effectiveness of the improved loss function and of the multi-granularity fusion model is verified on the LCQMC public data set;
  • the present invention realizes a multi-granularity fusion model, which considers both Chinese word-level granularity and character-level granularity, and integrates multi-granularity coding to better capture semantic features.
  • Figure 1 is a flow chart of a Chinese sentence semantic intelligent matching method based on a multi-granularity fusion model
  • Figure 2 is a block diagram of the process of constructing a text matching knowledge base
  • Figure 3 is a block diagram of the process of constructing the training data set of the text matching model
  • Figure 4 is a block diagram of the process of constructing a multi-granularity fusion model
  • Figure 5 is a block diagram of the process of training a multi-granularity fusion model
  • Figure 6 is a schematic diagram of a multi-granularity fusion model
  • Figure 7 is a schematic diagram of a multi-granularity embedding layer
  • Figure 8 is a schematic diagram of a multi-granularity fusion coding layer
  • Figure 9 is a schematic diagram of an interactive matching layer
  • Figure 10 is a block diagram of a device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model.
  • the intelligent matching method for Chinese sentence semantics based on the multi-granularity fusion model of the present invention is specifically as follows:
  • This embodiment uses a text matching data set publicly available on the Internet as the original knowledge base, namely the LCQMC data set [Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B.: LCQMC: A large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)].
  • This data set contains 260,068 annotated sentence pairs in total, divided into three parts: a training set of 238,766 pairs, a validation set of 8,802 pairs, and a test set of 12,500 pairs; it is a Chinese data set built specifically for text matching tasks.
  • Preprocess the original data: the similar texts obtained in step S101 are preprocessed, with word segmentation and hyphenation performed on each sentence, to obtain the text matching knowledge base.
  • In step S102, in order to avoid losing semantic information, the present invention retains all stop words in the sentences.
  • Word segmentation takes each Chinese word as the basic unit and segments each piece of data into words separated by single spaces; for example, sentence 2 shown in step S101, "Can you apply for a one-day extension of repayment?", becomes a sequence of words delimited by spaces. The present invention records a sentence after word segmentation as a sentence at word-level granularity.
  • Hyphenation takes each Chinese character as the basic unit and splits each piece of data into single characters separated by spaces, retaining the numbers, punctuation, and special characters in each piece of data; after hyphenation, the same example sentence becomes a sequence of single characters delimited by spaces. The present invention records a hyphenated sentence as a sentence at character-level granularity.
  • Here, as above, Q1-char denotes sentence 1 at character-level granularity, Q1-word sentence 1 at word-level granularity, Q2-char sentence 2 at character-level granularity, and Q2-word sentence 2 at word-level granularity; the label 1 indicates that sentence 1 and sentence 2 match (a positive example), and the label 0 indicates that sentence Q1 and sentence Q2 do not match (a negative example).
  • S203, construct the training data set: combine all positive and negative samples obtained in steps S201 and S202 and shuffle their order to construct the final training data set; every piece of data, positive or negative, contains five dimensions: Q1-char, Q1-word, Q2-char, Q2-word, and the label 0 or 1.
  • The core of the present invention is the multi-granularity fusion model, which can be divided into four parts: the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer;
  • the multi-granularity embedding layer performs vector mapping on the words and characters in a sentence to obtain word-level and character-level sentence vectors;
  • the multi-granularity fusion coding layer encodes the word-level and character-level sentence vectors to obtain the sentence semantic feature vector; the interactive matching layer then compares the sentence semantic feature vectors hierarchically to obtain the matching representation vector of the sentence pair; finally, the Sigmoid function of the prediction layer determines the semantic matching degree of the sentence pair.
  • the details are as follows:
  • the character-word table is built from the text matching knowledge base obtained after preprocessing;
  • after the table is built, each character and word in it is mapped to a unique numeric identifier; the mapping rule starts from the number 1 and increments in the order in which each character or word is entered into the table, thereby forming the character-word mapping conversion table;
  • embedding_matrix = numpy.zeros([len(tokenizer.word_index) + 1, embedding_dim])
  • where w2v_corpus is the training corpus, i.e., all data in the text matching knowledge base; embedding_dim is the dimension of the character-word vectors, set to 300 in the present invention; and word_set is the vocabulary (a training sketch follows);
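  • A sketch of this Word2Vec training step follows, assuming the gensim implementation and a Keras Tokenizer as the source of word_index; the toy corpus is illustrative only:

```python
import numpy
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer

embedding_dim = 300
w2v_corpus = ["还 款 可 以 申 请 延 期", "还款 可以 申请 延期"]  # toy stand-in

tokenizer = Tokenizer()          # assigns identifiers starting at 1
tokenizer.fit_on_texts(w2v_corpus)

sentences = [line.split() for line in w2v_corpus]
w2v = Word2Vec(sentences, vector_size=embedding_dim, min_count=1)  # gensim 4.x

# Row 0 is reserved for padding; identifiers in the mapping table start at 1.
embedding_matrix = numpy.zeros([len(tokenizer.word_index) + 1, embedding_dim])
for token, idx in tokenizer.word_index.items():
    if token in w2v.wv:
        embedding_matrix[idx] = w2v.wv[token]
```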
  • the input layer includes four inputs: the two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word respectively, formalized as (Q1-char, Q1-word, Q2-char, Q2-word);
  • the present invention uses the positive example text shown in step S201 as an example to form a piece of input data, with the result as follows:
  • using the mappings in the character-word table, the above input data are converted into numeric representations (assuming that the characters and words appearing in sentence 2 but not in sentence 1 are mapped to the identifiers 18 through 24), with the results as follows:
  • for the input sentences Q1 and Q2, the multi-granularity embedding layer produces the word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, and Q2-char Emd; every sentence in the text matching knowledge base can be converted from text into vector form through the character-word vector mapping; embedding_dim is set to 300 in the present invention;
  • here embedding_matrix is the character-word vector matrix weight trained in step S301; embedding_matrix.shape[0] is the vocabulary size of the character-word vector matrix; embedding_dim is the dimension of the output character-word vectors; and input_length is the length of the input sequence (see the Keras sketch below);
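  • In Keras terms, the parameters just listed map onto an Embedding layer roughly as follows, reusing embedding_matrix and embedding_dim from the sketch above; input_length and trainable=False are assumptions, since the patent states neither the padded length nor whether the weights are frozen:

```python
from tensorflow.keras.layers import Embedding

input_length = 40  # assumed padded sequence length

embedding_layer = Embedding(
    input_dim=embedding_matrix.shape[0],  # vocabulary size of the matrix
    output_dim=embedding_dim,             # character/word vector dimension (300)
    weights=[embedding_matrix],           # weights trained in step S301
    input_length=input_length,            # length of the input sequence
    trainable=False,                      # assumption: not stated in the patent
)
```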
  • S304, construct the multi-granularity fusion coding layer: as shown in FIG. 8, the word-level and character-level sentence vectors are encoded to obtain the sentence semantic feature vector; this layer takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer in step S303 as input and obtains text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction;
  • the text semantic features from the two perspectives are then integrated by bitwise addition to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:
  • i denotes the relative position of the corresponding character vector in the sentence; Q_i is the vector representation of each character in sentence Q1; Q′_i is the vector representation of each character after the first LSTM encoding; Q″_i is the vector representation of each character after the second LSTM encoding;
  • i′ denotes the relative position of the corresponding word vector in the sentence; Q_i′ is the vector representation of each word in sentence Q1; Q′_i′ is the vector representation of each word after the first LSTM encoding; Q″_i′ is the vector representation of each word after the second LSTM encoding;
  • S30403, steps S30401 and S30402 yield the character-level feature vector and the word-level feature vector;
  • the coding dimension of the present invention is uniformly set to 300; the character-level and word-level feature vectors are added bitwise to obtain the final sentence semantic feature vector for text Q1;
  • the final sentence semantic feature vector for sentence Q2 is obtained by the same method.
  • S30501, step S304 yields the sentence semantic feature vectors of Q1 and Q2; subtraction, cross product, and dot product are applied to these two vectors to obtain the interaction results;
  • the dot product, also called the scalar product, yields the length of the projection of one vector onto the direction of the other, which is a scalar;
  • the cross product, also called the vector product, yields a vector perpendicular to the two given vectors;
  • i denotes the relative position of the corresponding semantic feature in the sentence; Q1_i is the vector representation of each semantic feature of text Q1 obtained by the feature extraction in step S304; Q2_i is the vector representation of each semantic feature of text Q2 obtained by the feature extraction in step S304;
  • for the sentence semantic feature vectors of Q1 and Q2, a Dense (fully connected) layer is used to further extract features; the coding dimension is 300;
  • S30504, the result of the fully connected layer encoding is summed with the results of step S30501 to obtain the matching representation vector of the sentence pair (a sketch follows);
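  • A tf.keras sketch of these interaction operations follows. The 300-dimensional feature vectors admit no literal vector cross product, so element-wise multiplication stands in for it here, and the scalar dot product is concatenated rather than summed; both choices are implementation assumptions:

```python
from tensorflow.keras import layers

def interact(q1, q2):
    # q1, q2: sentence semantic feature vectors of shape (batch, 300).
    diff = layers.Subtract()([q1, q2])    # subtraction
    prod = layers.Multiply()([q1, q2])    # stand-in for the "cross product"
    dot = layers.Dot(axes=-1)([q1, q2])   # dot product, a scalar per pair

    # Dense re-encoding of the two feature vectors (coding dimension 300),
    # summed with the interaction results as in step S30504.
    d1 = layers.Dense(300)(q1)
    d2 = layers.Dense(300)(q2)
    summed = layers.Add()([d1, d2, diff, prod])

    # The scalar dot product is appended to form the matching representation.
    return layers.Concatenate()([summed, dot])
```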
  • the prediction layer receives the matching representation vector output by step S305 and applies the Sigmoid function to obtain a matching degree y_pred in [0, 1];
  • y_true represents the true label, i.e., the 0/1 flag in each training example indicating match or mismatch;
  • y_pred represents the prediction result;
  • using balanced cross entropy automatically balances the positive and negative samples and improves classification accuracy; it fuses cross entropy with the mean squared error;
  • the present invention designs this balanced cross-entropy loss function to prevent overfitting;
  • cross entropy is a common loss function for training models;
  • however, a model trained purely by maximum likelihood estimation is sensitive to input noise and tends to push each training sample's prediction to exactly 0 or 1, leading to overfitting;
  • the present invention therefore proposes to use the mean squared error (MSE) as a balance parameter between positive and negative samples, thereby greatly improving the performance of the model;
  • model = keras.models.Model([Q1_char, Q1_word, Q2_char, Q2_word], [y_pred])
  • in this step, the loss function loss is the custom Loss defined in step S401; the optimization algorithm optimizer is the previously defined optim; Q1-char, Q1-word, Q2-char, and Q2-word are the model inputs and y_pred is the model output; as evaluation metrics, the present invention selects accuracy, precision, recall, and the F1-score computed from recall and precision (a sketch follows);
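  • Assembled in Keras, the training configuration described here might look like the sketch below; the Adam optimizer and the metric objects are assumptions where the text is silent, and balanced_cross_entropy refers to the loss sketched earlier:

```python
from tensorflow import keras

# Q1_char, Q1_word, Q2_char, Q2_word are the Input tensors from step S302,
# and y_pred is the sigmoid output of the prediction layer (step S306).
model = keras.models.Model(
    inputs=[Q1_char, Q1_word, Q2_char, Q2_word], outputs=[y_pred])

model.compile(
    loss=balanced_cross_entropy,        # the custom Loss from step S401
    optimizer=keras.optimizers.Adam(),  # "optim" in the text; Adam assumed
    metrics=["accuracy",
             keras.metrics.Precision(),
             keras.metrics.Recall()],
)
# The F1-score is then computed from precision (p) and recall (r):
# f1 = 2 * p * r / (p + r)
```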
  • the model of the present invention achieves better results than current models on the LCQMC public data set.
  • the comparison of the experimental results is shown in the following table:
  • The first fourteen rows are the experimental results of prior-art models [Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B.: LCQMC: A large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)]. Comparing the model of the present invention with the existing models shows that the method of the present invention performs best.
  • The intelligent matching device for Chinese sentence semantics based on the multi-granularity fusion model of the present invention includes:
  • the text matching knowledge base construction unit, which uses a crawler to crawl question sets from public Internet question-and-answer platforms, or uses a text matching data set published on the Internet, as the original similar-sentence knowledge base, and then preprocesses that knowledge base;
  • the main operation is to hyphenate and word-segment each sentence in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base for model training;
  • the text matching knowledge base building unit includes:
  • the original data processing subunit is used to hyphenate and segment the sentences in the original similar sentence knowledge base, thereby constructing the text matching knowledge base for model training;
  • the training data set generation unit, which constructs training positive and negative example data from the sentences in the text matching knowledge base and builds the final training data set from them; the training data set generation unit includes:
  • the training positive example data construction subunit, which combines semantically matching sentences in the text matching knowledge base and adds the matching label 1, thereby constructing the training positive example data;
  • the training negative example data construction subunit, which first selects a sentence q1 from the text matching knowledge base, then randomly selects a sentence q2 that does not semantically match q1, combines q1 with q2 and adds the matching label 0, thereby constructing the training negative example data;
  • the training data set construction subunit, which combines all training positive and negative example data and shuffles their order to construct the final training data set;
  • the multi-granularity fusion model building unit is used to construct the character word mapping conversion table, and to construct the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer at the same time; among them, the multi-granularity fusion model building unit includes:
  • the character-word mapping conversion table construction subunit, which segments each sentence in the text matching knowledge base into characters and words and stores each character and word in a list in turn to obtain the character-word table; subsequently, starting from the number 1, identifiers are assigned in ascending order in the order in which the characters and words were entered into the table, forming the character-word mapping conversion table required by the present invention, in which each character and word is mapped to a unique numeric identifier; Word2Vec is then used to train the character-word vector model, yielding the character-word vector matrix weights;
  • the input layer construction subunit, which converts each character and word in the input sentences into the corresponding numeric identifier according to the character-word mapping conversion table, thereby completing the data input; specifically, q1 and q2 are obtained and formalized as (q1-char, q1-word, q2-char, q2-word);
  • the multi-granularity embedding layer construction subunit, which loads the pre-trained character-word vector weights and converts the characters and words in the input sentences into character-word vector form, thereby forming complete sentence vector representations; this operation is completed by looking up the character-word vector matrix using the numeric identifiers of the characters and words;
  • the multi-granularity fusion coding layer construction subunit, which takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer as input, first obtains text semantic features from the two perspectives of character-level and word-level semantic feature extraction, and then integrates the features from the two perspectives by bitwise addition to obtain the final sentence semantic feature vector;
  • the interactive matching layer construction subunit, which performs hierarchical matching calculations on the semantic feature vectors of the two input sentences to obtain the matching representation vector of the sentence pair;
  • the prediction layer construction subunit, which receives the matching representation vector output by the interactive matching layer, applies the Sigmoid function to obtain a matching degree in [0, 1], and finally judges the matching degree of the sentence pair by comparison with an established threshold;
  • the multi-granular fusion model training unit is used to construct the loss function needed in the model training process and complete the optimization training of the model; the multi-granular fusion model training unit includes:
  • the loss function construction subunit, which constructs the loss function and calculates the error in the text matching degree between sentence 1 and sentence 2;
  • the model optimization training subunit, which trains and adjusts the model parameters, thereby reducing the error between the predicted matching degree of sentence 1 and sentence 2 and their true matching degree during model training.
  • the device for intelligent semantic matching of Chinese sentences based on the multi-granularity fusion model shown in FIG. 10 can be integrated and deployed in various hardware devices, such as personal computers, workstations, and smart mobile devices.
  • A plurality of instructions are stored therein; the instructions are loaded by a processor to execute the steps of the method for intelligent semantic matching of Chinese sentences based on the multi-granularity fusion model of Embodiment 1.
  • The electronic device includes:
  • the storage medium; and a processor for executing the instructions in the storage medium.


Abstract

Disclosed are a multi-granularity fusion model-based intelligent semantic Chinese sentence matching method and a device, pertaining to the field of artificial intelligence and the field of natural language processing. The present invention addresses the technical problems of non-comprehensive semantic analysis and inaccurate sentence matching of single granularity-based models. The method is specifically as follows: S1, constructing a text matching knowledge database; S2, constructing a training data set of a text matching model; S3, constructing a multi-granularity fusion model, which is specifically as follows: S301, constructing a character word mapping conversion table; S302, constructing an input layer; S303, constructing a multi-granularity embedding layer; S304, constructing a multi-granularity fusion encoding layer; S305, constructing an interaction matching layer, and S306, constructing a prediction layer; and S4, training the multi-granularity fusion model. The device comprises a text matching knowledge database construction unit, a training data set construction unit for a text matching model, a multi-granularity fusion model construction unit, and a multi-granularity fusion model training unit.

Description

Chinese sentence semantic intelligent matching method and device based on a multi-granularity fusion model

Technical field
The invention relates to the fields of artificial intelligence and natural language processing, and in particular to a method and device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model.

Background art

Sentence semantic matching plays a key role in many natural language processing tasks, such as question answering (QA), natural language inference (NLI), and machine translation (MT). The key to sentence semantic matching is to calculate the degree of matching between the semantics of a given sentence pair. Sentences can be segmented at different granularities, such as characters, words, and phrases. Currently, the most commonly used text segmentation granularity is the word, especially in Chinese.

At present, most Chinese sentence semantic matching models are oriented to word granularity and ignore other segmentation granularities. Such models cannot fully capture the semantic features embedded in a sentence, and sometimes even introduce noise, which affects the accuracy of sentence matching. Researchers in this field therefore increasingly consider semantic matching from multiple perspectives or granularities of a sentence; relatively successful models include MultiGranCNN, MV-LSTM, MPCM, BiMPM, and DIIN. Although these models alleviate the limitations of word-granularity modeling to a certain extent, they still cannot completely solve the problem of precise matching of sentence semantics, which is especially prominent for Chinese with its rich semantic features.

The patent document CN106569999A discloses a multi-granularity short-text semantic similarity comparison method, which includes the following steps: S1, preprocess the short text, the preprocessing including Chinese word segmentation and part-of-speech tagging; S2, perform feature selection on the preprocessed short text; S3, measure distances over the feature-selected vector set to determine the similarity of the short texts. However, this technical solution cannot completely solve the problem of precise semantic matching of sentences.
Summary of the invention

The technical task of the present invention is to provide a method and device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model, so as to solve the problems of incomplete semantic analysis and imprecise sentence matching in single-granularity models.

The technical task of the present invention is achieved in the following manner, by an intelligent matching method for Chinese sentence semantics based on a multi-granularity fusion model, the method being specifically as follows:

S1. Build a text matching knowledge base;

S2. Construct the training data set of the text matching model: for each sentence there is a corresponding standard semantically matching sentence in the text matching knowledge base, which can be combined with it to construct a training positive example; other, non-matching sentences can be freely combined to construct training negative examples; users can set the number of negative examples according to the size of the text matching knowledge base, thereby constructing the training data set;

S3. Build the multi-granularity fusion model; specifically:

S301. Construct the character-word mapping conversion table;

S302. Construct the input layer;

S303. Construct the multi-granularity embedding layer: perform vector mapping on the words and characters in the sentence to obtain word-level and character-level sentence vectors;

S304. Construct the multi-granularity fusion coding layer: encode the word-level and character-level sentence vectors to obtain the sentence semantic feature vector;

S305. Construct the interactive matching layer: compare the sentence semantic feature vectors hierarchically to obtain the matching representation vector of the sentence pair;

S306. Construct the prediction layer: apply the Sigmoid function of the prediction layer to judge the degree of semantic matching of the sentence pair;

S4. Train the multi-granularity fusion model.
Preferably, the construction of the text matching knowledge base in step S1 is specifically as follows:

S101. Use a crawler to obtain the original data: crawl question sets from public Internet question-and-answer platforms to obtain the original similar-sentence knowledge base; or use a sentence matching data set published on the Internet as the original similar-sentence knowledge base;

S102. Preprocess the original data: preprocess the similar texts in the original similar-sentence knowledge base, performing word segmentation and hyphenation on each sentence to obtain the text matching knowledge base; word segmentation takes each Chinese word as the basic unit and segments each piece of data into words; hyphenation takes each Chinese character as the basic unit and splits each piece of data into characters; characters and words are separated by single spaces, and all content in each piece of data, including numbers, punctuation, and special characters, is retained;

The construction of the training data set of the text matching model in step S2 is specifically as follows:

S201. Construct training positive examples: combine a sentence with its corresponding semantically matching sentence to construct a training positive example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 1);

where Q1-char denotes sentence 1 at character-level granularity; Q1-word denotes sentence 1 at word-level granularity; Q2-char denotes sentence 2 at character-level granularity; Q2-word denotes sentence 2 at word-level granularity; and the label 1 indicates that sentence 1 and sentence 2 match, i.e., a positive example;

S202. Construct training negative examples: select a sentence Q1, then randomly select from the text matching knowledge base a sentence Q2 that does not match Q1, and combine Q1 and Q2 to construct a negative example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 0);

where the label 0 indicates that sentence Q1 and sentence Q2 do not match, i.e., a negative example;

S203. Construct the training data set: combine all positive and negative samples obtained in steps S201 and S202 and shuffle their order to construct the final training data set; every piece of data, positive or negative, contains five dimensions: Q1-char, Q1-word, Q2-char, Q2-word, and the label 0 or 1.
More preferably, the construction of the character-word mapping conversion table in step S301 is specifically as follows:

S30101. The character-word table is built from the text matching knowledge base obtained after preprocessing;

S30102. After the character-word table is built, each character and word in it is mapped to a unique numeric identifier; the mapping rule starts from the number 1 and increments in the order in which each character or word is entered into the table, thereby forming the character-word mapping conversion table;

S30103. Use Word2Vec to train the character-word vector model to obtain the character-word vector matrix weights embedding_matrix;

The construction of the input layer in step S302 is specifically as follows:

S30201. The input layer includes four inputs; the two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word respectively, formalized as (Q1-char, Q1-word, Q2-char, Q2-word);

S30202. Each character and word in the input sentences is converted into the corresponding numeric identifier according to the character-word mapping conversion table constructed in step S301.
More preferably, the construction of the multi-granularity embedding layer in step S303 is specifically as follows:

S30301. Initialize the weight parameters of this layer by loading the character-word vector matrix weights trained in step S301;

S30302. For the input sentences Q1 and Q2, the multi-granularity embedding layer produces the word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, and Q2-char Emd; every sentence in the text matching knowledge base can be converted from text into vector form through the character-word vector mapping;

The construction of the multi-granularity fusion coding layer in step S304 takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer in step S303 as input and obtains text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction; the features from the two perspectives are then integrated by bitwise addition to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:
S30401. Character-level semantic feature extraction, specifically:

S3040101. Use an LSTM for feature extraction to obtain the feature vector Q′_i:

Q′_i = LSTM(Q_i)

S3040102. Two different encodings are further applied to Q′_i, as follows:

①. Apply the LSTM to Q′_i again for secondary feature extraction, obtaining the feature vector Q″_i:

Q″_i = LSTM(Q′_i)

②. Apply the attention mechanism to Q′_i to extract features, obtaining the feature vector A′_i:

A′_i = Attention(Q′_i)

S3040103. Apply Attention to Q″_i once more to extract key features, obtaining the feature vector A″_i:

A″_i = Attention(Q″_i)

S3040104. Add A′_i and A″_i bitwise to obtain the character-level semantic feature F_char:

F_char = A′_i ⊕ A″_i

where i denotes the relative position of the corresponding character vector in the sentence; Q_i is the vector representation of each character in sentence Q1; Q′_i is the vector representation of each character after the first LSTM encoding; and Q″_i is the vector representation of each character after the second LSTM encoding;
S30402: Word-level semantic feature extraction proceeds as follows:
S3040201: Apply an LSTM for feature extraction, obtaining the first-stage feature vector; schematically:
Q′_i′ = LSTM(Q_i′)
S3040202: Apply a second LSTM to the first-stage output for secondary feature extraction, obtaining the corresponding feature vector; schematically:
Q″_i′ = LSTM(Q′_i′)
S3040203: Apply Attention to the second-stage output to encode once more and extract key features, obtaining the word-level feature vector;
where i′ denotes the relative position of the corresponding word vector in the sentence; Q_i′ is the vector representation of each word in sentence Q1; Q′_i′ is the vector representation of each word after the first LSTM encoding; and Q″_i′ is the vector representation of each word after the second LSTM encoding;
S30403: Steps S30401 and S30402 yield the character-level feature vector and the word-level feature vector, respectively; these two vectors are added element-wise to obtain the final sentence semantic feature vector for text Q1.
The final sentence semantic feature vector for sentence Q2 is obtained in the same way, following steps S30401 to S30403.
More preferably, the construction of the interactive matching layer in step S305 is as follows:
S30501: Step S304 yields the sentence semantic feature vectors of Q1 and Q2. Three operations are applied to this pair of vectors: subtraction, cross product, and dot product, yielding three interaction feature vectors.
Here the dot product (also called the scalar product) gives the length of the projection of one vector onto the direction of the other and is a scalar; the cross product (also called the vector product) gives a vector perpendicular to both of the input vectors.
At the same time, a fully connected layer (Dense) is used to further encode the two sentence semantic feature vectors, yielding two Dense-encoded representations;
where i denotes the relative position of the corresponding semantic feature in the sentence; Q1_i is the vector representation of each semantic feature of text Q1 extracted in step S304, and Q2_i is the corresponding representation for text Q2; the Dense-encoded representations are the feature vectors obtained by further Dense extraction from the sentence semantic feature vectors, and the encoding dimension is 300;
S30502: Concatenate the three interaction feature vectors from step S30501 into a single vector.
At the same time, the same subtraction and cross-product operations are applied to the two Dense-encoded representations, and the two resulting vectors are likewise concatenated.
S30503: The concatenated interaction vector is further processed by two fully connected layers for feature extraction, and the result is summed with the concatenated vector of the Dense branch.
S30504: The summed vector is encoded by one more fully connected layer, and the result is summed with a vector obtained in step S30501 (identified in the original only by a formula image) to obtain the matching representation vector of the sentence pair.
The construction of the prediction layer in step S306 is as follows:
S30601: The prediction layer receives the matching representation vector output in step S305 and applies the Sigmoid function, obtaining a matching degree y_pred in [0, 1];
S30602: y_pred is compared with a preset threshold to judge how well the sentence pair matches, as follows:
①: when y_pred ≥ 0.5, sentence Q1 and sentence Q2 match;
②: when y_pred < 0.5, sentence Q1 and sentence Q2 do not match.
Preferably, the training of the multi-granularity fusion model in step S4 is as follows:
S401: Construct the loss function: a balanced cross entropy is designed by using the mean square error (MSE) as a balancing factor for the cross entropy, where the mean square error, taken over the n training examples, is
L_MSE = (1/n) Σ (y_true − y_pred)²
where y_true is the true label, i.e., the 0/1 flag in each training example indicating a match or not, and y_pred is the predicted result;
when the classification boundary is blurred, the balanced cross entropy automatically balances positive and negative samples and improves classification accuracy; it fuses the cross entropy with the mean square error (the fusion formula itself appears only as an image in the original);
S402: Optimize and train the model: the RMSprop optimizer is used as the optimization function of this model, with all hyperparameters set to the Keras defaults.
An intelligent Chinese sentence semantic matching device based on a multi-granularity fusion model, the device comprising:
a text matching knowledge base construction unit, configured to crawl question sets from public Internet question-answering platforms with a crawler program, or to use text matching data sets published online, as the original similar-sentence knowledge base, and then to preprocess the original similar-sentence knowledge base, mainly by performing character segmentation and word segmentation on each of its sentences, thereby constructing the text matching knowledge base used for model training;
a training data set generation unit, configured to construct positive and negative training examples from the sentences in the text matching knowledge base, and to construct the final training data set from the positive and negative examples;
a multi-granularity fusion model construction unit, configured to construct the character/word mapping table and, at the same time, the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer; the multi-granularity fusion model construction unit comprises:
a character/word mapping table construction subunit, configured to split each sentence in the text matching knowledge base into characters and words and to store each character and word, in order, in a list, thereby obtaining a character/word vocabulary; starting from the number 1, numeric identifiers are then assigned in increasing order of entry into the vocabulary, forming the character/word mapping table required by the present invention, in which every character and word is mapped to a unique numeric identifier; afterwards, Word2Vec is used to train the character/word vector model, yielding the character/word vector matrix weights;
an input layer construction subunit, configured to convert each character and word of an input sentence into its numeric identifier according to the character/word mapping table, thereby completing the data input; specifically, q1 and q2 are obtained and formalized as (q1-char, q1-word, q2-char, q2-word);
a multi-granularity embedding layer construction subunit, configured to load the pre-trained character/word vector weights and convert the characters and words of an input sentence into character/word vector form, thereby constituting the complete sentence vector representation; this operation is completed by looking up the character/word vector matrix with the numeric identifiers of the characters and words;
a multi-granularity fusion coding layer construction subunit, configured to take the word-level and character-level sentence vectors output by the multi-granularity embedding layer as input, first extracting text semantic features from two perspectives, namely character-level and word-level semantic feature extraction, and then integrating the two sets of features by element-wise addition to obtain the final sentence semantic feature vector;
an interactive matching layer construction subunit, configured to perform hierarchical matching calculations on the two input sentence semantic feature vectors to obtain the matching representation vector of the sentence pair;
a prediction layer construction subunit, configured to receive the matching representation vector output by the interactive matching layer, apply the Sigmoid function to obtain a matching degree in [0, 1], and finally judge whether the sentence pair matches by comparison with a preset threshold;
a multi-granularity fusion model training unit, configured to construct the loss function needed in model training and to complete the optimization training of the model.
Preferably, the text matching knowledge base construction unit comprises:
an original data crawling subunit, configured to crawl question sets from public Internet question-answering platforms, or to use text matching data sets published online, to build the original similar-sentence knowledge base;
an original data processing subunit, configured to perform character segmentation and word segmentation on the sentences in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base used for model training;
the training data set generation unit comprises:
a positive training example construction subunit, configured to combine semantically matching sentences in the text matching knowledge base and attach the matching label 1 to them, constructing positive training examples;
a negative training example construction subunit, configured to first select a sentence q1 from the text matching knowledge base, then randomly select from the knowledge base a sentence q2 that does not semantically match q1, combine q1 with q2, and attach the matching label 0, constructing negative training examples;
a training data set construction subunit, configured to combine all positive and negative training examples and shuffle their order, thereby constructing the final training data set;
the multi-granularity fusion model training unit comprises:
a loss function construction subunit, configured to construct the loss function and compute the error of the text matching degree between sentence 1 and sentence 2;
a model optimization training subunit, configured to train the model and adjust its parameters, thereby reducing the error between the matching degree predicted for sentence 1 and sentence 2 during training and the true matching degree.
A storage medium storing a plurality of instructions which, when loaded by a processor, execute the steps of the above intelligent Chinese sentence semantic matching method based on a multi-granularity fusion model.
An electronic device, the electronic device comprising:
the above storage medium; and
a processor configured to execute the instructions in the storage medium.
The intelligent Chinese sentence semantic matching method and device based on a multi-granularity fusion model of the present invention have the following advantages:
(1) The present invention integrates word vectors and character vectors, effectively extracting the semantic information of Chinese sentences at both the character and word granularities and thereby improving the accuracy of Chinese sentence encoding;
(2) Chinese sentences are modeled at both the character and word granularities, and the semantic features of a sentence are obtained from each granularity separately; the key semantic information of a sentence can be extracted and reinforced at both granularities, greatly improving the representation of the sentence's key semantics;
(3) In engineering practice, the present invention can accurately accomplish the task of Chinese sentence matching;
(4) The present invention uses the mean square error (MSE) as a balancing factor to improve the cross-entropy loss function, yielding a balanced cross-entropy loss function; this loss function addresses overfitting, blurs the classification boundary during training, and alleviates the class imbalance between positive and negative samples;
(5) For an input sentence, the multi-granularity fusion model uses different encoding methods to generate the character-level and word-level sentence vectors: the word-level sentence vector is encoded sequentially by two LSTM networks, after which an attention mechanism performs deep feature extraction; the character-level sentence vector, in addition to the same processing as the word-level vector, is encoded by a supplementary LSTM layer and attention mechanism. The word-level and character-level encodings are finally superimposed as the multi-granularity fusion encoding of the sentence, making the sentence representation more accurate and comprehensive;
(6) The present invention uses the mean square error (MSE) as a balancing factor to improve the cross-entropy loss function; extensive experiments on the public LCQMC data set demonstrate that the present invention outperforms existing methods;
(7) The present invention realizes a multi-granularity fusion model that considers Chinese word-level and character-level granularity simultaneously and integrates multi-granularity encodings to better capture semantic features.
Description of the drawings
The present invention is further described below with reference to the accompanying drawings.
Figure 1 is a flow chart of the intelligent Chinese sentence semantic matching method based on a multi-granularity fusion model;
Figure 2 is a flow chart of constructing the text matching knowledge base;
Figure 3 is a flow chart of constructing the training data set of the text matching model;
Figure 4 is a flow chart of constructing the multi-granularity fusion model;
Figure 5 is a flow chart of training the multi-granularity fusion model;
Figure 6 is a schematic diagram of the multi-granularity fusion model;
Figure 7 is a schematic diagram of the multi-granularity embedding layer;
Figure 8 is a schematic diagram of the multi-granularity fusion coding layer;
Figure 9 is a schematic diagram of the interactive matching layer;
Figure 10 is a structural block diagram of the intelligent Chinese sentence semantic matching device based on a multi-granularity fusion model.
Detailed description of the embodiments
The intelligent Chinese sentence semantic matching method and device based on a multi-granularity fusion model of the present invention are described in detail below with reference to the accompanying drawings and specific embodiments.
Embodiment 1:
As shown in Figure 1, the intelligent Chinese sentence semantic matching method based on a multi-granularity fusion model of the present invention proceeds as follows:
S1. Construct the text matching knowledge base; as shown in Figure 2, the details are as follows:
S101. Obtain raw data with a crawler: crawl question sets from public Internet question-answering platforms to obtain the original similar-sentence knowledge base, or use a sentence matching data set published online as the original similar-sentence knowledge base;
Public question-answering platforms on the Internet contain a large amount of question-answer data and recommendations of similar questions, all open to the public. A crawler program can therefore be designed according to the characteristics of such a platform to collect sets of semantically similar text sentences, thereby building the original similar-sentence knowledge base.
Example: similar texts from a bank question-answering platform are shown below:
Sentence 1: 还款期限可以延后一天吗？ (Can the repayment deadline be postponed by one day?)
Sentence 2: 是否可以申请延期一天还款？ (Is it possible to apply for a one-day extension of the repayment?)
Alternatively, a text matching data set published online can be used as the original knowledge base, for example the LCQMC data set [Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B.: LCQMC: A large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952-1962 (2018)]. This data set contains 260,068 annotated pairs in total, divided into three parts: a training set of 238,766 pairs, a validation set of 8,802 pairs, and a test set of 12,500 pairs; it is a Chinese data set built specifically for text matching tasks.
S102. Preprocess the raw data: preprocess the similar texts in the original similar-sentence knowledge base, performing word segmentation and character segmentation on each sentence to obtain the text matching knowledge base;
The similar texts obtained in step S101 are preprocessed to obtain the text matching knowledge base. In step S102, to avoid losing semantic information, the present invention retains all stop words in the sentences.
Word segmentation takes each Chinese word as the basic unit and segments every data item into words. For example, sentence 2 from step S101, "是否可以申请延期一天还款？", becomes "是否 可以 申请 延期 一天 还款 ？" after word segmentation. The present invention records a word-segmented sentence as a word-level granularity sentence.
Character segmentation takes each Chinese character as the basic unit and segments every data item into characters; adjacent characters are separated by spaces, and all content of each data item, including digits, punctuation, and special characters, is retained. For example, sentence 2, "是否可以申请延期一天还款？", becomes "是 否 可 以 申 请 延 期 一 天 还 款 ？" after character segmentation. The present invention records a character-segmented sentence as a character-level granularity sentence.
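As an illustration only, the two granularities of preprocessing in step S102 can be sketched in Python as below; the patent does not name a specific word-segmentation tool, so the jieba library is assumed here, and to_word_level and to_char_level are hypothetical helper names.
# -*- coding: utf-8 -*-
import jieba  # assumed segmenter; the patent does not prescribe one

def to_word_level(sentence):
    # Word-level granularity: words separated by spaces, stop words kept.
    return " ".join(jieba.cut(sentence))

def to_char_level(sentence):
    # Character-level granularity: every character (including digits,
    # punctuation, and special characters) separated by spaces.
    return " ".join(list(sentence))

print(to_word_level("是否可以申请延期一天还款？"))
print(to_char_level("是否可以申请延期一天还款？"))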
S2. Construct the training data set of the text matching model: for every sentence there is a corresponding standard semantically matching sentence in the text matching knowledge base, which can be combined with it to build a positive training example; other, non-matching sentences can be freely combined to build negative training examples; the user can set the number of negative examples according to the size of the text matching knowledge base, thereby constructing the training data set. As shown in Figure 3, the details are as follows:
S201. Construct positive training examples: combine a sentence with its corresponding semantically matching sentence to build a positive example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 1);
where Q1-char is sentence 1 at character-level granularity, Q1-word is sentence 1 at word-level granularity, Q2-char is sentence 2 at character-level granularity, Q2-word is sentence 2 at word-level granularity, and the label 1 indicates that sentence 1 and sentence 2 match, i.e., a positive example;
Example: for sentence 1 and sentence 2 shown in step S101, after the preprocessing of step S102, the constructed positive example is:
("还 款 期 限 可 以 延 后 一 天 吗 ？", "还款 期限 可以 延后 一天 吗 ？", "是 否 可 以 申 请 延 期 一 天 还 款 ？", "是否 可以 申请 延期 一天 还款 ？", 1).
S202. Construct negative training examples: select a sentence Q1, then randomly select from the text matching knowledge base a sentence Q2 that does not match Q1, and combine Q1 with Q2 to build a negative example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 0);
where Q1-char is sentence 1 at character-level granularity, Q1-word is sentence 1 at word-level granularity, Q2-char is sentence 2 at character-level granularity, Q2-word is sentence 2 at word-level granularity, and the label 0 indicates that sentence Q1 and sentence Q2 do not match, i.e., a negative example;
Example: following the example data shown in step S201, the present invention again uses the original question as Q1 and randomly selects from the text matching knowledge base a sentence Q2 that does not semantically match Q1; after combining Q1 and Q2 and the preprocessing of step S102, the constructed negative example is:
("还 款 期 限 可 以 延 后 一 天 吗 ？", "还款 期限 可以 延后 一天 吗 ？", "为 什 么 银 行 客 户 端 登 陆 出 现 网 络 错 误 ？", "为什么 银行 客户端 登陆 出现 网络 错误 ？", 0).
S203. Construct the training data set: combine all positive and negative examples obtained in steps S201 and S202 and shuffle their order to build the final training data set; every example, whether positive or negative, contains five dimensions: Q1-char, Q1-word, Q2-char, Q2-word, and 0 or 1.
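As an illustration only, steps S201 to S203 can be sketched as follows; the knowledge base is assumed here to be a list of (sentence1, sentence2) matching pairs, build_dataset and neg_per_pos are hypothetical names, and to_char_level and to_word_level are the helpers sketched in step S102.
import random

def build_dataset(matching_pairs, neg_per_pos=1, seed=1234):
    rng = random.Random(seed)
    examples = []
    sentences = [s for pair in matching_pairs for s in pair]
    for q1, q2 in matching_pairs:
        # Positive example: a sentence and its standard matching sentence, label 1.
        examples.append((to_char_level(q1), to_word_level(q1),
                         to_char_level(q2), to_word_level(q2), 1))
        # Negative examples: pair q1 with randomly chosen non-matching sentences, label 0.
        for _ in range(neg_per_pos):
            q_neg = rng.choice(sentences)
            while q_neg in (q1, q2):
                q_neg = rng.choice(sentences)
            examples.append((to_char_level(q1), to_word_level(q1),
                             to_char_level(q_neg), to_word_level(q_neg), 0))
    rng.shuffle(examples)  # shuffle the order, as in step S203
    return examples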
S3. Construct the multi-granularity fusion model: as shown in Figure 6, the core of the present invention is the multi-granularity fusion model, which has four main parts: the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interactive matching layer, and the prediction layer. First, the multi-granularity embedding layer is constructed to map the words and characters of a sentence to vectors, yielding the word-level and character-level sentence vectors; next, the multi-granularity fusion coding layer encodes the word-level and character-level sentence vectors to obtain the sentence semantic feature vector; then, the interactive matching layer compares the sentence semantic feature vectors hierarchically to obtain the matching representation vector of the sentence pair; finally, the Sigmoid function of the prediction layer judges the degree of semantic matching of the sentence pair. As shown in Figure 4, the details are as follows:
S301. Construct the character/word mapping table, as follows:
S30101. The character/word vocabulary is built from the text matching knowledge base obtained after preprocessing;
S30102. After the character/word vocabulary is built, every character and word in it is mapped to a unique numeric identifier; the mapping rule is: start from the number 1 and assign identifiers in increasing order according to the order in which each character or word was entered into the vocabulary, thereby forming the character/word mapping table;
Example: for the content processed in step S102, i.e., the character-level and word-level forms of "还款期限可以延后一天吗？", the character/word vocabulary and the character/word mapping table are built accordingly (the example table appears only as an image in the original).
S30103. Afterwards, Word2Vec is used to train the character/word vector model, yielding the character/word vector matrix weights embedding_matrix;
Example: in Keras, an implementation of the above is as follows:
import gensim
import numpy
import keras

# Fit the tokenizer first so that its word_index is available below.
tokenizer = keras.preprocessing.text.Tokenizer(num_words=len(word_set))
tokenizer.fit_on_texts(w2v_corpus)
# Train the Word2Vec character/word vector model on the knowledge base.
w2v_model = gensim.models.Word2Vec(w2v_corpus, size=embedding_dim,
                                   window=5, min_count=1, sg=1,
                                   workers=4, seed=1234, iter=25)
# Build the embedding matrix: row idx holds the vector of the token whose
# numeric identifier is idx (row 0 is reserved for padding).
embedding_matrix = numpy.zeros([len(tokenizer.word_index) + 1, embedding_dim])
for word, idx in tokenizer.word_index.items():
    embedding_matrix[idx, :] = w2v_model.wv[word]
where w2v_corpus is the training corpus, i.e., all data in the text matching knowledge base; embedding_dim is the dimension of the character/word vectors, set to 300 in the present invention; and word_set is the character/word vocabulary.
S302. Construct the input layer, as follows:
S30201. The input layer has four inputs; the two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char, and Q2-word, formalized as (Q1-char, Q1-word, Q2-char, Q2-word);
S30202. Every character and word in an input sentence is converted into its numeric identifier according to the character/word mapping table built in step S301.
Example: the present invention uses the positive example text shown in step S201 to form one input record, as follows:
("还 款 期 限 可 以 延 后 一 天 吗 ？", "还款 期限 可以 延后 一天 吗 ？", "是 否 可 以 申 请 延 期 一 天 还 款 ？", "是否 可以 申请 延期 一天 还款 ？")
According to the mapping in the character/word vocabulary, the above input data are converted into numeric representations (assuming the characters and words that appear in sentence 2 but not in sentence 1 are mapped as 是: 18, 否: 19, 申: 20, 请: 21, 是否: 22, 申请: 23, 延期: 24), with the following result:
("1,2,3,4,5,6,7,8,9,10,11,12", "13,14,15,16,17,11,12", "18,19,5,6,20,21,7,3,9,10,1,2,12", "22,15,23,24,17,13,12");
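As an illustration only, the conversion from preprocessed text to identifier sequences can be sketched with the Keras tokenizer fitted in step S301; to_id_sequence is a hypothetical helper, and max_len is an assumed maximum sequence length (the patent does not specify padding).
from keras.preprocessing.sequence import pad_sequences

def to_id_sequence(segmented_sentences, tokenizer, max_len):
    # segmented_sentences: space-separated character- or word-level sentences.
    seqs = tokenizer.texts_to_sequences(segmented_sentences)
    # Pad (or truncate) every sequence to the fixed input length.
    return pad_sequences(seqs, maxlen=max_len, padding='post')

q1_char_ids = to_id_sequence(["还 款 期 限 可 以 延 后 一 天 吗 ？"],
                             tokenizer, max_len=20)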
S303. Construct the multi-granularity embedding layer: map the words and characters of a sentence to vectors to obtain the word-level and character-level sentence vectors; as shown in Figure 7, the details are as follows:
S30301. Initialize the weight parameters of this layer by loading the character/word vector matrix weights trained in step S301;
S30302. For the input sentences Q1 and Q2, the multi-granularity embedding layer produces their word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, and Q2-char Emd; every sentence in the text matching knowledge base can thus have its text information converted into vector form through character/word vector mapping. In the present invention, embedding_dim is set to 300.
Example: in Keras, this layer is implemented with an Embedding layer (the code listing appears only as an image in the original; a reconstructed sketch is given below), where embedding_matrix is the character/word vector matrix weight trained in step S301, embedding_matrix.shape[0] is the size of the vocabulary (dictionary) of the character/word vector matrix, embedding_dim is the dimension of the output character/word vectors, and input_length is the length of the input sequence.
After processing by the multi-granularity embedding layer, the texts Q1 and Q2 yield the word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd, and Q2-char Emd.
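A minimal reconstruction of that embedding layer, based only on the parameter description above; since the original listing is an image, details such as the trainable flag and the sequence length are assumptions.
from keras.layers import Embedding, Input

max_len = 20  # assumed input sequence length
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_len,
                            trainable=False)  # assumption: keep Word2Vec weights fixed

q1_char = Input(shape=(max_len,))
q1_char_emd = embedding_layer(q1_char)  # Q1-char Emd; likewise for the other three inputs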
S304. Construct the multi-granularity fusion coding layer: as shown in Figure 8, encode the word-level and character-level sentence vectors to obtain the sentence semantic feature vector. Constructing the multi-granularity fusion coding layer in step S304 means taking the word-level and character-level sentence vectors output by the multi-granularity embedding layer in step S303 as input and extracting text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction; the features from the two perspectives are then integrated by element-wise addition to obtain the final sentence semantic feature vector. For sentence Q1, the final sentence semantic feature vector is obtained as follows:
S30401. Character-level semantic feature extraction proceeds as follows:
S3040101. Apply an LSTM for feature extraction, obtaining the first-stage feature vector; schematically (using the variable definitions given below):
Q′_i = LSTM(Q_i)
S3040102. The first-stage output is then encoded further in two different ways, as follows:
①: Apply a second LSTM for secondary feature extraction, obtaining the corresponding feature vector; schematically:
Q″_i = LSTM(Q′_i)
②: Apply the attention mechanism (Attention) to the first-stage output to extract features, obtaining the corresponding attention feature vector, denoted here A′ (the original symbol appears only as a formula image); schematically:
A′ = Attention(Q′)
S3040103. Apply Attention to the second-stage output to encode once more and extract key features, obtaining the feature vector denoted here A″; schematically:
A″ = Attention(Q″)
S3040104. Add A′ and A″ element-wise to obtain the character-level semantic features of the sentence;
where i denotes the relative position of the corresponding character vector in the sentence; Q_i is the vector representation of each character in sentence Q1; Q′_i is the vector representation of each character after the first LSTM encoding; and Q″_i is the vector representation of each character after the second LSTM encoding;
S30402. Word-level semantic feature extraction proceeds as follows:
S3040201. Apply an LSTM for feature extraction, obtaining the first-stage feature vector; schematically:
Q′_i′ = LSTM(Q_i′)
S3040202. Apply a second LSTM to the first-stage output for secondary feature extraction, obtaining the corresponding feature vector; schematically:
Q″_i′ = LSTM(Q′_i′)
S3040203. Apply Attention to the second-stage output to encode once more and extract key features, obtaining the word-level feature vector;
where i′ denotes the relative position of the corresponding word vector in the sentence; Q_i′ is the vector representation of each word in sentence Q1; Q′_i′ is the vector representation of each word after the first LSTM encoding; and Q″_i′ is the vector representation of each word after the second LSTM encoding;
S30403. Steps S30401 and S30402 yield the character-level feature vector and the word-level feature vector, respectively. In the multi-granularity fusion coding layer, the encoding dimension is uniformly set to 300 in the present invention; the two feature vectors are added element-wise to obtain the final sentence semantic feature vector for text Q1.
The final sentence semantic feature vector for sentence Q2 is obtained in the same way, following steps S30401 to S30403.
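As an illustration only, the two encoding branches of the fusion coding layer can be sketched in Keras as below. The attention formulas appear only as images in the original, so a generic soft attention (per-position score, softmax over time steps, reweighting) is assumed; soft_attention, char_branch, and word_branch are hypothetical helper names, and the character- and word-level inputs are assumed padded to the same length so that the element-wise additions are well defined.
from keras.layers import LSTM, Dense, Softmax, Lambda, Add

def soft_attention(h):
    # Assumed attention: per-position score -> softmax over time -> reweight h.
    scores = Dense(1)(h)               # shape (batch, steps, 1)
    weights = Softmax(axis=1)(scores)  # normalize over the time axis
    return Lambda(lambda t: t[0] * t[1])([h, weights])

def char_branch(char_emd):
    h1 = LSTM(300, return_sequences=True)(char_emd)  # first LSTM encoding (Q')
    h2 = LSTM(300, return_sequences=True)(h1)        # second LSTM encoding (Q'')
    # Character-level feature: element-wise sum of the two attention outputs.
    return Add()([soft_attention(h1), soft_attention(h2)])

def word_branch(word_emd):
    h1 = LSTM(300, return_sequences=True)(word_emd)
    h2 = LSTM(300, return_sequences=True)(h1)
    return soft_attention(h2)                        # word-level feature

# Final sentence semantic feature vector for Q1: element-wise addition
# of the character-level and word-level features.
q1_vec = Add()([char_branch(q1_char_emd), word_branch(q1_word_emd)])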
S305. Construct the interactive matching layer: compare the sentence semantic feature vectors hierarchically to obtain the matching representation vector of the sentence pair; as shown in Figure 9, the details are as follows:
S30501. Step S304 yields the sentence semantic feature vectors of Q1 and Q2. Three operations are applied to this pair of vectors: subtraction, cross product, and dot product, yielding three interaction feature vectors.
Here the dot product (also called the scalar product) gives the length of the projection of one vector onto the direction of the other and is a scalar; the cross product (also called the vector product) gives a vector perpendicular to both of the input vectors.
At the same time, a fully connected layer (Dense) is used to further encode the two sentence semantic feature vectors, yielding two Dense-encoded representations;
where i denotes the relative position of the corresponding semantic feature in the sentence; Q1_i is the vector representation of each semantic feature of text Q1 extracted in step S304, and Q2_i is the corresponding representation for text Q2; the Dense-encoded representations are the feature vectors obtained by further Dense extraction from the sentence semantic feature vectors, and the encoding dimension is 300;
S30502. Concatenate the three interaction feature vectors from step S30501 into a single vector.
At the same time, the same subtraction and cross-product operations are applied to the two Dense-encoded representations, and the two resulting vectors are likewise concatenated.
S30503. The concatenated interaction vector is further processed by two fully connected layers for feature extraction, and the result is summed with the concatenated vector of the Dense branch.
S30504. The summed vector is encoded by one more fully connected layer, and the result is summed with a vector obtained in step S30501 (identified in the original only by a formula image) to obtain the matching representation vector of the sentence pair.
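As an illustration only, the interaction operations of steps S30501 to S30504 can be sketched as below. Since the exact formulas appear only as images, several assumptions are made plainly here: the sentence vectors are assumed pooled to fixed-length 300-dimensional vectors; element-wise multiplication is used as a practical stand-in for the "cross product" (a true vector product is not defined in 300 dimensions); and the layer widths and the operands of the final sums follow the reading given in the text above rather than the original figures.
from keras.layers import Add, Concatenate, Dense, Dot, Multiply, Subtract

def matching_layer(q1_vec, q2_vec):
    sub = Subtract()([q1_vec, q2_vec])    # subtraction
    mul = Multiply()([q1_vec, q2_vec])    # element-wise stand-in for "cross product"
    dot = Dot(axes=-1)([q1_vec, q2_vec])  # dot product (a scalar per pair)
    m1 = Concatenate()([sub, mul, dot])   # first concatenation (step S30502)

    d1 = Dense(300)(q1_vec)               # Dense-encoded representations (S30501)
    d2 = Dense(300)(q2_vec)
    m2 = Concatenate()([Subtract()([d1, d2]), Multiply()([d1, d2])])

    h = Dense(600, activation='relu')(m1)  # two fully connected layers (S30503)
    h = Dense(600, activation='relu')(h)
    m3 = Add()([h, m2])                    # sum of the two branches

    # One more fully connected layer, summed with the Dense branch (S30504).
    return Add()([Dense(600)(m3), Concatenate()([d1, d2])])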
S306. Construct the prediction layer: the Sigmoid function of the prediction layer judges the degree of semantic matching of the sentence pair, as follows:
S30601. The prediction layer receives the matching representation vector output in step S305 and applies the Sigmoid function, obtaining a matching degree y_pred in [0, 1];
S30602. y_pred is compared with a preset threshold to judge how well the sentence pair matches, as follows:
①: when y_pred ≥ 0.5, sentence Q1 and sentence Q2 match;
②: when y_pred < 0.5, sentence Q1 and sentence Q2 do not match.
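A minimal sketch of this layer, assuming match_vec is the matching representation vector produced by the previous step:
from keras.layers import Dense

y_pred = Dense(1, activation='sigmoid')(match_vec)  # matching degree in [0, 1]

# At inference time, compare the prediction with the 0.5 threshold:
# is_match = (model.predict(inputs) >= 0.5)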
S4. Train the multi-granularity fusion model; as shown in Figure 5, the details are as follows:
S401. Construct the loss function: a balanced cross entropy is designed by using the mean square error (MSE) as a balancing factor for the cross entropy, where the mean square error, taken over the n training examples, is
L_MSE = (1/n) Σ (y_true − y_pred)²
where y_true is the true label, i.e., the 0/1 flag in each training example indicating a match or not, and y_pred is the predicted result;
when the classification boundary is blurred, the balanced cross entropy automatically balances positive and negative samples and improves classification accuracy; it fuses the cross entropy with the mean square error (the fusion formula itself appears only as an image in the original).
The present invention designs this loss function to prevent overfitting. In most existing deep learning applications, cross entropy is the usual loss function for training models; however, training based on maximum likelihood estimation introduces input noise and may push training samples hard toward 0 or 1, leading to overfitting. Moreover, according to our survey, relatively little work has been devoted to designing new loss functions. The present invention proposes to use the mean square error (MSE) as a balancing parameter between positive and negative samples, thereby greatly improving the performance of the model.
In most classification tasks the cross-entropy loss function, shown below, is usually the first choice:
L_CE = −(1/n) Σ [ y_true · log(y_pred) + (1 − y_true) · log(1 − y_pred) ]
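As an illustration only: the exact fusion of the two terms is shown in the original only as an image, so the sketch below assumes one plausible form consistent with the description, namely using the MSE term as a multiplicative balancing factor on the cross entropy; it plays the role of the custom loss L_loss passed to model.compile in step S402.
from keras import backend as K

def balanced_cross_entropy(y_true, y_pred):
    # Standard binary cross entropy.
    ce = K.binary_crossentropy(y_true, y_pred)
    # Mean square error used as a balancing factor (assumed fusion form).
    mse = K.square(y_true - y_pred)
    return K.mean(mse * ce, axis=-1)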
S402. Optimize and train the model: the RMSprop optimizer is used as the optimization function of this model, with all hyperparameters set to the Keras defaults. The model is optimized and trained on the training data set.
Example: in Keras, the optimization function and settings described above are expressed as:
optim = keras.optimizers.RMSprop()
model = keras.models.Model(inputs=[q1_char, q1_word, q2_char, q2_word],
                           outputs=[y_pred])
model.compile(loss=L_loss, optimizer=optim,
              metrics=['accuracy', precision, recall, f1_score])
where the loss function L_loss is the custom loss defined in step S401; the optimizer is the optim defined above; q1_char, q1_word, q2_char, and q2_word are the model input tensors (Q1-char, Q1-word, Q2-char, Q2-word) and y_pred is the model output; as evaluation metrics, the present invention selects accuracy, precision, recall, and the F1-score computed from precision and recall (precision, recall, and f1_score here are custom metric functions).
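A hypothetical training call on the data set built in step S203 is shown below; q1_char_ids and the other arrays are the padded identifier sequences from the input-layer sketch, and the batch size, epoch count, and validation split are assumptions rather than values specified in the patent.
model.fit([q1_char_ids, q1_word_ids, q2_char_ids, q2_word_ids],
          labels,
          batch_size=128,        # assumed
          epochs=25,             # assumed
          validation_split=0.1)  # assumed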
The model of the present invention achieves better results than current models on the LCQMC public data set; the comparison of experimental results is given in a table that appears only as an image in the original.
The first fourteen rows of that table are the experimental results of prior art models [Liu, X., Chen, Q., Deng, C., Zeng, H., Chen, J., Li, D., Tang, B., 2018. LCQMC: A large-scale Chinese question matching corpus, in: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952-1962]. Comparing the model of the present invention with the existing models shows that the method of the present invention performs best.
实施例2:Example 2:
如附图10所示,本发明的基于多粒度融合模型的中文句子语义智能匹配装置,该装置包括,As shown in FIG. 10, the intelligent matching device for Chinese sentence semantics based on the multi-granularity fusion model of the present invention includes:
文本匹配知识库构建单元,用于使用爬虫程序,在互联网公共问答平台爬取问题集,或者使用网上公开的文本匹配数据集,作为原始相似句子知识库,再对原始相似句子知识库进行预处理,主要操作为对原始相似句子知识库中的每个句子进行断字处理和分词处理,从而构建用于模型训练的文本匹配知识库;文本匹配知识库构建单元包括,The text matching knowledge base building unit is used to use crawlers to crawl question sets on the Internet public question and answer platform, or use the text matching data set published on the Internet as the original similar sentence knowledge base, and then preprocess the original similar sentence knowledge base , The main operation is to perform hyphenation and word segmentation on each sentence in the original similar sentence knowledge base, thereby constructing a text matching knowledge base for model training; the text matching knowledge base building unit includes:
a raw data crawling subunit, used to crawl question sets from public Internet question-answering platforms, or to use a publicly available text matching data set, to build the original similar-sentence knowledge base;
a raw data processing subunit, used to perform character segmentation and word segmentation on the sentences in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base used for model training, as sketched below;
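For example: a minimal sketch of the preprocessing performed by the raw data processing subunit, assuming the widely used jieba segmenter for word segmentation (the present disclosure does not prescribe a particular segmentation tool):

import jieba

def preprocess_sentence(sentence):
    # word segmentation: each Chinese word separated by a space
    word_level = ' '.join(jieba.cut(sentence))
    # character segmentation: each character separated by a space,
    # keeping digits, punctuation and special characters
    char_level = ' '.join(list(sentence))
    return char_level, word_level

# e.g. preprocess_sentence('如何办理信用卡') may yield
#   ('如 何 办 理 信 用 卡', '如何 办理 信用卡')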
a training data set generation unit, used to construct training positive example data and training negative example data from the sentences in the text matching knowledge base, and to construct the final training data set on the basis of the positive and negative example data; the training data set generation unit comprises:
a training positive example data construction subunit, used to combine semantically matched sentences in the text matching knowledge base and attach the matching label 1 to them, constructing the training positive example data;
a training negative example data construction subunit, used to first select a sentence q1 from the text matching knowledge base, then randomly select from the text matching knowledge base a sentence q2 that does not semantically match q1, combine q1 with q2 and attach the matching label 0 to them, constructing the training negative example data;
a training data set construction subunit, used to combine all the training positive example data and training negative example data and shuffle their order, thereby constructing the final training data set, as sketched below;
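For example: the three subunits above can be sketched together as follows (a minimal sketch; positive_pairs and all_sentences are assumed to come from the text matching knowledge base):

import random

def build_training_set(positive_pairs, all_sentences):
    # positive examples: semantically matched sentence pairs, matching label 1
    data = [(q1, q2, 1) for q1, q2 in positive_pairs]
    # negative examples: pair q1 with a randomly chosen non-matching sentence, label 0
    for q1, q2 in positive_pairs:
        neg = random.choice(all_sentences)
        while neg == q2:  # crude guard against sampling the true match
            neg = random.choice(all_sentences)
        data.append((q1, neg, 0))
    random.shuffle(data)  # shuffle positives and negatives together
    return data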
a multi-granularity fusion model construction unit, used to construct the character-word mapping conversion table and, at the same time, the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interaction matching layer and the prediction layer; the multi-granularity fusion model construction unit comprises:
a character-word mapping conversion table construction subunit, used to segment each sentence in the text matching knowledge base into characters and words and to store each character and word in turn in a list, thereby obtaining a character-word table; subsequently, starting from the number 1, identifiers are assigned in ascending order following the order in which each character and word was entered into the character-word table, thereby forming the character-word mapping conversion table required by the present invention; after the character-word mapping conversion table is constructed, each character and word in the table is mapped to a unique numeric identifier; thereafter, Word2Vec is used to train the character-word vector model, obtaining the character-word vector matrix weights (a sketch of this subunit is given after this list of subunits);
an input layer construction subunit, used to convert each character and word in the input sentences into the corresponding numeric identifier according to the character-word mapping conversion table, thereby completing the data input; specifically, q1 and q2 are respectively obtained and formalized as (q1-char, q1-word, q2-char, q2-word);
a multi-granularity embedding layer construction subunit, used to load the pre-trained character-word vector weights and convert the characters and words of the input sentences into character-word vector form, thereby constituting the complete sentence vector representations; this operation is completed by looking up the character-word vector matrix according to the numeric identifiers of the characters and words;
a multi-granularity fusion coding layer construction subunit, used to take the word-level and character-level sentence vectors output by the multi-granularity embedding layer as input; text semantic features are first obtained from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction, and the text semantic features of the two perspectives are then integrated by element-wise addition to obtain the final sentence semantic feature vectors;
an interaction matching layer construction subunit, used to subject the two input sentence semantic feature vectors to hierarchical matching calculation to obtain the matching representation vector of the sentence pair;
a prediction layer construction subunit, used to receive the matching representation vector output by the interaction matching layer, compute with the Sigmoid function a matching degree lying in [0, 1], and finally judge the matching degree of the sentence pair by comparison with the established threshold;
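For example: a minimal sketch of the character-word table and vector matrix produced by the first subunit above, assuming gensim 4.x for Word2Vec training (corpus is assumed to be the list of token sequences, characters and words alike, produced by the preprocessing sketched earlier; 300 is an assumed embedding dimension):

import numpy as np
from gensim.models import Word2Vec

# corpus: e.g. [['如', '何', ...], ['如何', '办理', ...], ...]
w2v = Word2Vec(sentences=corpus, vector_size=300, min_count=1)

# mapping table: numeric identifiers start at 1 (row 0 is kept for padding)
vocab = {token: idx + 1 for idx, token in enumerate(w2v.wv.index_to_key)}

# character-word vector matrix weights indexed by those identifiers
embedding_matrix = np.zeros((len(vocab) + 1, 300))
for token, idx in vocab.items():
    embedding_matrix[idx] = w2v.wv[token]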
a multi-granularity fusion model training unit, used to construct the loss function needed during model training and to complete the optimization training of the model; the multi-granularity fusion model training unit comprises:
a loss function construction subunit, used to construct the loss function and calculate the error of the text matching degree between sentence 1 and sentence 2;
a model optimization training subunit, used to train and adjust the parameters of model training, thereby reducing the error between the matching degree predicted for sentence 1 and sentence 2 during model training and the true matching degree; an end-to-end sketch of how these units fit together is given below.
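For example: a greatly simplified end-to-end sketch of the data flow through the above units in Keras; the attention encoding of the multi-granularity fusion coding layer and the hierarchical interaction matching are reduced here to plain LSTM encoding plus concatenation and Dense layers, so this illustrates the data flow only, not the full model of Embodiment 1 (embedding_matrix, L_loss, precision, recall and f1_score are taken from the sketches above; MAX_LEN is an assumed sequence length):

from keras.layers import Input, Embedding, LSTM, Dense, add, concatenate
from keras.models import Model

MAX_LEN, DIM = 40, 300

# shared embedding initialized with the pre-trained character-word vector weights
emb = Embedding(embedding_matrix.shape[0], DIM,
                weights=[embedding_matrix], trainable=True)

def encode(seq_input):
    # two stacked LSTMs stand in for the two-stage encoding of one granularity
    h = LSTM(DIM, return_sequences=True)(emb(seq_input))
    return LSTM(DIM)(h)

q1_char, q1_word = Input((MAX_LEN,)), Input((MAX_LEN,))
q2_char, q2_word = Input((MAX_LEN,)), Input((MAX_LEN,))

# fuse character-level and word-level features by element-wise addition
v1 = add([encode(q1_char), encode(q1_word)])
v2 = add([encode(q2_char), encode(q2_word)])

# simplified interaction: concatenation followed by Dense layers and a sigmoid
merged = concatenate([v1, v2])
y_pred = Dense(1, activation='sigmoid')(Dense(300, activation='relu')(merged))

model = Model([q1_char, q1_word, q2_char, q2_word], [y_pred])
model.compile(loss=L_loss, optimizer='rmsprop',
              metrics=['accuracy', precision, recall, f1_score])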
The device for intelligent semantic matching of Chinese sentences based on the multi-granularity fusion model shown in FIG. 10 can be integrated and deployed on various hardware devices, such as personal computers, workstations and smart mobile devices.
Embodiment 3:
A storage medium based on Embodiment 1, in which a plurality of instructions are stored; the instructions are loaded by a processor to execute the steps of the method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model of Embodiment 1.
Embodiment 4:
An electronic device based on Embodiment 3, the electronic device comprising:
the storage medium of Embodiment 3; and
a processor, used to execute the instructions in the storage medium.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some or all of their technical features can be equivalently replaced, and such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model, characterized in that the method specifically comprises:
    S1. Constructing a text matching knowledge base;
    S2. Constructing a training data set for the text matching model;
    S3. Constructing the multi-granularity fusion model, specifically:
    S301. Constructing a character-word mapping conversion table;
    S302. Constructing an input layer;
    S303. Constructing a multi-granularity embedding layer: performing vector mapping on the words and characters in a sentence to obtain a word-level sentence vector and a character-level sentence vector;
    S304. Constructing a multi-granularity fusion coding layer: encoding the word-level sentence vector and the character-level sentence vector to obtain a sentence semantic feature vector;
    S305. Constructing an interaction matching layer: performing hierarchical comparison of the sentence semantic feature vectors to obtain a matching representation vector of the sentence pair;
    S306. Constructing a prediction layer: judging the degree of semantic matching of the sentence pair after processing by the Sigmoid function of the prediction layer;
    S4. Training the multi-granularity fusion model.
  2. The method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 1, characterized in that the construction of the text matching knowledge base in step S1 is specifically as follows:
    S101. Obtaining raw data with a crawler: crawling question sets from public Internet question-answering platforms to obtain an original similar-sentence knowledge base; or using a publicly available sentence matching data set as the original similar-sentence knowledge base;
    S102. Preprocessing the raw data: preprocessing the similar texts in the original similar-sentence knowledge base by performing word segmentation and character segmentation on each sentence to obtain the text matching knowledge base; the word segmentation takes each Chinese word as the basic unit and segments each piece of data into words; the character segmentation takes each Chinese character as the basic unit and segments each piece of data into characters; the Chinese characters or words are separated by spaces, and all content of each piece of data, including digits, punctuation and special characters, is retained;
    the construction of the training data set of the text matching model in step S2 is specifically as follows:
    S201. Constructing training positive examples: combining a sentence with its corresponding semantically matched sentence to construct a training positive example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 1);
    where Q1-char denotes sentence 1 at character-level granularity; Q1-word denotes sentence 1 at word-level granularity; Q2-char denotes sentence 2 at character-level granularity; Q2-word denotes sentence 2 at word-level granularity; and 1 indicates that the texts of sentence 1 and sentence 2 match, i.e. a positive example;
    S202. Constructing training negative examples: selecting a sentence Q1, then randomly selecting from the text matching knowledge base a sentence Q2 that does not match sentence Q1, and combining Q1 with Q2 to construct a negative example, formalized as (Q1-char, Q1-word, Q2-char, Q2-word, 0);
    where Q1-char, Q1-word, Q2-char and Q2-word are as defined above, and 0 indicates that the texts of sentence Q1 and sentence Q2 do not match, i.e. a negative example;
    S203. Constructing the training data set: combining all positive and negative example samples obtained through steps S201 and S202 and shuffling their order to construct the final training data set; each piece of data, whether positive or negative, contains five dimensions, namely Q1-char, Q1-word, Q2-char, Q2-word, and 0 or 1.
  3. The method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 1 or 2, characterized in that the construction of the character-word mapping conversion table in step S301 is specifically as follows:
    S30101. The character-word table is constructed from the text matching knowledge base obtained after preprocessing;
    S30102. After the character-word table is constructed, each character and word in the table is mapped to a unique numeric identifier; the mapping rule is: starting from the number 1, identifiers are assigned in ascending order following the order in which each character or word was entered into the character-word table, thereby forming the character-word mapping conversion table;
    S30103. Word2Vec is used to train the character-word vector model, obtaining the character-word vector matrix weights embedding_matrix;
    the construction of the input layer in step S302 is specifically as follows:
    S30201. The input layer has four inputs; the two sentences to be matched are preprocessed to obtain Q1-char, Q1-word, Q2-char and Q2-word respectively, formalized as (Q1-char, Q1-word, Q2-char, Q2-word);
    S30202. Each character and word in the input sentences is converted into its corresponding numeric identifier according to the character-word mapping conversion table constructed in step S301.
  4. The method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 3, characterized in that the construction of the multi-granularity embedding layer in step S303 is specifically as follows:
    S30301. The weight parameters of the current layer are initialized by loading the character-word vector matrix weights trained in step S301;
    S30302. For the input sentences Q1 and Q2, the word-level and character-level sentence vectors Q1-word Emd, Q1-char Emd, Q2-word Emd and Q2-char Emd are obtained after processing by the multi-granularity embedding layer; every sentence in the text matching knowledge base can thereby have its text information converted into vector form through character-word vector mapping;
    the construction of the multi-granularity fusion coding layer in step S304 takes the word-level and character-level sentence vectors output by the multi-granularity embedding layer in step S303 as input and obtains text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction; the text semantic features of the two perspectives are then integrated by element-wise addition to obtain the final sentence semantic feature vector; for sentence Q1, the final sentence semantic feature vector is obtained as follows:
    S30401. Character-level semantic feature extraction, specifically:
    S3040101. An LSTM is used for feature extraction to obtain the feature vector Q′; the formula is:

    Q′_i = LSTM(Q_i)

    S3040102. Two different encoding methods are then further applied to Q′, specifically:

    ①. A second LSTM is applied to Q′ for secondary feature extraction, obtaining the corresponding feature vector Q″; the formula is:

    Q″_i = LSTM(Q′_i)

    ②. The attention mechanism Attention is applied to Q′ to extract features, obtaining the corresponding feature vector A^c; the formula is:

    A^c = Attention(Q′)

    S3040103. Attention is used again to encode Q″ and extract key features, obtaining the feature vector B^c; the formula is:

    B^c = Attention(Q″)

    S3040104. A^c and B^c are added element-wise to obtain the character-level semantic feature F^c; the formula is:

    F^c = A^c + B^c

    where i denotes the relative position of the corresponding character vector in the sentence; Q_i is the vector representation of each character in sentence Q1; Q′_i is the vector representation of each character after the first LSTM encoding; and Q″_i is the vector representation of each character after the second LSTM encoding;
    S30402. Word-level semantic feature extraction, specifically:

    S3040201. An LSTM is used for feature extraction to obtain the feature vector Q′; the formula is:

    Q′_i′ = LSTM(Q_i′)

    S3040202. A further LSTM is applied to Q′ for secondary feature extraction, obtaining the corresponding feature vector Q″; the formula is:

    Q″_i′ = LSTM(Q′_i′)

    S3040203. Attention is used again to encode Q″ and extract key features, obtaining the word-level feature vector F^w; the formula is:

    F^w = Attention(Q″)

    where i′ denotes the relative position of the corresponding word vector in the sentence; Q_i′ is the vector representation of each word in sentence Q1; Q′_i′ is the vector representation of each word after the first LSTM encoding; and Q″_i′ is the vector representation of each word after the second LSTM encoding;
    S30403. Steps S30401 and S30402 yield the character-level feature vector F^c and the word-level feature vector F^w; F^c and F^w are added element-wise to obtain the final sentence semantic feature vector v_Q1 for text Q1; the formula is:

    v_Q1 = F^c + F^w

    The final sentence semantic feature vector v_Q2 for sentence Q2 is obtained by the same method as steps S30401 to S30403.
  5. The method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 4, characterized in that the construction of the interaction matching layer in step S305 is specifically as follows:
    S30501. The processing of step S304 yields the sentence semantic feature vectors v_Q1 and v_Q2 of Q1 and Q2; subtraction, cross multiplication (element-wise multiplication) and dot multiplication are performed on v_Q1 and v_Q2 to obtain the interaction features m_sub, m_mul and m_dot; the formulas are:

    m_sub = v_Q1 − v_Q2
    m_mul = v_Q1 ⊗ v_Q2
    m_dot = v_Q1 · v_Q2

    At the same time, a fully connected layer Dense is used to further encode v_Q1 and v_Q2, obtaining d_Q1 and d_Q2; the formulas are:

    d_Q1 = Dense(v_Q1)
    d_Q2 = Dense(v_Q2)

    where i denotes the relative position of the corresponding semantic feature in the sentence; Q1_i is the vector representation of each semantic feature in v_Q1, obtained for text Q1 by the feature extraction of step S304; Q2_i is the vector representation of each semantic feature in v_Q2, obtained for text Q2 by the feature extraction of step S304; d_Q1 and d_Q2 are the feature vectors obtained by further Dense extraction of the sentence semantic feature vectors v_Q1 and v_Q2; and the encoding dimension is 300;
    S30502. d_Q1 and d_Q2 are concatenated to obtain c_1; the formula is:

    c_1 = [d_Q1; d_Q2]

    At the same time, subtraction and cross multiplication are likewise performed on d_Q1 and d_Q2; the formulas are:

    n_sub = d_Q1 − d_Q2
    n_mul = d_Q1 ⊗ d_Q2

    and the two results are then concatenated to obtain c_2; the formula is:

    c_2 = [n_sub; n_mul]
    S30503. Feature extraction is performed on c_2 with two fully connected layers to obtain p, and p and c_1 are summed to obtain g; the formulas are:

    p′ = Dense(c_2)
    p = Dense(p′)
    g = p + c_1
    S30504. g is encoded by one further fully connected layer, and the result is summed with the interaction features obtained in step S30501 to obtain the matching representation vector r of the sentence pair; the formula is:

    r = Dense(g) + [m_sub; m_mul; m_dot]
    the construction of the prediction layer in step S306 is specifically as follows:
    S30601. The prediction layer receives the matching representation vector output by step S305 and computes, with the Sigmoid function, a matching degree y_pred lying in [0, 1];
    S30602. y_pred is compared with the established threshold to judge the matching degree of the sentence pair, specifically:
    ①. When y_pred ≥ 0.5, sentence Q1 and sentence Q2 match;
    ②. When y_pred < 0.5, sentence Q1 and sentence Q2 do not match.
  6. The method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 1, characterized in that the training of the multi-granularity fusion model in step S4 is specifically as follows:
    S401. Constructing the loss function: a balanced cross entropy is designed by setting the mean square error as the balance factor of the cross entropy, where the formula of the mean square error is:

    L_mse = (1/n) Σ_i (y_true(i) − y_pred(i))²

    where y_true denotes the true label, i.e. the 0/1 flag in each training example indicating match or mismatch, and y_pred denotes the prediction result;
    when the classification boundary is blurred, the use of the balanced cross entropy can automatically balance positive and negative samples and improve classification accuracy; it fuses the cross entropy with the mean square error, with a formula of the form:

    L_loss = −Σ_i (y_true(i) − y_pred(i))² · [ y_true(i) · log y_pred(i) + (1 − y_true(i)) · log(1 − y_pred(i)) ]

    S402. Optimizing the training model: the RMSprop optimization function is selected as the optimization function of this model, with all hyperparameters left at their default values in Keras.
  7. A device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model, characterized in that the device comprises:
    a text matching knowledge base construction unit, configured to crawl question sets from public Internet question-answering platforms with a crawler program, or to use a publicly available text matching data set, as the original similar-sentence knowledge base, and then to preprocess the original similar-sentence knowledge base, mainly by performing character segmentation and word segmentation on each sentence in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base used for model training;
    a training data set generation unit, configured to construct training positive example data and training negative example data from the sentences in the text matching knowledge base, and to construct the final training data set on the basis of the positive and negative example data;
    a multi-granularity fusion model construction unit, configured to construct the character-word mapping conversion table and, at the same time, the input layer, the multi-granularity embedding layer, the multi-granularity fusion coding layer, the interaction matching layer and the prediction layer; wherein the multi-granularity fusion model construction unit comprises:
    a character-word mapping conversion table construction subunit, configured to segment each sentence in the text matching knowledge base into characters and words and to store each character and word in turn in a list, thereby obtaining a character-word table; subsequently, starting from the number 1, identifiers are assigned in ascending order following the order in which each character and word was entered into the character-word table, thereby forming the character-word mapping conversion table required by the present invention; after the character-word mapping conversion table is constructed, each character and word in the table is mapped to a unique numeric identifier; thereafter, Word2Vec is used to train the character-word vector model, obtaining the character-word vector matrix weights;
    an input layer construction subunit, configured to convert each character and word in the input sentences into the corresponding numeric identifier according to the character-word mapping conversion table, thereby completing the data input; specifically, q1 and q2 are respectively obtained and formalized as (q1-char, q1-word, q2-char, q2-word);
    a multi-granularity embedding layer construction subunit, configured to load the pre-trained character-word vector weights and convert the characters and words of the input sentences into character-word vector form, thereby constituting the complete sentence vector representations; this operation is completed by looking up the character-word vector matrix according to the numeric identifiers of the characters and words;
    a multi-granularity fusion coding layer construction subunit, configured to take the word-level and character-level sentence vectors output by the multi-granularity embedding layer as input, first obtain text semantic features from two perspectives, namely character-level semantic feature extraction and word-level semantic feature extraction, and then integrate the text semantic features of the two perspectives by element-wise addition to obtain the final sentence semantic feature vectors;
    an interaction matching layer construction subunit, configured to subject the two input sentence semantic feature vectors to hierarchical matching calculation to obtain the matching representation vector of the sentence pair;
    a prediction layer construction subunit, configured to receive the matching representation vector output by the interaction matching layer, compute with the Sigmoid function a matching degree lying in [0, 1], and finally judge the matching degree of the sentence pair by comparison with the established threshold;
    a multi-granularity fusion model training unit, configured to construct the loss function needed during model training and to complete the optimization training of the model.
  8. The device for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to claim 7, characterized in that the text matching knowledge base construction unit comprises:
    a raw data crawling subunit, configured to crawl question sets from public Internet question-answering platforms, or to use a publicly available text matching data set, to build the original similar-sentence knowledge base;
    a raw data processing subunit, configured to perform character segmentation and word segmentation on the sentences in the original similar-sentence knowledge base, thereby constructing the text matching knowledge base used for model training;
    the training data set generation unit comprises:
    a training positive example data construction subunit, configured to combine semantically matched sentences in the text matching knowledge base and attach the matching label 1 to them, constructing the training positive example data;
    a training negative example data construction subunit, configured to first select a sentence q1 from the text matching knowledge base, then randomly select from the text matching knowledge base a sentence q2 that does not semantically match sentence q1, combine q1 with q2 and attach the matching label 0 to them, constructing the training negative example data;
    a training data set construction subunit, configured to combine all the training positive example data and training negative example data and shuffle their order, thereby constructing the final training data set;
    the multi-granularity fusion model training unit comprises:
    a loss function construction subunit, configured to construct the loss function and calculate the error of the text matching degree between sentence 1 and sentence 2;
    a model optimization training subunit, configured to train and adjust the parameters of model training, thereby reducing the error between the matching degree predicted for sentence 1 and sentence 2 during model training and the true matching degree.
  9. A storage medium storing a plurality of instructions, characterized in that the instructions are loaded by a processor to execute the steps of the method for intelligent semantic matching of Chinese sentences based on a multi-granularity fusion model according to any one of claims 1-6.
  10. An electronic device, characterized in that the electronic device comprises:
    the storage medium of claim 9; and
    a processor, configured to execute the instructions in the storage medium.
PCT/CN2020/104723 2020-02-20 2020-07-27 Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device WO2021164199A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010103529.1 2020-02-20
CN202010103529.1A CN111310438B (en) 2020-02-20 2020-02-20 Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model

Publications (1)

Publication Number Publication Date
WO2021164199A1 true WO2021164199A1 (en) 2021-08-26

Family

ID=71151080

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/104723 WO2021164199A1 (en) 2020-02-20 2020-07-27 Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device

Country Status (2)

Country Link
CN (1) CN111310438B (en)
WO (1) WO2021164199A1 (en)


Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310438B (en) * 2020-02-20 2021-06-08 齐鲁工业大学 Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN111753524A (en) * 2020-07-01 2020-10-09 携程计算机技术(上海)有限公司 Text sentence break position identification method and system, electronic device and storage medium
CN111914551B (en) * 2020-07-29 2022-05-20 北京字节跳动网络技术有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112149410A (en) * 2020-08-10 2020-12-29 招联消费金融有限公司 Semantic recognition method and device, computer equipment and storage medium
CN112000772B (en) * 2020-08-24 2022-09-06 齐鲁工业大学 Sentence-to-semantic matching method based on semantic feature cube and oriented to intelligent question and answer
CN112101030B (en) * 2020-08-24 2024-01-26 沈阳东软智能医疗科技研究院有限公司 Method, device and equipment for establishing term mapping model and realizing standard word mapping
CN112328890B (en) * 2020-11-23 2024-04-12 北京百度网讯科技有限公司 Method, device, equipment and storage medium for searching geographic position point
CN112256841B (en) * 2020-11-26 2024-05-07 支付宝(杭州)信息技术有限公司 Text matching and countermeasure text recognition method, device and equipment
CN112463924B (en) * 2020-11-27 2022-07-05 齐鲁工业大学 Text intention matching method for intelligent question answering based on internal correlation coding
CN112560502B (en) * 2020-12-28 2022-05-13 桂林电子科技大学 Semantic similarity matching method and device and storage medium
CN112613282A (en) * 2020-12-31 2021-04-06 桂林电子科技大学 Text generation method and device and storage medium
CN112966524B (en) * 2021-03-26 2024-01-26 湖北工业大学 Chinese sentence semantic matching method and system based on multi-granularity twin network
CN113065358B (en) * 2021-04-07 2022-05-24 齐鲁工业大学 Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN113593709B (en) * 2021-07-30 2022-09-30 江先汉 Disease coding method, system, readable storage medium and device
CN113569014B (en) * 2021-08-11 2024-03-19 国家电网有限公司 Operation and maintenance project management method based on multi-granularity text semantic information
CN113780006B (en) * 2021-09-27 2024-04-09 广州金域医学检验中心有限公司 Training method of medical semantic matching model, medical knowledge matching method and device
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN114049884B (en) * 2022-01-11 2022-05-13 广州小鹏汽车科技有限公司 Voice interaction method, vehicle and computer-readable storage medium
CN115422362B (en) * 2022-10-09 2023-10-31 郑州数智技术研究院有限公司 Text matching method based on artificial intelligence
CN115688796B (en) * 2022-10-21 2023-12-05 北京百度网讯科技有限公司 Training method and device for pre-training model in natural language processing field


Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408153B (en) * 2014-12-03 2018-07-31 中国科学院自动化研究所 A kind of short text Hash learning method based on more granularity topic models
CN107315772B (en) * 2017-05-24 2019-08-16 北京邮电大学 The problem of based on deep learning matching process and device
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN109408627B (en) * 2018-11-15 2021-03-02 众安信息技术服务有限公司 Question-answering method and system fusing convolutional neural network and cyclic neural network
CN110032639B (en) * 2018-12-27 2023-10-31 ***股份有限公司 Method, device and storage medium for matching semantic text data with tag
CN110032635B (en) * 2019-04-22 2023-01-20 齐鲁工业大学 Problem pair matching method and device based on depth feature fusion neural network
CN110334184A (en) * 2019-07-04 2019-10-15 河海大学常州校区 The intelligent Answer System understood is read based on machine

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180137404A1 (en) * 2016-11-15 2018-05-17 International Business Machines Corporation Joint learning of local and global features for entity linking via neural networks
CN108984532A (en) * 2018-07-27 2018-12-11 福州大学 Aspect abstracting method based on level insertion
CN109299262A (en) * 2018-10-09 2019-02-01 中山大学 A kind of text implication relation recognition methods for merging more granular informations
CN110083692A (en) * 2019-04-22 2019-08-02 齐鲁工业大学 A kind of the text interaction matching process and device of finance knowledge question
CN110321419A (en) * 2019-06-28 2019-10-11 神思电子技术股份有限公司 A kind of question and answer matching process merging depth representing and interaction models
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder
CN111310438A (en) * 2020-02-20 2020-06-19 齐鲁工业大学 Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAIXIN ZHANG ET AL.: "MIFM: Multi-Granularity Information Fusion Model for Chinese Named Entity Recognition", IEEE ACCESS, vol. 2019, no. 7, 13 December 2019 (2019-12-13), ISSN: 2169-3536 *

Cited By (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705197A (en) * 2021-08-30 2021-11-26 北京工业大学 Fine-grained emotion analysis method based on position enhancement
CN113705197B (en) * 2021-08-30 2024-04-02 北京工业大学 Fine granularity emotion analysis method based on position enhancement
CN114153839A (en) * 2021-10-29 2022-03-08 杭州未名信科科技有限公司 Integration method, device, equipment and storage medium of multi-source heterogeneous data
CN114281987A (en) * 2021-11-26 2022-04-05 重庆邮电大学 Dialogue short text statement matching method for intelligent voice assistant
CN114218380A (en) * 2021-12-03 2022-03-22 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114218380B (en) * 2021-12-03 2022-07-29 淮阴工学院 Multi-mode-based cold chain loading user portrait label extraction method and device
CN114238563A (en) * 2021-12-08 2022-03-25 齐鲁工业大学 Multi-angle interaction-based intelligent matching method and device for Chinese sentences to semantic meanings
CN114357158A (en) * 2021-12-09 2022-04-15 南京中孚信息技术有限公司 Long text classification technology based on sentence granularity semantics and relative position coding
CN114357158B (en) * 2021-12-09 2024-04-09 南京中孚信息技术有限公司 Long text classification technology based on sentence granularity semantics and relative position coding
CN114239566B (en) * 2021-12-14 2024-04-23 公安部第三研究所 Method, device, processor and computer readable storage medium for realizing accurate detection of two-step Chinese event based on information enhancement
CN114239566A (en) * 2021-12-14 2022-03-25 公安部第三研究所 Method, device and processor for realizing two-step Chinese event accurate detection based on information enhancement and computer readable storage medium thereof
CN114492451B (en) * 2021-12-22 2023-10-24 马上消费金融股份有限公司 Text matching method, device, electronic equipment and computer readable storage medium
CN114492451A (en) * 2021-12-22 2022-05-13 马上消费金融股份有限公司 Text matching method and device, electronic equipment and computer readable storage medium
CN114297390A (en) * 2021-12-30 2022-04-08 江南大学 Aspect category identification method and system under long-tail distribution scene
CN114297390B (en) * 2021-12-30 2024-04-02 江南大学 Aspect category identification method and system in long tail distribution scene
CN114595306A (en) * 2022-01-26 2022-06-07 西北大学 Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN114595306B (en) * 2022-01-26 2024-04-12 西北大学 Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN114416930A (en) * 2022-02-09 2022-04-29 上海携旅信息技术有限公司 Text matching method, system, device and storage medium under search scene
CN114461806A (en) * 2022-02-28 2022-05-10 同盾科技有限公司 Training method and device of advertisement recognition model and advertisement shielding method
CN114357121A (en) * 2022-03-10 2022-04-15 四川大学 Innovative scheme design method and system based on data driving
CN114742016A (en) * 2022-04-01 2022-07-12 山西大学 Chapter-level event extraction method and device based on multi-granularity entity differential composition
CN114547256A (en) * 2022-04-01 2022-05-27 齐鲁工业大学 Text semantic matching method and device for intelligent question answering of fire safety knowledge
CN114547256B (en) * 2022-04-01 2024-03-15 齐鲁工业大学 Text semantic matching method and device for intelligent question and answer of fire safety knowledge
CN115048944A (en) * 2022-08-16 2022-09-13 之江实验室 Open domain dialogue reply method and system based on theme enhancement
CN115600945A (en) * 2022-09-07 2023-01-13 淮阴工学院(Cn) Multi-granularity-based cold chain loading user portrait construction method and device
CN115238684B (en) * 2022-09-19 2023-03-03 北京探境科技有限公司 Text collection method and device, computer equipment and readable storage medium
CN115238684A (en) * 2022-09-19 2022-10-25 北京探境科技有限公司 Text collection method and device, computer equipment and readable storage medium
CN115936014A (en) * 2022-11-08 2023-04-07 上海栈略数据技术有限公司 Medical entity code matching method, system, computer equipment and storage medium
CN115438674A (en) * 2022-11-08 2022-12-06 腾讯科技(深圳)有限公司 Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN115936014B (en) * 2022-11-08 2023-07-25 上海栈略数据技术有限公司 Medical entity code matching method, system, computer equipment and storage medium
CN115438674B (en) * 2022-11-08 2023-03-24 腾讯科技(深圳)有限公司 Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN116306558B (en) * 2022-11-23 2023-11-10 北京语言大学 Method and device for computer-aided Chinese text adaptation
CN116306558A (en) * 2022-11-23 2023-06-23 北京语言大学 Method and device for computer-aided Chinese text adaptation
CN115910345A (en) * 2022-12-22 2023-04-04 广东数业智能科技有限公司 Mental health assessment intelligent early warning method and storage medium
CN116071759A (en) * 2023-03-06 2023-05-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116204642A (en) * 2023-03-06 2023-06-02 上海阅文信息技术有限公司 Intelligent character implicit attribute recognition analysis method, system and application in digital reading
CN116204642B (en) * 2023-03-06 2023-10-27 上海阅文信息技术有限公司 Intelligent character implicit attribute recognition analysis method, system and application in digital reading
CN116071759B (en) * 2023-03-06 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116304745B (en) * 2023-03-27 2024-04-12 济南大学 Text topic matching method and system based on deep semantic information
CN116304745A (en) * 2023-03-27 2023-06-23 济南大学 Text topic matching method and system based on deep semantic information
CN117271438A (en) * 2023-07-17 2023-12-22 乾元云硕科技(深圳)有限公司 Intelligent storage system for big data and method thereof
CN116629275B (en) * 2023-07-21 2023-09-22 北京无极慧通科技有限公司 Intelligent decision support system and method based on big data
CN116629275A (en) * 2023-07-21 2023-08-22 北京无极慧通科技有限公司 Intelligent decision support system and method based on big data
CN116680590A (en) * 2023-07-28 2023-09-01 中国人民解放军国防科技大学 Post portrait label extraction method and device based on work instruction analysis
CN116680590B (en) * 2023-07-28 2023-10-20 中国人民解放军国防科技大学 Post portrait label extraction method and device based on work instruction analysis
CN116822495B (en) * 2023-08-31 2023-11-03 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning
CN116822495A (en) * 2023-08-31 2023-09-29 小语智能信息科技(云南)有限公司 Chinese-Lao-Thai parallel sentence pair extraction method and device based on contrastive learning
CN117590944A (en) * 2023-11-28 2024-02-23 上海源庐加佳信息科技有限公司 Binding system for physical person object and digital virtual person object
CN117390141B (en) * 2023-12-11 2024-03-08 江西农业大学 Agricultural socialization service quality user evaluation data analysis method
CN117390141A (en) * 2023-12-11 2024-01-12 江西农业大学 Agricultural socialization service quality user evaluation data analysis method
CN117556027A (en) * 2024-01-12 2024-02-13 一站发展(北京)云计算科技有限公司 Intelligent interaction system and method based on digital human technology
CN117556027B (en) * 2024-01-12 2024-03-26 一站发展(北京)云计算科技有限公司 Intelligent interaction system and method based on digital human technology
CN117633518B (en) * 2024-01-25 2024-04-26 北京大学 Industrial chain construction method and system
CN117633518A (en) * 2024-01-25 2024-03-01 北京大学 Industrial chain construction method and system
CN117669593A (en) * 2024-01-31 2024-03-08 山东省计算中心(国家超级计算济南中心) Zero-shot relation extraction method, system, equipment and medium based on equivalent semantics
CN117669593B (en) * 2024-01-31 2024-04-26 山东省计算中心(国家超级计算济南中心) Zero-shot relation extraction method, system, equipment and medium based on equivalent semantics
CN117744787A (en) * 2024-02-20 2024-03-22 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117744787B (en) * 2024-02-20 2024-05-07 中国电子科技集团公司第十研究所 Intelligent measurement method for first-order research rule knowledge rationality
CN117874209A (en) * 2024-03-12 2024-04-12 深圳市诚立业科技发展有限公司 NLP-based fraud short message monitoring and alarm system
CN117874209B (en) * 2024-03-12 2024-05-17 深圳市诚立业科技发展有限公司 NLP-based fraud short message monitoring and alarm system
CN117910460A (en) * 2024-03-18 2024-04-19 国网江苏省电力有限公司南通供电分公司 Electric power scientific research knowledge correlation construction method and system based on BGE model
CN117910460B (en) * 2024-03-18 2024-06-07 国网江苏省电力有限公司南通供电分公司 Electric power scientific research knowledge correlation construction method and system based on BGE model
CN118093791A (en) * 2024-04-24 2024-05-28 北京中关村科金技术有限公司 AI knowledge base generation method and system combined with cloud computing
CN118132683A (en) * 2024-05-07 2024-06-04 杭州海康威视数字技术股份有限公司 Training method of text extraction model, text extraction method and equipment
CN118153553A (en) * 2024-05-09 2024-06-07 江西科技师范大学 Social network user psychological crisis cause extraction method and system based on multitasking
CN118193743A (en) * 2024-05-20 2024-06-14 山东齐鲁壹点传媒有限公司 Multi-level text classification model based on pre-training model

Also Published As

Publication number Publication date
CN111310438B (en) 2021-06-08
CN111310438A (en) 2020-06-19

Similar Documents

Publication Title
WO2021164199A1 (en) Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device
WO2021164200A1 (en) Intelligent semantic matching method and apparatus based on deep hierarchical coding
CN111259127B (en) Long text answer selection method based on transfer learning sentence vector
CN111310439B (en) Intelligent semantic matching method and device based on depth feature dimension changing mechanism
CN112347268A (en) Text-enhanced knowledge graph joint representation learning method and device
CN110019732B (en) Intelligent question answering method and related device
WO2021204014A1 (en) Model training method and related apparatus
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN113377897B (en) Multi-language medical term standardization system and method based on deep adversarial learning
CN110032635A (en) Question matching method and device based on deep feature fusion neural network
CN111597314A (en) Reasoning question-answering method, device and equipment
WO2024131111A1 (en) Intelligent writing method and apparatus, device, and nonvolatile readable storage medium
TW201841121A (en) A method of automatically generating semantically similar sentence samples
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN111222330B (en) Chinese event detection method and system
CN111241303A (en) Distant supervision relation extraction method for large-scale unstructured text data
CN112926324A (en) Vietnamese event entity recognition method integrating dictionary and adversarial transfer
CN113204611A (en) Method for establishing reading comprehension model, reading comprehension method and corresponding device
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN116821373A (en) Map-based prompt recommendation method, device, equipment and medium
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN110826341A (en) Semantic similarity calculation method based on seq2seq model
WO2023130688A1 (en) Natural language processing method and apparatus, device, and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20920016

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 15/03/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20920016

Country of ref document: EP

Kind code of ref document: A1