CN113806474A - Data matching method and device, electronic equipment and storage medium - Google Patents

Data matching method and device, electronic equipment and storage medium

Info

Publication number
CN113806474A
CN113806474A (application CN202010855824.2A)
Authority
CN
China
Prior art keywords
data
answer
question
matching
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010855824.2A
Other languages
Chinese (zh)
Inventor
胡珅健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Wodong Tianjun Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Wodong Tianjun Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN202010855824.2A priority Critical patent/CN113806474A/en
Publication of CN113806474A publication Critical patent/CN113806474A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data matching method and device, electronic equipment and a storage medium, and relates to the technical field of computers. The data matching method comprises the following steps: acquiring question data and matching candidate answer data corresponding to the question data; performing word cutting processing on the question data and the candidate answer data to generate question-answer pair data in word form; inputting the question-answer pair data in word form into a pre-trained answer matching model, so as to determine similarity data corresponding to the question-answer pair data through the answer matching model; and determining matching answer data corresponding to the question data according to the similarity data. According to the technical scheme of the embodiments of the disclosure, the accuracy of automatic question answering can be improved, improving the user experience.

Description

Data matching method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data matching method, a data matching apparatus, an electronic device, and a computer-readable storage medium.
Background
With the rapid development of Internet technology, intelligent customer service has attracted increasing attention. Answer Selection in the field of Chinese health consultation refers to measuring the semantic matching degree between a question and candidate answers using natural language processing or deep learning techniques in the Chinese health (medical) domain, and then selecting the more accurate answer from among multiple candidate answers.
At present, related technical schemes adopt a general-purpose word segmentation tool to segment text into words and then train word vectors on the segmentation result. However, question-and-answer data in the Chinese health (medical) field contains many technical terms and much noise, so directly applying a general-purpose word segmentation tool to domain terms causes great semantic loss, degrades the precision of word vector calculation, and in turn degrades the accuracy of answer selection and matching.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The disclosed embodiments aim to provide a data matching method, a data matching apparatus, an electronic device, and a computer-readable storage medium, so as to overcome, at least to some extent, the problem in related schemes that segmenting professional terms with a general-purpose word segmentation tool yields low-accuracy word vectors and thereby degrades answer matching accuracy.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to a first aspect of the embodiments of the present disclosure, there is provided a data matching method, including:
acquiring question data and matching candidate answer data corresponding to the question data;
performing word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form;
inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
and determining matching answer data corresponding to the question data according to the similarity data.
In some example embodiments of the present disclosure, based on the foregoing scheme, the answer matching model includes a bidirectional recurrent neural network layer;
after inputting the question-answer pair data in the form of words into a pre-trained answer matching model, the method further comprises:
inputting the question-answer pair data in the form of words into the bidirectional recurrent neural network layer, and generating the question-answer pair data containing context information.
In some example embodiments of the present disclosure, based on the foregoing, the answer matching model further includes a multi-scale convolution layer;
after generating the question-answer pair data containing the context information, the method further comprises:
and performing feature extraction on the question-answer pair data containing the context information through the multi-scale convolution layer to obtain a question-answer pair feature vector corresponding to the question-answer pair data containing the context information.
In some example embodiments of the present disclosure, based on the foregoing scheme, the candidate answer data includes positive answer data and negative answer data; the question-answer pair feature vectors comprise question vectors, positive answer feature vectors and negative answer feature vectors;
before determining, by the answer matching model, similarity data corresponding to the question-answer pair data, the method further includes:
calculating first similarity data between the question vector and the positive answer feature vector;
and calculating second similarity data between the question vector and the negative answer feature vector.
In some example embodiments of the present disclosure, based on the foregoing scheme, the determining, by the answer matching model, similarity data corresponding to the question-answer pair data includes:
and inputting the first similarity data and the second similarity data into a maximum interval distance loss function corresponding to the answer matching model, and outputting the similarity data corresponding to the question-answer pair data.
In some example embodiments of the present disclosure, based on the foregoing scheme, the performing word cutting processing on the question data and the candidate answer data includes:
obtaining sample question-answer pair data, and preprocessing the sample question-answer pair data;
performing word cutting processing on the preprocessed sample question-answer pair data, and training a target word vector model on the word-cut sample question-answer pair data to obtain a word vector lookup table;
and performing word cutting processing on the question data and the candidate answer data according to the word vector lookup table.
In some example embodiments of the present disclosure, based on the foregoing scheme, the generating of question-answer pair data in the form of words includes:
and forming a triple set by the question data, the positive answer data and the negative answer data in the word form, and taking the triple set as the question-answer pair data in the word form.
According to a second aspect of the embodiments of the present disclosure, there is provided a data matching apparatus including:
the data acquisition module is used for acquiring question data and matching candidate answer data corresponding to the question data;
the data word cutting module is used for carrying out word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form;
the similarity data determining module is used for inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
and the answer data matching module is used for determining matched answer data corresponding to the question data according to the similarity data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus further includes a context information generating unit configured to:
inputting the question-answer pair data in the form of words into the bidirectional recurrent neural network layer, and generating the question-answer pair data containing context information.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus further includes a question-answer pair feature vector generation unit configured to:
and performing feature extraction on the question-answer pair data containing the context information through the multi-scale convolution layer to obtain a question-answer pair feature vector corresponding to the question-answer pair data containing the context information.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus further includes a similarity data determination unit configured to:
calculating first similarity data between the question vector and the positive answer feature vector;
and calculating second similarity data between the question vector and the negative answer feature vector.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the answer data matching module is further configured to:
and inputting the first similarity data and the second similarity data into a maximum interval distance loss function corresponding to the answer matching model, and outputting the similarity data corresponding to the question-answer pair data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data word cutting module further includes a data word cutting unit, and the data word cutting unit is configured to:
obtaining sample question-answer pair data, and preprocessing the sample question-answer pair data;
performing word cutting processing on the preprocessed sample question-answer pair data, and training a target word vector model on the word-cut sample question-answer pair data to obtain a word vector lookup table;
and performing word cutting processing on the question data and the candidate answer data according to the word vector lookup table.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data word cutting module further includes a triple construction unit, and the triple construction unit is configured to:
and forming a triple set by the question data, the positive answer data and the negative answer data in the word form, and taking the triple set as the question-answer pair data in the word form.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory having computer readable instructions stored thereon, the computer readable instructions when executed by the processor implementing the data matching method of any one of the above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the data matching method according to any one of the above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the data matching method in the example embodiment of the disclosure, question data and corresponding candidate answer data are obtained; performing word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form, and then determining similarity data corresponding to the question-answer pair data through an answer matching model; and determining matching answer data corresponding to the question data according to the similarity data. On one hand, the problem of semantic loss caused by word segmentation of the problem data and the candidate answer data can be avoided by performing word segmentation processing on the problem data and the candidate answer data, and the accuracy of a feature vector obtained after word segmentation is improved; on the other hand, the similarity data of the question-answer pair data in the word form is determined through the pre-trained answer matching model, and the matching answer data which is most matched with the question data is further selected according to the similarity data, so that the question data can be replied more accurately, and the user experience is improved; on the other hand, the question data and the candidate answer data are processed into question-answer pair data in a word form, so that the complexity of the data can be reduced, and the data processing efficiency of the system can be improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow diagram of a data matching method according to some embodiments of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a process of word cutting question data and candidate answer data, in accordance with some embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of training an answer matching model, according to some embodiments of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of an answer matching model determining similarity of questions and answers to data, in accordance with some embodiments of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of a data matching apparatus, according to some embodiments of the present disclosure;
FIG. 6 schematically illustrates a structural schematic of a computer system of an electronic device, in accordance with some embodiments of the present disclosure;
fig. 7 schematically illustrates a schematic diagram of a computer-readable storage medium, according to some embodiments of the present disclosure.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
Furthermore, the drawings are merely schematic illustrations and are not necessarily drawn to scale. The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
Semantic matching algorithms in the related art mainly include: (1) vector space models based on bag-of-words theory, which use TF-IDF (Term Frequency-Inverse Document Frequency) and a segmented vector space model (SVSM) to retrieve web-page medical consultation data; (2) syntactic methods, which obtain a syntactic structure with a syntactic analysis tool, extract features from that structure, and compute the matching degree; (3) semantic vector calculation based on deep learning technology, which applies a binary convolutional neural network to the answer selection task, generates distributed semantic representations of question-answer pairs through convolutional neural networks, and, borrowing the idea of machine translation, builds a similarity matrix between question-answer pairs to solve for their matching degree; (4) interactive system design, which starts from user interaction with the health question-answering system and constructs health question-answer data by enhancing the user interface with single answers, fragment lists, and fragment combinations.
The inventor has found through research that these four semantic matching algorithms can be summarized as feature extraction methods based on external resources and feature extraction methods based on deep learning technology.
However, the feature extraction methods based on external resources have the following problems: they extract syntactic trees and lexical and grammatical information of question-answer pairs by means of language-generation tools and the like, or extract the relatedness of words within sentences from language resources such as HowNet, synonym forests, and WordNet. Such methods depend on the quality of manually constructed features, suffer from strong dependence and limitation, and generalize poorly across different languages and different research content, so the final evaluation indices fluctuate considerably.
The feature extraction methods based on deep learning technology have the following problems: convolutional neural networks used to extract sentence features ignore word-order information, yet word order is important for understanding sentences; deep learning techniques are rarely applied in the Chinese domain, especially to Chinese health consultation data; and the Chinese health consultation field contains a large number of professional terms, so out-of-vocabulary words appear frequently, and in particular the wrongly written characters that appear in consultation data on Chinese health consultation platforms degrade the precision of word vector calculation.
Based on one or more of the above problems, in the present exemplary embodiment, a data matching method is first provided, where the data matching method may be applied to a terminal device, such as an electronic device like a mobile phone or a computer, and may also be applied to a server, and this is not limited in this exemplary embodiment. Taking the method performed by the terminal device as an example, fig. 1 schematically shows a schematic diagram of a data matching method flow according to some embodiments of the present disclosure. Referring to fig. 1, the data matching method may include the steps of:
step S110, obtaining question data and matching candidate answer data corresponding to the question data;
step S120, the question data and the candidate answer data are subjected to word cutting processing to generate question-answer pair data in a word form;
step S130, inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
step S140, determining matching answer data corresponding to the question data according to the similarity data.
According to the data matching method in this embodiment, on one hand, word cutting the question data and the candidate answer data, rather than segmenting them into words, avoids the semantic loss caused by word segmentation and improves the accuracy of the feature vectors obtained after cutting; on another hand, the similarity data of the question-answer pair data in word form is determined through the pre-trained answer matching model, and the matching answer data that best matches the question data is then determined according to the similarity data, so that the question data can be answered more accurately, improving the user experience; on yet another hand, processing the question data and the candidate answer data into question-answer pair data in word form reduces the complexity of the data and improves the data processing efficiency of the system.
Next, the data matching method in the present exemplary embodiment will be further explained.
In step S110, question data is obtained and candidate answer data corresponding to the question data is matched.
In an example embodiment of the present disclosure, the question data may be a consultation question input by a target object, where the target object may be a user object or an automated script tool (e.g., a test script, etc.), and this example embodiment is not particularly limited thereto. Of course, the question data may also be standard question data stored in advance, for example, by acquiring spoken question data input by the user object, the standard question data corresponding to the question data is matched in a preset database.
The candidate answer data may be a plurality of candidate answers matched with the question data; for example, a plurality of answer data corresponding to the question data may be matched in a preset database according to keyword information corresponding to the question data and used as the candidate answer data. Of course, the standard question data may also be determined from the spoken question data, and the candidate answer data matched with the question data obtained from the preset database according to the standard question data, which is not particularly limited in this example embodiment. The candidate answer data may include answers that match the question data well and answers that match it poorly. For example, assuming the question data is "What are the treatment methods for a cold?", a more correct candidate answer (with a higher matching degree) may be "The treatment methods for a cold include the following: the first; the second; the third", while "There are three treatment methods for a cold" would be a less correct candidate answer (with a lower matching degree). Of course, this is only an illustrative example, and the present exemplary embodiment is not limited thereto.
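By way of illustration only, the following is a minimal sketch (in Python) of keyword-based candidate retrieval; the database layout ((keywords, answer) pairs) and the pre-extracted question keywords are hypothetical assumptions, since the patent does not fix them:

```python
# A minimal sketch of keyword-based candidate retrieval; the database layout
# and the pre-extracted keywords are hypothetical, not fixed by the patent.
def match_candidates(question_keywords, database):
    """Return every answer whose keywords overlap the question's keywords."""
    return [answer for keywords, answer in database
            if set(keywords) & set(question_keywords)]

# Example: one overlapping keyword ("cold") makes an answer a candidate.
db = [({"cold", "treatment"}, "The treatment methods for a cold include ..."),
      ({"fever", "medicine"}, "For a fever, ...")]
print(match_candidates({"cold", "treatment", "methods"}, db))
```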
In step S120, the question data and the candidate answer data are subjected to word cutting processing, and question-answer pair data in a word form is generated.
In an example embodiment of the present disclosure, word cutting processing may refer to dividing the question-answer pair constructed from the question data and its corresponding candidate answer data into data in word form, that is, into sequences of single characters rather than segmented words; the question-answer pair data in word form may be the data obtained by word cutting the question-answer pair corresponding to the question data.
For example, suppose the question data is "请问感冒有哪些治疗方法？" ("What are the treatment methods for a cold?") and the candidate answer data is "感冒的治疗方法有以下几种：第一种；第二种；第三种。" ("The treatment methods for a cold are as follows: the first; the second; the third."). Specifically, the question data and the candidate answer data are first combined into the question-answer pair "{请问感冒有哪些治疗方法？}; {感冒的治疗方法有以下几种：第一种；第二种；第三种。}", and the question-answer pair is then subjected to word cutting processing to obtain the corresponding question-answer pair data in word form "{请, 问, 感, 冒, 有, 哪, 些, 治, 疗, 方, 法}; {感, 冒, 的, 治, 疗, 方, 法, 有, 以, 下, 几, 种, 第, 一, 种, 第, 二, 种, 第, 三, 种}".
Specifically, the word cutting processing on the question data and the candidate answer data can be realized through the steps in fig. 2:
step S210, obtaining sample question-answer pair data, and preprocessing the sample question-answer pair data;
step S220, performing word cutting processing on the preprocessed sample question answers and data to obtain a word vector lookup table according to a word vector model of the sample question answers and data training targets after the word cutting processing;
step S230, performing word cutting processing on the question data and the candidate answer data according to the word vector lookup table.
The sample question-answer pair data may be sample data obtained for training the target word vector model, and may be question data of a target field, together with the answer data corresponding to that question data, captured by a crawler tool. For example, the sample question-answer pair data may be question data in the Chinese health consultation field, with its associated answer data, collected from a Chinese health consultation platform by Octoparse (a web crawler tool). Of course, this is only an illustrative example and should not impose any special limitation on the present exemplary embodiment.
The preprocessing may be a process of removing interfering factors from the sample question-answer pair data before word cutting, and may include removing punctuation, removing stop words, and the like; for example, for the sample question data "请问感冒有哪些治疗方法？" ("What are the treatment methods for a cold?"), removing the punctuation and stop words may yield the preprocessed sample question data "感冒有哪些治疗方法". Of course, this is only an illustration and should not impose any special limitation on this exemplary embodiment.
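A minimal preprocessing sketch follows; the regular expression and the stop-word list are hypothetical placeholders, since the patent does not fix an implementation:

```python
import re

# Hypothetical stop words; the patent does not specify a stop-word list.
STOP_WORDS = {"请", "问", "的", "了"}

def preprocess(text: str) -> str:
    """Strip punctuation and stop words from a raw question/answer string."""
    text = re.sub(r"[^\w]", "", text)   # drop punctuation (and whitespace)
    return "".join(ch for ch in text if ch not in STOP_WORDS)

print(preprocess("请问感冒有哪些治疗方法？"))   # -> "感冒有哪些治疗方法"
```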
The target word vector model may refer to a model for generating word vectors from segmented sentences; for example, the target word vector model may be Word2Vec (word-to-vector, a family of models for generating word vectors), a named-entity recognition (NER) tool, or another tool capable of processing segmented sentences, which is not limited in this exemplary embodiment.
After the target word vector model has been trained on the sample question-answer pair data, the trained target word vector model serves as the word vector lookup table and can be used for the word cutting processing of the question data and the candidate answer data.
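As an illustration, the following sketches the word cutting step together with training of the target word vector model, assuming gensim's Word2Vec as that model (an assumption; the toy corpus stands in for the crawled sample question-answer pair data):

```python
from gensim.models import Word2Vec

def cut_into_chars(sentence: str) -> list:
    """Split a sentence into single characters (the word cutting step)."""
    return [ch for ch in sentence if not ch.isspace()]

# Toy stand-in for the preprocessed sample question-answer pair data.
sample_qa = ["感冒有哪些治疗方法", "感冒的治疗方法有第一种第二种第三种"]
corpus = [cut_into_chars(s) for s in sample_qa]

# Train the target word vector model on the character-cut corpus.
model = Word2Vec(corpus, vector_size=100, window=3, min_count=1, sg=1)

# The trained embeddings act as the word vector lookup table described above.
lookup_table = {ch: model.wv[ch] for ch in model.wv.index_to_key}
```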
In step S130, the question-answer pair data in the form of words is input into a pre-trained answer matching model, so as to determine similarity data corresponding to the question-answer pair data through the answer matching model.
In an example embodiment of the present disclosure, the answer matching model may refer to a pre-trained machine learning model for measuring the similarity between question-answer pair data; for example, the answer matching model may be a deep learning model based on a convolutional neural network, or a random forest model, which is not particularly limited in this example embodiment. Similarity data may refer to a numerical value, determined by the answer matching model, that measures the similarity between a question-answer pair. For example, suppose the question data is "What are the treatment methods for a cold?", the first answer data is "The treatment methods for a cold are as follows: the first; the second; the third", and the second answer data is "There are three treatment methods for a cold"; the similarity data between the question data and the first answer data may then be 0.9, while the similarity data between the question data and the second answer data may be 0.2.
When there are many candidate answer data, there are correspondingly many question-answer pair data for the question data; therefore, the question data, positive answer data, and negative answer data in word form can be assembled into a triplet set, and the triplet set used as the question-answer pair data in word form.
The triplet set may take the form of question-answer triples of the question data, positive answer data, and negative answer data in word form. For example, for each question data q_i there is one correct positive answer data a_i+ (if a question has multiple correct answers, one of them may be selected at random), and any one answer may then be randomly drawn from the candidate answer data corresponding to the other questions as negative answer data a_i-; the question-answer pair data corresponding to the question data q_i may then take the form of the triplet (q_i, a_i+, a_i-). Of course, the description is illustrative only, and no particular limitation should be placed on the exemplary embodiment.
For example, the preprocessed word-form question data q_i may be "{问, 感, 冒, 有, 哪, 些, 治, 疗, 方, 法}", the preprocessed word-form positive answer data may be "{感, 冒, 治, 疗, 方, 法, 有, 第, 一, 种, 第, 二, 种, 第, 三, 种}", and the preprocessed word-form negative answer data may be "{感, 冒, 治, 疗, 方, 法, 有, 三, 种}"; the triplet set corresponding to the question data q_i is then [{问, 感, 冒, 有, 哪, 些, 治, 疗, 方, 法}; {感, 冒, 治, 疗, 方, 法, 有, 第, 一, 种, 第, 二, 种, 第, 三, 种}; {感, 冒, 治, 疗, 方, 法, 有, 三, 种}]. It is understood that this is illustrative only and should not be construed as limiting the exemplary embodiments.
Constructing the question-answer pair data into a triplet set unifies the data structure, which facilitates data storage and management, makes the question-answer pair data easier for the answer matching model to identify and process, and improves the accuracy with which the model recognizes the data.
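A minimal sketch of this triplet construction follows; the data layout (parallel lists of questions and their correct answers, plus a pool of all candidate answers) is a hypothetical assumption:

```python
import random

def build_triplets(questions, positives, answer_pool):
    """Build (q_i, a_i+, a_i-) triples from questions and answer lists."""
    triplets = []
    for q, pos_list in zip(questions, positives):
        a_pos = random.choice(pos_list)                 # one correct answer, at random
        negatives = [a for a in answer_pool if a not in pos_list]
        a_neg = random.choice(negatives)                # answer drawn from another question
        triplets.append((q, a_pos, a_neg))              # the triplet (q_i, a_i+, a_i-)
    return triplets
```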
Similarly, when training the answer matching model, the sample question-answer pair data may be labeled and each question-answer pair arranged into data in a target format. For example, after the sample question-answer pair data is labeled, training sample data in the target format index{label, qid, query_content, answer_content} may be obtained, where: label takes the value 0 or 1, with 0 denoting an incorrect answer and 1 a correct answer; qid is the question number, used as an identifier during data processing so that sample data can be managed conveniently; query_content denotes the content corresponding to the question data in word form; and answer_content denotes the content corresponding to the positive answer data or negative answer data in word form. Of course, this is only an illustrative example and should not impose any special limitation on this exemplary embodiment.
For example, the training sample data in the target format may include a positive sample "index[1, 001, {问, 感, 冒, 有, 哪, 些, 治, 疗, 方, 法}, {感, 冒, 治, 疗, 方, 法, 有, 第, 一, 种, 第, 二, 种, 第, 三, 种}]" and a negative sample "index[0, 002, {问, 感, 冒, 有, 哪, 些, 治, 疗, 方, 法}, {感, 冒, 治, 疗, 方, 法, 有, 三, 种}]", which are merely illustrative examples and should not be construed as any special limitation on the present exemplary embodiment.
In one example embodiment of the present disclosure, the answer matching model may include a bidirectional recurrent neural network layer; the question-answer pair data in the form of words may be input into the bi-directional recurrent neural network layer, generating question-answer pair data containing context information.
The basic idea of the Bidirectional Recurrent Neural Network (BRNN) is to provide, for each training sequence, two Recurrent Neural Networks (RNNs), one running forward and one backward, both connected to the same output layer. This structure supplies the output layer with complete past and future context information for every point in the input sequence. Compared with a unidirectional recurrent neural network, a bidirectional recurrent neural network computes both forwards and backwards. The question-answer pair data in word vector form is input into the bidirectional recurrent neural network layer to obtain question-answer pair data containing context information. The overall network may generally include an input layer, convolutional layers, a pooling (down-sampling) layer, a fully connected layer, and an output layer.
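By way of illustration only, a minimal sketch of such a bidirectional recurrent layer is given below, assuming PyTorch and a BiGRU as named in Fig. 3 and Fig. 4; the vocabulary size, embedding dimension, and hidden dimension are illustrative assumptions, not values fixed by the patent:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Embeds character ids and runs them through a bidirectional GRU."""
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) indices into the character vocabulary
        x = self.embed(char_ids)       # (batch, seq_len, embed_dim)
        out, _ = self.bigru(x)         # (batch, seq_len, 2 * hidden_dim)
        return out                     # each position carries past and future context
```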
In an example embodiment of the present disclosure, the answer matching model may further include a multi-scale convolutional layer; after the question-answer pair data containing the context information is generated, the question-answer pair feature vector corresponding to the question-answer pair data containing the context information is obtained by performing feature extraction on the question-answer pair data containing the context information through the multi-scale convolution layer.
In order to obtain finer-grained latent local feature representations of the question-answer pair data, a multi-scale convolutional neural network layer is attached after the output of the bidirectional recurrent neural network layer (multi-scale convolution kernels are adopted in view of the diversity of linguistic expression structures). In natural language processing, the multi-scale convolution layer concatenates the word vectors of consecutive words in the question-answer pair data, taking a convolution window as the unit, and then maps the concatenated vector through a mapping function into a new local feature vector, which amounts to a deeper abstraction of the semantics. Down-sampling is performed after the multi-scale convolution operation to obtain the question-answer pair feature vector representation corresponding to the question-answer pair data containing the context information.
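The following sketches the multi-scale convolution and down-sampling step, again assuming PyTorch; the kernel widths (2, 3, 4) and filter count are illustrative assumptions, and in_dim matches the BiGRU sketch above (2 × 128):

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Several 1-D convolutions of different widths, max-pooled over time."""
    def __init__(self, in_dim=256, n_filters=64, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_dim, n_filters, k) for k in kernel_sizes)

    def forward(self, x):
        # x: (batch, seq_len, in_dim); Conv1d expects (batch, in_dim, seq_len)
        x = x.transpose(1, 2)
        # one feature map per kernel width, max-pooled over time (down-sampling)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(pooled, dim=1)   # fixed-size question/answer feature vector
```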
Specifically, the candidate answer data may include positive answer data and negative answer data, and the question-answer pair feature vectors may include a question vector, a positive answer feature vector, and a negative answer feature vector. First similarity data between the question vector and the positive answer feature vector can then be calculated, along with second similarity data between the question vector and the negative answer feature vector.
The first similarity data may be the similarity between the question vector and the positive answer feature vector, and the second similarity data the similarity between the question vector and the negative answer feature vector. For example, the similarity data may be the cosine similarity between the question vector and the positive or negative answer feature vector, the Pearson correlation coefficient between them, or any other measure of their similarity, which is not particularly limited in this example.
Further, the first similarity data and the second similarity data may be input into the maximum interval distance loss function corresponding to the answer matching model, and the similarity data corresponding to the question-answer pair data output. Because question-answer pair data have a paired nature, when similarity is computed at the semantic level, the similarity between the question data and a candidate answer needs to be compared against other candidates through a corresponding strategy in order to better determine which matches best. Therefore, the answer matching model may be trained with a maximum interval distance loss function (pairwise hinge loss). The maximum interval distance loss function can be shown in relation (1):
loss(q_i, a_i+, a_i-) = {margin - [cos(q_i, a_i+) - cos(q_i, a_i-)]}+    (1)
where q_i denotes the question data, a_i+ the positive answer data, and a_i- the negative answer data; cos(·, ·) denotes the semantic similarity function; {·}+ denotes the hinge function, the subscript "+" indicating that only the positive part is taken; and margin denotes the threshold on the interval distance between the positive answer data and the negative answer data.
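For clarity, a minimal sketch of relation (1) follows, assuming PyTorch; margin = 0.2 is an illustrative value, not one given in the patent:

```python
import torch
import torch.nn.functional as F

def pairwise_hinge_loss(q_vec, pos_vec, neg_vec, margin=0.2):
    """Maximum interval distance loss over cosine similarities, relation (1)."""
    sim_pos = F.cosine_similarity(q_vec, pos_vec)   # first similarity data
    sim_neg = F.cosine_similarity(q_vec, neg_vec)   # second similarity data
    # {margin - [cos(q, a+) - cos(q, a-)]}+ : clamp keeps only the positive part
    return torch.clamp(margin - (sim_pos - sim_neg), min=0).mean()
```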
In step S140, matching answer data corresponding to the question data is determined according to the similarity data.
In an example embodiment of the present disclosure, the matching answer data may refer to the answer data, among the candidate answer data corresponding to the question data, whose semantics best match the question data. The similarity data corresponding to each question-answer pair is determined through the answer matching model, the question-answer pairs are sorted in descending order of similarity, and the answer data of the question-answer pair with the largest similarity is taken as the matching answer data for the question data.
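As an illustration of step S140, the sketch below scores each candidate and returns the top-ranked answer; `encode` and `similarity` are hypothetical stand-ins for the trained encoder (BiGRU plus multi-scale convolution) and the cosine similarity:

```python
def best_answer(question, candidates, encode, similarity):
    """Rank candidates by similarity to the question, descending; return the top one."""
    q_vec = encode(question)
    scored = [(float(similarity(q_vec, encode(a))), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # descending similarity
    return scored[0][1]                                   # matching answer data
```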
Fig. 3 schematically illustrates a flow diagram of training an answer matching model, according to some embodiments of the present disclosure.
Referring to fig. 3, in step S310, labeled sample question-answer pair data in word form is input into the answer matching model to be trained, to obtain positive example sentences of word vectors, question example sentences of word vectors, and negative example sentences of word vectors;
step S320, inputting the positive example sentences, question example sentences, and negative example sentences of word vectors into the bidirectional recurrent neural network layer to obtain positive example sentence data, question example sentence data, and negative example sentence data containing context information;
step S330, inputting the positive example sentence data, question example sentence data, and negative example sentence data containing context information into the multi-scale convolutional neural network, and performing convolution processing with multi-scale convolution kernels;
step S340, obtaining, through the convolution processing, the semantic vector representations corresponding to the positive example sentence data, the question example sentence data, and the negative example sentence data;
step S350, calculating first similarity data between the semantic vector of the positive example sentence data and the semantic vector of the question example sentence data;
step S360, calculating second similarity data between the semantic vector of the negative example sentence data and the semantic vector of the question example sentence data;
step S370, training the maximum interval distance loss function with the first similarity data, the second similarity data, and the label data until the maximum interval distance loss function converges, completing the training of the answer matching model.
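Putting the sketches above together, a condensed, hypothetical version of this training flow (steps S310-S370) might look as follows; `triplet_batches` is a hypothetical iterator yielding character-id tensors for (question, positive answer, negative answer):

```python
import torch

def train(encoder, conv, triplet_batches, epochs=5, lr=1e-3):
    """Train the encoder and convolution modules with the pairwise hinge loss."""
    params = list(encoder.parameters()) + list(conv.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        for q_ids, pos_ids, neg_ids in triplet_batches:
            q_vec = conv(encoder(q_ids))     # S320/S330: BiGRU, then multi-scale conv
            p_vec = conv(encoder(pos_ids))   # S340: semantic vector representations
            n_vec = conv(encoder(neg_ids))
            loss = pairwise_hinge_loss(q_vec, p_vec, n_vec)   # S350-S370
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                 # iterate until the loss converges
```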
Fig. 4 schematically illustrates a flow diagram of an answer matching model determining similarity of questions and answers to data, according to some embodiments of the present disclosure.
Referring to fig. 4, step S401, inputting Question data (Question) in the cut-word Question-answer pair data into an answer matching model;
step S402, converting the question data into a corresponding feature vector;
step S403, inputting the feature vector corresponding to the question data into a bidirectional recurrent neural network (BiGRU) layer to obtain question data containing context information;
step S404, inputting the obtained question data containing context information into the multi-scale convolution layer to obtain a semantic vector of the question data containing the context information;
step S405, inputting the obtained semantic vector into a pooling layer for pooling processing to obtain a semantic vector of fixed size;
step S406, inputting the fixed-size semantic vector into an output layer to obtain the semantic vector representation of the question data;
step S407, inputting Answer data (Answer, including positive Answer data and negative Answer data) in the cut-word question-Answer pair data into an Answer matching model;
step S408, converting the answer data into corresponding feature vectors;
step S409, inputting the feature vector corresponding to the answer data into a bidirectional recurrent neural network (BiGRU) layer to obtain answer data containing context information;
step S410, inputting the obtained answer data containing the context information into the multi-scale convolution layer to obtain a semantic vector of the answer data containing the context information;
step S411, inputting the obtained semantic vector into a pooling layer for pooling processing to obtain a semantic vector with a fixed size;
step S412, inputting the semantic vector with fixed size into an output layer to obtain the semantic vector representation of answer data;
in step S413, a cosine similarity between the semantic vector representation of the question data and the semantic vector representation of the answer data is calculated.
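A compact sketch of this scoring path (steps S401-S413), reusing the hypothetical ContextEncoder and MultiScaleConv modules above, is given below; both branches share parameters, and the final score is the cosine similarity of the two semantic vector representations:

```python
import torch.nn.functional as F

def score(question_ids, answer_ids, encoder, conv):
    """Score one question-answer pair as in Fig. 4."""
    q_vec = conv(encoder(question_ids))        # S402-S406: question branch
    a_vec = conv(encoder(answer_ids))          # S408-S412: answer branch
    return F.cosine_similarity(q_vec, a_vec)   # S413: cosine similarity
```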
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
In addition, in the present exemplary embodiment, a data matching apparatus is also provided. Referring to fig. 5, the data matching apparatus 500 includes: a data acquisition module 510, a data word cutting module 520, a similarity data determination module 530, and an answer data matching module 540. Wherein:
the data obtaining module 510 is configured to obtain question data and match candidate answer data corresponding to the question data;
the data word cutting module 520 is configured to perform word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form;
the similarity data determining module 530 is configured to input the question-answer pair data in the form of words into a pre-trained answer matching model, so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
the answer data matching module 540 is configured to determine matching answer data corresponding to the question data according to the similarity data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus further includes a context information generating unit configured to:
inputting the question-answer pair data in the form of words into the bidirectional recurrent neural network layer, and generating the question-answer pair data containing context information.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus further includes a question-answer pair feature vector generation unit configured to:
and performing feature extraction on the question-answer pair data containing the context information through the multi-scale convolution layer to obtain a question-answer pair feature vector corresponding to the question-answer pair data containing the context information.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data matching apparatus 500 further includes a similarity data determination unit configured to:
calculating first similarity data between the question vector and the positive answer feature vector;
and calculating second similarity data between the question vector and the negative answer feature vector.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the answer data matching module 540 is further configured to:
and inputting the first similarity data and the second similarity data into a maximum interval distance loss function corresponding to the answer matching model, and outputting the similarity data corresponding to the question-answer pair data.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data word cutting module 520 further includes a data word cutting unit configured to:
obtaining sample question-answer pair data, and preprocessing the sample question-answer pair data;
performing word cutting processing on the preprocessed sample question-answer pair data, and training a target word vector model on the word-cut sample question-answer pair data to obtain a word vector lookup table;
and performing word cutting processing on the question data and the candidate answer data according to the word vector lookup table.
In an exemplary embodiment of the present disclosure, based on the foregoing scheme, the data word cutting module 520 further includes a triple construction unit configured to:
and forming a triple set by the question data, the positive answer data and the negative answer data in the word form, and taking the triple set as the question-answer pair data in the word form.
The specific details of each module of the data matching apparatus have been described in detail in the corresponding data matching method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the data matching apparatus are mentioned, this division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the data matching method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 600 according to such an embodiment of the present disclosure is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example and should not bring any limitations to the function and scope of use of the embodiments of the present disclosure.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: the at least one processing unit 610, the at least one memory unit 620, a bus 630 connecting different system components (including the memory unit 620 and the processing unit 610), and a display unit 640.
Wherein the storage unit stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform steps according to various exemplary embodiments of the present disclosure as described in the above section "exemplary methods" of this specification. For example, the processing unit 610 may execute step S110 shown in fig. 1, obtain question data, and match candidate answer data corresponding to the question data; step S120, the question data and the candidate answer data are subjected to word cutting processing to generate question-answer pair data in a word form; step S130, inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model; step S140, determining matching answer data corresponding to the question data according to the similarity data.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 621 and/or a cache memory unit 622, and may further include a read-only memory unit (ROM) 623.
The storage unit 620 may also include a program/utility 624 having a set (at least one) of program modules 625, such program modules 625 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 670 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the present disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 7, a program product 700 for implementing the above-described data matching method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of data matching, comprising:
acquiring question data and matching candidate answer data corresponding to the question data;
performing word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form;
inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
and determining matching answer data corresponding to the question data according to the similarity data.
2. The data matching method of claim 1, wherein the answer matching model comprises a bidirectional recurrent neural network layer;
after inputting the question-answer pair data in the form of words into a pre-trained answer matching model, the method further comprises:
inputting the question-answer pair data in the form of words into the bidirectional recurrent neural network layer, and generating the question-answer pair data containing context information.
3. The data matching method of claim 2, wherein the answer matching model further comprises a multi-scale convolutional layer;
after generating the question-answer pair data containing the context information, the method further comprises:
and performing feature extraction on the question-answer data containing the context information through the multi-scale convolution layer to obtain a question-answer pair feature vector corresponding to the question-answer data containing the context information.
4. The data matching method according to claim 3, wherein the candidate answer data includes positive answer data and negative answer data; and the question-answer pair feature vector comprises a question vector, a positive answer feature vector, and a negative answer feature vector;
before determining, by the answer matching model, similarity data corresponding to the question-answer pair data, the method further includes:
calculating first similarity data between the question vector and the positive answer feature vector;
and calculating second similarity data between the question vector and the negative answer feature vector.
5. The data matching method according to claim 4, wherein the determining similarity data corresponding to the question-answer pair data through the answer matching model includes:
and inputting the first similarity data and the second similarity data into a maximum interval distance loss function corresponding to the answer matching model, and outputting the similarity data corresponding to the question-answer pair data.
6. The data matching method according to claim 1, wherein the performing word cutting processing on the question data and the candidate answer data includes:
obtaining sample question-answer pair data, and preprocessing the sample question-answer pair data;
performing word cutting processing on the preprocessed sample question-answer pair data, and training a target word vector model on the word-cut sample question-answer pair data to obtain a word vector lookup table;
and performing word cutting processing on the question data and the candidate answer data according to the word vector lookup table.
7. The data matching method according to claim 4, wherein the generating of the question-answer pair data in the form of words includes:
and forming a triple set by the question data, the positive answer data and the negative answer data in the word form, and taking the triple set as the question-answer pair data in the word form.
8. A data matching apparatus, comprising:
the data acquisition module is used for acquiring question data and matching candidate answer data corresponding to the question data;
the data word cutting module is used for carrying out word cutting processing on the question data and the candidate answer data to generate question-answer pair data in a word form;
the similarity data determining module is used for inputting the question-answer pair data in a word form into a pre-trained answer matching model so as to determine similarity data corresponding to the question-answer pair data through the answer matching model;
and the answer data matching module is used for determining matched answer data corresponding to the question data according to the similarity data.
9. An electronic device, comprising:
a processor; and
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the data matching method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a data matching method according to any one of claims 1 to 7.
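The claims above recite the model structure in prose; to illustrate one way the recited layers could fit together, here is a minimal, hypothetical PyTorch sketch of an answer matching model with an embedding lookup (the word vector lookup table of claim 6), a bidirectional recurrent layer (claim 2), a multi-scale convolutional layer (claim 3), cosine similarities over a (question, positive answer, negative answer) triplet (claims 4 and 7), and a max-margin loss (claim 5). All layer sizes and names, and the choice of LSTM and cosine similarity, are assumptions of this sketch, not details fixed by the claims.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnswerMatchingModel(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=64,
                 kernel_sizes=(2, 3, 4), channels=64, margin=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # word vector lookup table (claim 6)
        self.birnn = nn.LSTM(emb_dim, hidden, batch_first=True,
                             bidirectional=True)            # bidirectional recurrent layer (claim 2)
        self.convs = nn.ModuleList([                        # multi-scale convolutional layer (claim 3)
            nn.Conv1d(2 * hidden, channels, k) for k in kernel_sizes
        ])
        self.margin = margin

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        """Encode a batch of word-id sequences into a feature vector.

        Sequences shorter than the largest kernel size would need padding.
        """
        x = self.embed(token_ids)          # (batch, seq, emb)
        x, _ = self.birnn(x)               # question-answer data with context information
        x = x.transpose(1, 2)              # (batch, 2*hidden, seq) for Conv1d
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.cat(feats, dim=1)     # question-answer pair feature vector

    def forward(self, question, pos_answer, neg_answer):
        q = self.encode(question)
        pos = self.encode(pos_answer)      # positive answer feature vector
        neg = self.encode(neg_answer)      # negative answer feature vector
        sim_pos = F.cosine_similarity(q, pos)   # first similarity data (claim 4)
        sim_neg = F.cosine_similarity(q, neg)   # second similarity data (claim 4)
        # max-margin loss over the triplet (claim 5): push the positive
        # answer's similarity above the negative answer's by at least `margin`
        loss = F.relu(self.margin - sim_pos + sim_neg).mean()
        return sim_pos, sim_neg, loss
```

Given three (batch, seq_len) integer tensors of word ids drawn from the lookup table, sim_pos, sim_neg, loss = model(q_ids, pos_ids, neg_ids) yields the two similarity vectors and a scalar training loss.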
CN202010855824.2A 2020-08-24 2020-08-24 Data matching method and device, electronic equipment and storage medium Pending CN113806474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010855824.2A CN113806474A (en) 2020-08-24 2020-08-24 Data matching method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010855824.2A CN113806474A (en) 2020-08-24 2020-08-24 Data matching method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113806474A true CN113806474A (en) 2021-12-17

Family

ID=78943493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010855824.2A Pending CN113806474A (en) 2020-08-24 2020-08-24 Data matching method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113806474A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180058449A (en) * 2016-11-24 2018-06-01 주식회사 솔트룩스 System and method of semantic search using word vector
CN107980130A (en) * 2017-11-02 2018-05-01 深圳前海达闼云端智能科技有限公司 It is automatic to answer method, apparatus, storage medium and electronic equipment
CN108845990A (en) * 2018-06-12 2018-11-20 北京慧闻科技发展有限公司 Answer selection method, device and electronic equipment based on two-way attention mechanism
CN108932349A (en) * 2018-08-17 2018-12-04 齐鲁工业大学 Medical automatic question-answering method and device, storage medium, electronic equipment
CN109766423A (en) * 2018-12-29 2019-05-17 上海智臻智能网络科技股份有限公司 Answering method and device neural network based, storage medium, terminal
CN110287298A (en) * 2019-05-30 2019-09-27 南京邮电大学 A kind of automatic question answering answer selection method based on question sentence theme
CN110222163A (en) * 2019-06-10 2019-09-10 福州大学 A kind of intelligent answer method and system merging CNN and two-way LSTM
CN110609886A (en) * 2019-09-18 2019-12-24 北京金山数字娱乐科技有限公司 Text analysis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Zhihao; YU Xiang; LIU Zichen; QIU Dawei; GU Bengang: "Chinese medical question and answer matching method based on attention and character embedding", Journal of Computer Applications (计算机应用), no. 06, 29 January 2019 (2019-01-29) *

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN107273503B (en) Method and device for generating parallel text in same language
US10838951B2 (en) Query interpretation disambiguation
CN112287069B (en) Information retrieval method and device based on voice semantics and computer equipment
CN115292457B (en) Knowledge question answering method and device, computer readable medium and electronic equipment
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN116303537A (en) Data query method and device, electronic equipment and storage medium
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN114647713A (en) Knowledge graph question-answering method, device and storage medium based on virtual confrontation
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
CN113705207A (en) Grammar error recognition method and device
CN113569018A (en) Question and answer pair mining method and device
CN117473057A (en) Question-answering processing method, system, equipment and storage medium
CN117349515A (en) Search processing method, electronic device and storage medium
CN116701604A (en) Question and answer corpus construction method and device, question and answer method, equipment and medium
US20230070966A1 (en) Method for processing question, electronic device and storage medium
CN111753062A (en) Method, device, equipment and medium for determining session response scheme
US11880664B2 (en) Identifying and transforming text difficult to understand by user
CN114491076B (en) Data enhancement method, device, equipment and medium based on domain knowledge graph
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN113806474A (en) Data matching method and device, electronic equipment and storage medium
CN114155957A (en) Text determination method and device, storage medium and electronic equipment
CN114186020A (en) Semantic association method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination