CN110969005A - Method and device for determining similarity between entity corpora - Google Patents

Publication number
CN110969005A
CN110969005A (application CN201811151935.4A)
Authority
CN
China
Prior art keywords
training, entity, test, corpus, entity corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811151935.4A
Other languages
Chinese (zh)
Other versions
CN110969005B (en)
Inventor
王芳
林文辉
***
孙科武
杨硕
赖新明
王亚平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aisino Corp
Original Assignee
Aisino Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aisino Corp filed Critical Aisino Corp
Priority to CN201811151935.4A priority Critical patent/CN110969005B/en
Publication of CN110969005A publication Critical patent/CN110969005A/en
Application granted granted Critical
Publication of CN110969005B publication Critical patent/CN110969005B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining the similarity between entity corpora. A training device randomly extracts a training set from a preset entity corpus, pairs the entity corpora in the training set to obtain training entity corpus relationship pairs, obtains the matrix vector corresponding to each training entity corpus relationship pair, and processes the matrix vectors with a convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair, thereby completing the training of the convolutional neural network. An intelligent customer service built on the trained convolutional neural network and the preset entity corpus can then accurately search for the answers to users' questions. This solves the technical problem in the prior art that, because the information input by a user is not accurate, an intelligent customer service system cannot find the correct answer in its knowledge base, which degrades the user experience.

Description

Method and device for determining similarity between entity corpora
Technical Field
The invention relates to the technical field of deep learning, in particular to a method and a device for determining similarity between entity corpora.
Background
With the rapid development of artificial intelligence technology, relationships extracted between entity corpora are often applied to text search; in the tax domain, for example, the relationship between tax entity corpora refers to their similarity. Methods for extracting the relationship between entity corpora fall into three categories. The first is supervised learning, which treats relationship extraction as a classification problem: effective features are designed from training data to learn classification models, and the trained classifier is then used to predict relationships. Its disadvantage is that training entity corpora requires a large amount of manual labeling, and corpus labeling is usually time-consuming and labor-intensive. The second is semi-supervised learning, which mainly uses Bootstrapping for relationship extraction: for the relationship to be extracted, several seed instances are first set manually, and the corresponding relationship templates and more instances are then extracted from the data iteratively. The third is unsupervised learning, which assumes that entity pairs with the same semantic relationship have similar context information; the semantic relationship of each entity corpus relationship pair can therefore be represented by the context information corresponding to that pair, and the semantic relationships of all entity pairs are clustered.
Existing supervised relationship extraction methods have achieved good results, but they rely heavily on natural language processing annotations such as part-of-speech tagging and syntactic parsing to provide classification features. Such annotations usually contain many errors, and these errors propagate and are amplified through the relationship extraction system, ultimately degrading the extraction result.
For example, in existing intelligent customer service systems, tax services are entering the intelligent era of "Internet + tax". Intelligent customer service provides convenient, intelligent and ubiquitous service for taxpayers: in an intelligent customer service system such as a city's WeChat official account, a taxpayer can enter a question by voice or text at the consultation entrance, and the intelligent customer service finds a matching answer in a tax knowledge base through artificial intelligence technologies such as speech recognition and natural language understanding, then feeds the answer back to the taxpayer as text, illustrated text, web links and so on. However, because taxpayers are distributed all over the country, during tax consultation Mandarin is mixed with various dialects, and the spoken expressions of tax entities differ between regions or are not rigorous. An intelligent customer service system generally cannot accurately match such nonstandard spoken content with the standard answers, so answers cannot be found quickly and satisfaction with the intelligent question answering system is low. For example, the "tax disc" in one taxpayer's speech and the "golden tax disc" in the standard knowledge base refer to the same thing but are different words; the intelligent customer service system cannot treat the spoken expression and the answer in the standard knowledge base as an exact match, so it cannot complete an accurate search for the answer, and satisfaction with the intelligent customer service system suffers.
Therefore, the prior art has at least the following technical problems:
for an intelligent customer service system, because the information input by the user is not accurate, the system cannot find the correct answer in its knowledge base, and the user experience is therefore degraded.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining the similarity between entity corpora, which are used for solving the technical problem in the prior art that an intelligent customer service system cannot find the correct answer in its knowledge base because the information input by the user is not accurate, which degrades the user experience.
In a first aspect, an embodiment of the present invention provides a method for determining similarity between entity corpuses, including:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
matching any entity corpus in the training set with each entity corpus except the entity corpus until all the entity corpuses in the training set are matched, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training statement matrix vector corresponding to each training entity corpus relationship pair;
processing the training statement matrix vectors by using a convolutional neural network to obtain the training classification probability of the training entity corpus relationship pairs;
and determining the similarity between the training entity corpora in the training entity corpus relationship pair based on the training classification probability.
Optionally, the obtaining of each training statement matrix vector corresponding to each training entity corpus relationship pair specifically includes:
acquiring a first set of word vectors corresponding to all words forming the training set, wherein each entity corpus in the training set is formed by a plurality of words;
and acquiring a training sentence matrix vector of each training entity corpus relationship pair based on the first set, wherein the training sentence matrix vector is composed of a plurality of word vectors.
Optionally, the processing the matrix vector of each training statement by using a convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair specifically includes: performing convolution operation on each training statement matrix vector to acquire training characteristic information corresponding to the training entity corpus relationship pair;
sampling each training characteristic information to obtain a plurality of training optimal characteristics of each training entity corpus pair;
combining the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and processing each training local optimal characteristic by using a Softmax model to obtain the training classification probability of each training entity corpus pair.
Optionally, the randomly extracting the training set from the preset entity corpus specifically includes:
extracting a training set and a test set from a preset entity corpus by using a random extraction algorithm; the union of the training set and the test set is the preset entity corpus, and the training set and the test set have no intersection.
After the training statement matrix vectors are processed by using the convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair, the method further includes:
pairing any entity corpus in the test set with each entity corpus except the entity corpus until all the entity corpuses in the test set are paired, so as to obtain a plurality of test entity corpus relation pairs, wherein the test set consists of a plurality of entity corpuses;
obtaining each test statement matrix vector corresponding to each test entity corpus relationship pair;
processing each test statement matrix vector by using the convolutional neural network to obtain the test classification probability of each test entity corpus relationship pair;
and outputting the test classification probability of each test entity corpus relationship pair, so that a user can judge, based on the classification probability, whether the convolutional neural network needs to be trained again.
Optionally, the obtaining of each test statement matrix vector corresponding to each test entity corpus relationship pair specifically includes:
acquiring a second set of word vectors corresponding to all words forming the test set, wherein each entity corpus in the test set is formed by a plurality of words;
and acquiring a test statement matrix vector of each test entity corpus relationship pair based on the second set, wherein the test statement matrix vector is composed of a plurality of word vectors.
Optionally, the processing the matrix vector of each test statement by using the convolutional neural network to obtain the test classification probability of each test entity corpus relationship pair specifically includes:
performing convolution operation on each test statement matrix vector to acquire test characteristic information corresponding to the test entity corpus relationship pair;
sampling each test characteristic information to obtain a plurality of test optimal characteristics of each test entity corpus pair;
merging the test optimal features to obtain test local optimal features of the corpus pairs of the test entities;
and processing the local optimal characteristics of each test by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
In a second aspect, an embodiment of the present invention provides an apparatus for determining similarity between entity corpuses, including:
the extraction unit is used for randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
a first matching unit, configured to match any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpuses in the training set are matched, so as to obtain a plurality of training entity corpus relationship pairs;
the first acquisition unit is used for acquiring each training statement matrix vector corresponding to each training entity corpus relationship pair;
a second obtaining unit, configured to process the matrix vectors of the training sentences by using a convolutional neural network, and obtain a training classification probability of each training entity corpus relationship pair;
and the determining unit is used for determining the similarity between the training entity corpora in the training entity corpus relationship pair based on the training classification probability.
Optionally, the first obtaining unit specifically includes:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words forming the training set, where each entity corpus in the training set is formed by a plurality of words;
and a second obtaining subunit, configured to obtain a training sentence matrix vector of each training entity corpus relationship pair based on the first set, where the training sentence matrix vector is formed by a plurality of word vectors.
Optionally, the second obtaining unit specifically includes:
the first operation subunit is used for performing convolution operation on the matrix vectors of the training sentences to acquire training characteristic information corresponding to the training entity corpus relationship pairs;
the first sampling subunit is used for sampling and processing each training characteristic information to obtain a plurality of training optimal characteristics of each training entity corpus pair;
the first merging subunit is used for merging the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and the first classification subunit is used for processing each training local optimal feature by using a Softmax model and acquiring the training classification probability of each training entity corpus pair.
Optionally, the apparatus further comprises:
a second pairing unit, configured to, after the training classification probability of each training entity corpus relationship pair is obtained by processing the training statement matrix vectors with the convolutional neural network, pair any entity corpus in the test set with each entity corpus except that entity corpus until all entity corpora in the test set are paired, thereby obtaining a plurality of test entity corpus relationship pairs, where the test set is composed of a plurality of entity corpora;
a third obtaining unit, configured to obtain each test statement matrix vector corresponding to each test entity corpus relationship pair;
a fourth obtaining unit, configured to process each test statement matrix vector by using the convolutional neural network and obtain the test classification probability of each test entity corpus relationship pair;
and the output unit is used for outputting the test classification probability of each test entity corpus relationship pair, so that a user can judge, based on the classification probability, whether the convolutional neural network needs to be trained again.
Optionally, the third obtaining unit specifically includes:
a third obtaining subunit, configured to obtain a second set of word vectors corresponding to all words forming the test set, where each entity corpus in the test set is formed by a plurality of words;
and a fourth obtaining subunit, configured to obtain, based on the second set, a test statement matrix vector of each test entity corpus relationship pair, where the test statement matrix vector is formed by a plurality of word vectors.
Optionally, the fourth obtaining unit specifically includes:
the second operation subunit is used for performing convolution operation on the matrix vectors of the test statements to acquire test characteristic information corresponding to the corpus relationship pair of the test entity;
the second sampling subunit is used for sampling and processing each test characteristic information to obtain a plurality of test optimal characteristics of each test entity corpus pair;
the second merging subunit is used for merging the test optimal features to obtain the test local optimal features of each test entity corpus pair;
and the second classification subunit is used for processing the local optimal characteristics of each test by using a Softmax model and acquiring the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
In a third aspect, an embodiment of the present invention provides an apparatus for determining similarity between entity corpuses, including:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the at least one processor performing the method as described in the first aspect above by executing the instructions stored by the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, including:
the computer-readable storage medium has stored thereon computer instructions which, when executed by at least one processor of the apparatus for determining similarity between entity corpuses, implement the method as described in the above first aspect.
One or more technical solutions provided in the embodiments of the present invention have at least the following technical effects or advantages:
in the invention, the device for determining the similarity between entity corpora performs the method for determining the similarity between entity corpora: it randomly extracts a training set from a preset entity corpus, pairs any entity corpus in the training set with each entity corpus except that entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relationship pairs, obtains each training statement matrix vector corresponding to each training entity corpus relationship pair, and processes each training statement matrix vector with a convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair. This completes the learning process of the convolutional neural network for the preset entity corpus, so that an intelligent customer service built on the convolutional neural network and the preset entity corpus can provide users with an accurate answer search function. The problem in the prior art that an intelligent customer service system cannot find the correct answer in its knowledge base because the information input by the user is not accurate is thereby solved, and the technical effect of improving the user experience is achieved.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for determining similarity between entity corpuses according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for determining a training classification probability of a corpus relationship pair of a training entity using a convolutional neural network according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for determining whether retraining the convolutional neural network is required according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an intelligent customer service system according to an embodiment of the present invention, which employs a method for determining similarity between entity corpuses;
FIG. 6 is a schematic structural diagram illustrating an apparatus for determining similarity between entity corpuses according to an embodiment of the present invention;
fig. 7 is a schematic physical structure diagram of an apparatus for determining similarity between entity corpuses according to an embodiment of the present invention.
Detailed Description
In order to solve the technical problem, the technical scheme in the embodiment of the invention has the following general idea:
a method and a device for determining similarity between entity corpora specifically include:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
matching any entity corpus in the training set with each entity corpus except the entity corpus until all the entity corpuses in the training set are matched, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training statement matrix vector corresponding to each training entity corpus relationship pair;
processing the training statement matrix vectors by using a convolutional neural network to obtain the training classification probability of the training entity corpus relationship pairs;
and determining the similarity between the training entity corpora in the training entity corpus relationship pair based on the training classification probability.
In order to better understand the technical solutions, the technical solutions will be described in detail below with reference to the drawings and the specific embodiments of the specification, and it should be understood that the embodiments and specific features of the embodiments of the present invention are detailed descriptions of the technical solutions of the present invention, and are not limitations of the technical solutions of the present invention, and the technical features of the embodiments and examples of the present invention may be combined with each other without conflict.
In an embodiment of the invention, the convolutional neural network comprises a convolutional layer, a pooling layer and a fully-connected layer. The Convolutional Neural Network (CNN) originates from studies of the neural mechanism of vision: Hubel et al. found that a network structure exists in the visual neural mechanism which can reduce the complexity of the network and is invariant to changes such as scaling and translation, and this gave rise to the convolutional neural network. Referring to fig. 1, the basic structure of a CNN is a hierarchical recursive network structure that mainly comprises two kinds of layers, convolutional layers and sampling layers, together with fully-connected layers; the input of the convolutional neural network is given in the form of a matrix vector. The convolutional layer is also called the feature extraction layer, and the sampling layer is called the feature mapping layer or pooling layer. This two-layer structure can be understood as reducing the feature dimension and the number of parameters to optimize, which is the advantage of the convolutional network over fully-connected neural networks. Sharing local weights reduces the number of parameters in the network, which has proved effective in speech recognition and image processing. Based on these advantages, the convolutional neural network also has great advantages in text processing.
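The parameter saving from local weight sharing described above can be illustrated with a quick count (the sentence length, vector dimension and window size below are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes: a sentence of n = 20 words with k = 50-dimensional
# word vectors, convolved with a filter window of size m = 3.
n, k = 20, 50
m = 3

# A convolutional layer with one m x k filter shares its weights across
# every window position in the sentence:
conv_params = m * k + 1                            # weights + bias

# A fully connected layer mapping the same n x k input to the same
# n - m + 1 outputs needs a separate weight per input per output:
dense_params = (n * k) * (n - m + 1) + (n - m + 1)  # weights + biases

assert conv_params == 151
assert dense_params == 18018
```

Two orders of magnitude fewer parameters to optimize is exactly the "reducing the optimized parameters" advantage the description attributes to weight sharing.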
The training device can be any terminal equipment which can run a computer program, such as a mobile phone, a tablet computer, a desktop computer and the like;
an entity corpus relationship pair may be composed of two entity corpuses or multiple entity corpuses, e.g., an entity corpus relationship pair may be represented as<e1,e2>Wherein e is1,e2∈E,e1,e2Is an entity corpus, and E is a preset entity corpus;
the sentence matrix vector corresponding to the entity corpus relationship pair may be represented as X, where X is a two-dimensional matrix of n × k, where n is the length of the word of the entity corpus relationship pair, and k is the word vector X of the ith word constituting the entity corpus relationship pairiIs determined by the total number of words.
The above list is merely illustrative and not intended to be a specific limitation of the embodiment of the present invention.
Referring to fig. 2, an embodiment of the present invention provides a method for determining similarity between entity corpuses, including the following steps:
step S101, a training set is randomly extracted from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora.
Step S102, any entity corpus in the training set is paired with each entity corpus except the entity corpus until all the entity corpora in the training set are paired, and therefore a plurality of training entity corpus relation pairs are obtained.
Step S103, obtaining each training statement matrix vector corresponding to each training entity corpus relationship pair.
And step S104, processing each training statement matrix vector by using a convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair.
Step S105, based on the training classification probability, determining the similarity between the training entity corpora in the training entity corpus relationship pair.
Firstly, step S101 is executed to randomly extract a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora.
Specifically, the algorithm used in the training set extraction may be a random extraction algorithm, or may be other algorithms that can implement a random extraction function, and is not limited herein.
Further, randomly extracting the training set from the preset entity corpus also comprises randomly extracting the test set from the preset entity corpus; the union of the training set and the test set is the preset entity corpus, and the training set and the test set have no intersection.
Specifically, the method for extracting the test set includes extracting the test set while randomly extracting the training set from the predetermined entity corpus, for example, simultaneously extracting the training set and the test set from the predetermined entity corpus by using a random extraction algorithm; another method for extracting the test set is to use a part except the training set in a preset entity corpus as the test set after the training set is extracted; another method for extracting the test set is to randomly extract the test set from the predetermined entity corpus by using an algorithm, and then use the part of the predetermined entity corpus other than the test set as a training set.
In addition, the ratio of the training set to the test set can be freely set, for example, the ratio of the training set to the test set is 1:2, that is, the number of the entity corpora constituting the training set is one third of the total number of the entity corpora in the predetermined entity corpus.
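The split described above can be sketched as follows (the corpus contents, the helper name `split_corpus` and the fixed seed are illustrative assumptions, not from the patent):

```python
import random

def split_corpus(corpus, train_ratio=1/3, seed=0):
    """Randomly split a preset entity corpus into a training set and a test
    set whose union is the whole corpus and whose intersection is empty."""
    items = list(corpus)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train_ratio)
    return items[:n_train], items[n_train:]

corpus = [f"entity_{i}" for i in range(30)]
train, test = split_corpus(corpus)        # 1:2 train:test ratio, as in the example

assert len(train) == 10 and len(test) == 20
assert set(train) | set(test) == set(corpus)   # union is the preset corpus
assert set(train) & set(test) == set()         # no intersection
```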
After the training set is extracted, step S102 is executed to pair any entity corpus in the training set with each entity corpus except the entity corpus until all the entity corpuses in the training set are paired, so as to obtain a plurality of training entity corpus relationship pairs.
Specifically, suppose the training set contains 10 entity corpora numbered 1, 2, …, 10. To obtain the training entity corpus relationship pairs, entity corpus 1 may be paired with entity corpora 2-10 to obtain 9 training entity corpus relationship pairs, then entity corpus 2 with entity corpora 3-10 to obtain 8 more, and so on, until entity corpus 9 is paired with entity corpus 10, giving a total of 45 training entity corpus relationship pairs. The order and manner of pairing the entity corpora in the training set are not limited, as long as all entity corpora in the training set are paired and no entity corpus relationship pair is repeated.
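This exhaustive pairing without repeats is exactly the set of unordered 2-combinations, which can be checked in a couple of lines (a sketch of the example above, not the patent's own implementation):

```python
from itertools import combinations

corpora = list(range(1, 11))             # 10 entity corpora numbered 1..10
pairs = list(combinations(corpora, 2))   # every unordered pair, no repeats

assert len(pairs) == 45                  # 9 + 8 + ... + 1 = 45 relationship pairs
assert pairs[:2] == [(1, 2), (1, 3)]     # pairing starts from entity corpus 1
```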
After step S102 is completed, step S103 is executed to obtain each training sentence matrix vector corresponding to each training entity corpus relationship pair.
Further, the obtaining of each training statement matrix vector corresponding to each training entity corpus relationship pair specifically includes:
acquiring a first set of word vectors corresponding to all words forming the training set, wherein each entity corpus in the training set is formed by a plurality of words;
and acquiring a training sentence matrix vector of each training entity corpus relationship pair based on the first set, wherein the training sentence matrix vector is composed of a plurality of word vectors.
Specifically, to obtain the first set of word vectors corresponding to all the words constituting the training set, the word vector of each word can be obtained with a Word2Vec model, which converts natural language into a vector form a computer can recognize. For example, suppose an entity corpus in the training set is "I love Beijing", which contains 3 words: "I", "love" and "Beijing". Converting these into word vectors, the three word vectors could be [1,0,0], [0,1,0] and [0,0,1]; the length of a word vector is determined by the number of non-repeated words constituting the preset entity corpus. Each word corresponds to one word vector, and the first set comprises the word vectors corresponding to all non-repeated words constituting the training set.
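The [1,0,0]-style vectors in this example are one-hot encodings (a trained Word2Vec model would instead produce dense vectors); the example can be reproduced as:

```python
# One-hot word vectors for the "I love Beijing" example; the vector length
# equals the number of non-repeated words (3 here, a toy vocabulary).
vocab = ["I", "love", "Beijing"]
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

assert one_hot["I"] == [1, 0, 0]
assert one_hot["love"] == [0, 1, 0]
assert one_hot["Beijing"] == [0, 0, 1]
```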
After the first set is obtained, based on the first set, a training sentence matrix vector of each training entity corpus relationship pair in the training set may be obtained, for example, each training entity corpus relationship pair is composed of two training entity corpora, that is, composed of a plurality of words, based on the first set, a training sentence matrix vector corresponding to the training entity corpus relationship pair may be obtained, where the matrix vector is composed of word vectors corresponding to the words composing the training entity corpus relationship pair.
After the step S103 is executed, step S104 is executed, the convolutional neural network is used to process the matrix vectors of the training sentences, and the training classification probability of the corpus relationship pairs of the training entities is obtained.
Further, referring to fig. 3, processing the matrix vectors of the training sentences by using a convolutional neural network to obtain the training classification probability of the corpus relationship pairs of the training entities, specifically including the following steps:
step S104a, performing convolution operation on each training statement matrix vector to obtain training characteristic information corresponding to the training entity corpus relation pair.
Step S104b, sampling each training characteristic information, and obtaining a plurality of training optimal characteristics of each training entity corpus pair.
Step S104c, merging the training optimal features to obtain the training local optimal features of each training entity corpus pair.
And step S104d, processing each training local optimal feature by using a Softmax model, and acquiring the training classification probability of each training entity corpus pair.
In step S104, step S104a is executed first, a convolution operation is performed on each training sentence matrix vector, and training feature information corresponding to the training entity corpus relationship pair is acquired.
Specifically, after the training sentence matrix vector is input into the convolutional neural network, the convolutional layer of the convolutional neural network performs a convolution operation on it to obtain the training feature information corresponding to the training entity corpus. For example, the convolutional layer presets the size of a filtering window through a filter and convolves the input matrix vector with that filter. If the size of the filtering window is m and a bias is added, the feature information after the convolution operation can be expressed as:
c_i = f(w · x_{i:i+m-1} + b)
where c_i is the i-th feature value after the convolution operation, f(·) is the convolution kernel function selected for the layer, w is the weight matrix of the filter, with w ∈ R^(h×m), h×m being the size of the selected filtering window, b ∈ R is a bias term, and x_{i:i+m-1} denotes the span from the i-th word to the (i+m-1)-th word of the sentence. In addition, the convolutional layer may perform convolution operations using a plurality of filters, each of which may set its own filter-window size.
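A minimal sketch of this convolution step c_i = f(w · x_{i:i+m-1} + b), assuming a ReLU activation for f and flattening each m-word window before the dot product (the patent does not fix the kernel function; all names are illustrative):

```python
def convolve(sentence_matrix, w, b, m):
    """Slide an m-word window over the sentence; one feature value per window."""
    feats = []
    n = len(sentence_matrix)
    for i in range(n - m + 1):
        # flatten the window x_{i:i+m-1} into one vector
        window = [v for row in sentence_matrix[i:i + m] for v in row]
        s = sum(wi * xi for wi, xi in zip(w, window)) + b
        feats.append(max(0.0, s))  # f taken as ReLU here
    return feats

mat = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]    # 3 words, 3-dim word vectors
feats = convolve(mat, w=[0.5] * 6, b=0.0, m=2)
# n - m + 1 == 2 feature values, matching c = [c_1, ..., c_{n-h+1}]
```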
After the convolution layer of the convolutional neural network performs convolution operation on the training sentence matrix vector, the obtained training feature information corresponding to the training entity corpus can be represented as a feature matrix c:
c = [c_1, c_2, …, c_{n-h+1}]

where c ∈ R^(n-h+1).
After the feature information is obtained, step S104b is executed to perform sampling processing on each training feature information, and obtain a plurality of training optimal features of each training entity corpus pair.
Specifically, after the convolutional layer of the convolutional neural network performs convolution operations on the training sentence matrix vector, a plurality of convolution results (e.g., the feature matrix c) are obtained. The pooling layer of the convolutional neural network may sample these convolution results by max-pooling, taking the maximum value

ĉ = max{c} = max{c_1, c_2, …, c_{n-h+1}}

to obtain the training optimal features of each training entity corpus pair.
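The max-pooling step reduces each filter's feature vector c to its single maximum value, one "optimal feature" per filter (values below are illustrative):

```python
def max_pool(feature_maps):
    """Take the maximum of each filter's feature vector (max-pooling)."""
    return [max(c) for c in feature_maps]

pooled = max_pool([[0.2, 0.9, 0.1], [0.4, 0.3]])
# pooled == [0.9, 0.4]: one optimal feature per filter
```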
After the step S104b is completed, the step S104c is executed to perform merging processing on the training optimal features, so as to obtain the training local optimal features of each training entity corpus pair.
Specifically, the plurality of training optimal features obtained from the convolution results are merged, so that they are combined into one training local optimal feature; this aggregates the statistics of the training optimal features and reduces the dimensionality of the features.
And after the step S104c is executed, the step S104d is executed, the Softmax model is used for processing each training local optimal feature, and the training classification probability of each training entity corpus pair is obtained.
Specifically, after receiving the training local optimal features, the fully-connected layer of the convolutional neural network performs relationship classification on the local optimal features by using a Softmax model to obtain classification probability.
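A sketch of the Softmax step used by the fully-connected layer: scores for the two relation classes (similar / non-similar) are mapped to classification probabilities that sum to 1 (the scores below are illustrative):

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities summing to 1."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5])
# probs[0] would be read as the probability of the "similar" class
```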
After step S104 or step S104d is completed, step S105 is executed to determine the similarity between the training entity corpuses in the training entity corpus relationship pair based on the training classification probability.
Specifically, the similarity between the training entity corpora in a training entity corpus relationship pair may be represented as similar (Y) or non-similar (N), and is determined from the training classification probability: if the training classification probability is greater than a preset threshold, the training entity corpora in the pair are determined to be similar; if it is less than the preset threshold, they are determined to be non-similar.
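The threshold rule above can be sketched as follows; the 0.5 value is an assumed example, since the patent leaves the preset threshold unspecified:

```python
def decide(prob, threshold=0.5):
    """Y (similar) if the classification probability exceeds the threshold."""
    return "Y" if prob > threshold else "N"

# decide(0.8) -> "Y"; decide(0.3) -> "N"
```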
Further, referring to fig. 4, after the step S105 is executed, the training method further includes the following steps:
step S201, pairing any entity corpus in the test set with each entity corpus except the entity corpus until all entity corpuses in the test set are paired, thereby obtaining a plurality of test entity corpus relationship pairs, wherein the test set is composed of a plurality of entity corpuses.
Step S202, obtaining each test statement matrix vector corresponding to each test entity corpus relationship pair.
Step S203, processing the matrix vectors of the test statements by using the convolutional neural network, and acquiring the classification probability of the corpus relationship pairs of the test entities.
And step S204, outputting the classification probability of each corpus relationship pair of the test entities, so that a user can judge whether the convolutional neural network needs to be trained again based on the classification probability.
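The pairing of step S201 (each entity corpus paired with every other one until all are paired) can be sketched as enumerating all unordered pairs (names illustrative):

```python
from itertools import combinations

def make_pairs(corpora):
    """Pair every entity corpus with every other one (unordered pairs)."""
    return list(combinations(corpora, 2))

pairs = make_pairs(["corpus_a", "corpus_b", "corpus_c"])
# 3 corpora yield 3 relationship pairs: (a,b), (a,c), (b,c)
```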
After step S104d is executed, step S201, step S202, step S203, and step S204 are executed in sequence, wherein the specific method for executing the test set in step S201, step S202, and step S203 is the same as the specific method for executing the training set in step S102, step S103, and step S104, respectively, and is not described herein again.
When step S203 is executed, the method specifically includes:
performing convolution operation on each test statement matrix vector to acquire test characteristic information corresponding to the test entity corpus relationship pair;
sampling each test characteristic information to obtain a plurality of test optimal characteristics of each test entity corpus pair;
merging the test optimal features to obtain test local optimal features of the corpus pairs of the test entities;
and processing the local optimal characteristics of each test by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
The specific method for executing the test set specifically included in step S203 is the same as the specific method for executing the training set in steps S104a to S104d, and is not repeated here.
For step S204, specifically, the classification probability of each test entity corpus relationship pair is output, where the classification probability includes a group of classification probability values, each being the relative probability of a class corresponding to the value of a local optimal feature. The convolutional neural network outputs the classification probabilities and the words corresponding to them, and the output result is evaluated using the following formulas:

accuracy_i = r_i / t_i

recall_i = r_i / a_i

F1 = 2 · accuracy_i · recall_i / (accuracy_i + recall_i)

where r_i denotes the number of test entity corpus relationship pairs of the i-th class that are correctly classified, t_i denotes the total number of test entity corpus relationship pairs determined to be of the i-th class, a_i denotes the total number of test entity corpus relationship pairs of the i-th class in the test set, and F1 is the resulting composite index.
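A sketch of these evaluation measures, assuming accuracy = r_i / t_i, recall = r_i / a_i, and F1 as their harmonic mean (the counts below are illustrative):

```python
def evaluate(r_i, t_i, a_i):
    """Per-class accuracy, recall, and F1 from the patent's counts."""
    accuracy = r_i / t_i        # correctly classified / predicted as class i
    recall = r_i / a_i          # correctly classified / actual class i in test set
    f1 = 2 * accuracy * recall / (accuracy + recall)
    return accuracy, recall, f1

acc, rec, f1 = evaluate(r_i=60, t_i=100, a_i=80)
# acc == 0.6, rec == 0.75
```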
Further, the convolutional neural network outputs the accuracy, recall rate, and F1 to the user, so that the user can judge, based on the accuracy, recall rate, and F1, whether the convolutional neural network needs to be trained again. For example, if the accuracy is 60%, the user may determine that the convolutional neural network needs to be trained again.
Further, when the convolutional neural network outputs the accuracy, recall rate, and F1 to the user, the accuracy, recall rate, and F1 may be output in text format through a display interface of the training device.
Further, the preset entity corpus is a preset tax entity corpus, for example, the preset tax entity corpus may be an intelligent customer service tax knowledge base (the knowledge base has 7000 relevant tax knowledge items and 11000 expansion problems).
For example, referring to fig. 5, the method for determining similarity between entity corpora is applied to an intelligent customer service system, where the preset entity corpus is a preset tax entity corpus, and the method executed by the intelligent customer service system is as follows:
randomly extracting a training set or a test set from the preset tax entity corpus, pairing the entity corpora in the training set or the test set, and acquiring a plurality of training entity corpus relationship pairs or test entity corpus relationship pairs. If training entity corpus relationship pairs are input into the convolutional neural network, the output is the similarity between the training entity corpora, which can be similar or non-similar; if test entity corpus relationship pairs are input, the output result is the accuracy, recall rate, F1, and the like, which give the user information for deciding whether retraining is necessary.
After the intelligent customer service system executes the method, even if the information a user inputs is inaccurate, the system can search for similar entity corpora in the preset tax entity corpus based on that information, and thereby find the answers the user requires.
Referring to fig. 6, based on the same inventive concept, a second embodiment of the present invention provides an apparatus for determining similarity between entity corpuses, including:
an extracting unit 601, configured to randomly extract a training set from a preset entity corpus, where the training set is composed of a plurality of entity corpora;
a first pairing unit 602, configured to pair any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpuses in the training set are paired, so as to obtain a plurality of training entity corpus relationship pairs;
a first obtaining unit 603, configured to obtain each training statement matrix vector corresponding to each training entity corpus relationship pair;
a second obtaining unit 604, configured to process the matrix vectors of the training sentences by using a convolutional neural network, and obtain a training classification probability of each training entity corpus relationship pair;
a determining unit 605, configured to determine, based on the training classification probability, a similarity between the training entity corpuses in the training entity corpus relationship pair.
Optionally, the first obtaining unit specifically includes:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words forming the training set, where each entity corpus in the training set is formed by a plurality of words;
and a second obtaining subunit, configured to obtain a training sentence matrix vector of each training entity corpus relationship pair based on the first set, where the training sentence matrix vector is formed by a plurality of word vectors.
Optionally, the second obtaining unit specifically includes:
the first operation subunit is used for performing convolution operation on the matrix vectors of the training sentences to acquire training characteristic information corresponding to the training entity corpus relationship pairs;
the first sampling subunit is used for sampling and processing each training characteristic information to obtain a plurality of training optimal characteristics of each training entity corpus pair;
the first merging subunit is used for merging the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and the first classification subunit is used for processing each training local optimal feature by using a Softmax model and acquiring the training classification probability of each training entity corpus pair.
Optionally, the extracting unit is further configured to:
extracting a test set from a preset entity corpus by using a random extraction algorithm; the union of the training set and the test set is the preset entity corpus, and the training set and the test set have no intersection.
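An illustrative random split of a preset entity corpus into disjoint training and test sets whose union is the whole corpus; the 80/20 ratio and fixed seed are assumptions, since the patent does not specify the extraction algorithm:

```python
import random

def split_corpus(corpus, train_ratio=0.8, seed=0):
    """Randomly partition the corpus into disjoint training and test sets."""
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split_corpus([f"corpus_{i}" for i in range(10)])
# no intersection; union equals the preset entity corpus
```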
Optionally, the apparatus further comprises:
a second pairing unit, configured to pair any entity corpus in the test set with each entity corpus except the entity corpus after the test classification probability of each training entity corpus relationship pair is obtained by processing the training statement matrix vector by using a convolutional neural network, until all entity corpuses in the test set are paired, thereby obtaining a plurality of test entity corpus relationship pairs, where the test set is composed of a plurality of entity corpuses;
a third obtaining unit, configured to obtain each test statement matrix vector corresponding to each test entity corpus relationship pair;
a fourth obtaining unit, configured to process the test statement matrix vectors by using the convolutional neural network, and obtain a classification probability of the corpus relationship pair of each test entity;
and the output unit is used for outputting the classification probability of each corpus relationship pair of the test entities, so that a user can judge whether the convolutional neural network needs to be trained again based on the classification probability.
Optionally, the third obtaining unit specifically includes:
a third obtaining subunit, configured to obtain a second set of word vectors corresponding to all words forming the test set, where each entity corpus in the test set is formed by a plurality of words;
and a fourth obtaining subunit, configured to obtain, based on the second set, a test statement matrix vector of each test entity corpus relationship pair, where the test statement matrix vector is formed by a plurality of word vectors.
Optionally, the fourth obtaining unit specifically includes:
the second operation subunit is used for performing convolution operation on the matrix vectors of the test statements to acquire test characteristic information corresponding to the corpus relationship pair of the test entity;
the second sampling subunit is used for sampling and processing each test characteristic information to obtain a plurality of test optimal characteristics of each test entity corpus pair;
the second merging subunit is used for merging the test optimal features to obtain the test local optimal features of each test entity corpus pair;
and the second classification subunit is used for processing the local optimal characteristics of each test by using a Softmax model and acquiring the test classification probability of each test entity corpus pair.
Optionally, the preset entity corpus is a preset tax entity corpus.
Referring to fig. 7, based on the same inventive concept, a third embodiment of the present invention provides an apparatus for determining similarity between entity corpuses, including:
at least one processor 701, and a memory 702 coupled to the at least one processor;
wherein the memory 702 stores instructions executable by the at least one processor 701, and the at least one processor 701 executes the steps of the method as described in the above method embodiments by executing the instructions stored by the memory 702.
Optionally, the processor 701 may specifically include a Central Processing Unit (CPU) or an Application-Specific Integrated Circuit (ASIC); it may be one or more integrated circuits for controlling program execution, a hardware circuit developed using a Field-Programmable Gate Array (FPGA), or a baseband processor.
Optionally, processor 701 may include at least one processing core.
Optionally, the apparatus further includes a memory 702, and the memory 702 may include a Read Only Memory (ROM), a Random Access Memory (RAM), and a disk memory. The memory 702 is used for storing data required by the processor 701 in operation.
Based on the same inventive concept, a fourth embodiment of the present invention provides a computer-readable storage medium, including:
the computer-readable storage medium has stored thereon computer instructions which, when executed by at least one processor of the training apparatus, implement the method as described in the above-mentioned method embodiments.
The technical scheme in the embodiment of the invention at least has the following technical effects or advantages:
in the invention, the device for determining the similarity between entity corpora performs a method for determining the similarity between entity corpora: it randomly extracts a training set from a preset entity corpus, pairs any entity corpus in the training set with each entity corpus except that entity corpus until all entity corpora in the training set are paired, thereby obtaining a plurality of training entity corpus relationship pairs, obtains the training sentence matrix vector corresponding to each training entity corpus relationship pair, and processes each training sentence matrix vector using a convolutional neural network to obtain the training classification probability of each training entity corpus relationship pair. This completes the learning process of the convolutional neural network on the preset entity corpus, so that an intelligent customer service system using the convolutional neural network and the preset entity corpus can provide users with an accurate answer-search function. The invention thereby solves the technical problem in the prior art that, because the information input by a user is inaccurate, the intelligent customer service system cannot find the correct answer in its knowledge base, which degrades user experience, and achieves the technical effect of improving user experience.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (18)

1. A method for determining similarity between entity corpuses, comprising:
randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
matching any entity corpus in the training set with each entity corpus except the entity corpus until all the entity corpuses in the training set are matched, thereby obtaining a plurality of training entity corpus relation pairs;
acquiring each training statement matrix vector corresponding to each training entity corpus relationship pair;
processing the training statement matrix vectors by using a convolutional neural network to obtain the training classification probability of the training entity corpus relationship pairs;
and determining the similarity between the training entity corpora in the training entity corpus relationship pair based on the training classification probability.
2. The method according to claim 1, wherein the obtaining of each training sentence matrix vector corresponding to each training entity corpus relationship pair specifically comprises:
acquiring a first set of word vectors corresponding to all words forming the training set, wherein each entity corpus in the training set is formed by a plurality of words;
and acquiring a training sentence matrix vector of each training entity corpus relationship pair based on the first set, wherein the training sentence matrix vector is composed of a plurality of word vectors.
3. The method according to claim 1 or 2, wherein the processing the matrix vector of each training sentence by using a convolutional neural network to obtain the training classification probability of each corpus relationship pair of the training entities specifically comprises:
performing convolution operation on each training statement matrix vector to acquire training characteristic information corresponding to the training entity corpus relationship pair;
sampling each training characteristic information to obtain a plurality of training optimal characteristics of each training entity corpus pair;
combining the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and processing each training local optimal characteristic by using a Softmax model to obtain the training classification probability of each training entity corpus pair.
4. The method of claim 1 or 2, wherein the randomly extracting training sets from a predetermined corpus of entities further comprises:
extracting a test set from a preset entity corpus by using a random extraction algorithm; the union of the training set and the test set is the preset entity corpus, and the training set and the test set have no intersection.
5. The method of claim 4, wherein after said processing each of said training sentence matrix vectors using a convolutional neural network to obtain a test classification probability for each of said training entity corpus relationship pairs, said method further comprises:
pairing any entity corpus in the test set with each entity corpus except the entity corpus until all the entity corpuses in the test set are paired, so as to obtain a plurality of test entity corpus relation pairs, wherein the test set consists of a plurality of entity corpuses;
obtaining each test statement matrix vector corresponding to each test entity corpus relationship pair;
processing the matrix vector of each test statement by using the convolutional neural network to obtain the classification probability of the corpus relationship pair of each test entity;
and outputting the classification probability of each corpus relationship pair of the test entities, so that a user can judge whether the convolutional neural network needs to be trained again based on the classification probability.
6. The method according to claim 5, wherein said obtaining each test statement matrix vector corresponding to each test entity corpus relationship pair specifically comprises:
acquiring a second set of word vectors corresponding to all words forming the test set, wherein each entity corpus in the test set is formed by a plurality of words;
and acquiring a test statement matrix vector of each test entity corpus relationship pair based on the second set, wherein the test statement matrix vector is composed of a plurality of word vectors.
7. The method according to claim 5 or 6, wherein the processing of the matrix vector of each test statement by using the convolutional neural network to obtain the test classification probability of each test entity corpus relationship pair specifically comprises:
performing convolution operation on each test statement matrix vector to acquire test characteristic information corresponding to the test entity corpus relationship pair;
sampling each test characteristic information to obtain a plurality of test optimal characteristics of each test entity corpus pair;
merging the test optimal features to obtain test local optimal features of the corpus pairs of the test entities;
and processing the local optimal characteristics of each test by using a Softmax model to obtain the test classification probability of each test entity corpus pair.
8. The method of claim 1, 2, 5, or 6, wherein the predetermined corpus of entities is a predetermined taxation corpus of entities.
9. An apparatus for determining similarity between entity corpuses, comprising:
the extraction unit is used for randomly extracting a training set from a preset entity corpus, wherein the training set is composed of a plurality of entity corpora;
a first matching unit, configured to match any entity corpus in the training set with each entity corpus except the entity corpus until all entity corpuses in the training set are matched, so as to obtain a plurality of training entity corpus relationship pairs;
the first acquisition unit is used for acquiring each training statement matrix vector corresponding to each training entity corpus relationship pair;
a second obtaining unit, configured to process the matrix vectors of the training sentences by using a convolutional neural network, and obtain a training classification probability of each training entity corpus relationship pair;
and the determining unit is used for determining the similarity between the training entity corpora in the training entity corpus relationship pair based on the training classification probability.
10. The apparatus of claim 9, wherein the first obtaining unit specifically comprises:
a first obtaining subunit, configured to obtain a first set of word vectors corresponding to all words forming the training set, where each entity corpus in the training set is formed by a plurality of words;
and a second obtaining subunit, configured to obtain a training sentence matrix vector of each training entity corpus relationship pair based on the first set, where the training sentence matrix vector is formed by a plurality of word vectors.
11. The apparatus according to claim 9 or 10, wherein the second obtaining unit specifically includes:
the first operation subunit is used for performing convolution operation on the matrix vectors of the training sentences to acquire training characteristic information corresponding to the training entity corpus relationship pairs;
the first sampling subunit is used for sampling and processing each training characteristic information to obtain a plurality of training optimal characteristics of each training entity corpus pair;
the first merging subunit is used for merging the training optimal features to obtain training local optimal features of the training entity corpus pairs;
and the first classification subunit is used for processing each training local optimal feature by using a Softmax model and acquiring the training classification probability of each training entity corpus pair.
12. The apparatus of claim 9 or 10, wherein the extraction unit is further configured to:
extracting a test set from a preset entity corpus by using a random extraction algorithm; the union of the training set and the test set is the preset entity corpus, and the training set and the test set have no intersection.
13. The apparatus of claim 12, wherein the apparatus further comprises:
a second pairing unit, configured to pair any entity corpus in the test set with each entity corpus except the entity corpus after the test classification probability of each training entity corpus relationship pair is obtained by processing the training statement matrix vector by using a convolutional neural network, until all entity corpuses in the test set are paired, thereby obtaining a plurality of test entity corpus relationship pairs, where the test set is composed of a plurality of entity corpuses;
a third obtaining unit, configured to obtain each test statement matrix vector corresponding to each test entity corpus relationship pair;
a fourth obtaining unit, configured to process the test statement matrix vectors by using the convolutional neural network, and obtain a classification probability of the corpus relationship pair of each test entity;
and the output unit is used for outputting the classification probability of each corpus relationship pair of the test entities, so that a user can judge whether the convolutional neural network needs to be trained again based on the classification probability.
14. The apparatus according to claim 13, wherein the third obtaining unit specifically includes:
a third obtaining subunit, configured to obtain a second set of word vectors corresponding to all words forming the test set, where each entity corpus in the test set is formed by a plurality of words;
and a fourth obtaining subunit, configured to obtain, based on the second set, a test statement matrix vector of each test entity corpus relationship pair, where the test statement matrix vector is formed by a plurality of word vectors.
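Building a test statement matrix vector from the word vectors of the second set, as the third and fourth obtaining subunits describe, amounts to stacking the word vectors of both entity corpora row by row. A sketch assuming a plain dict vocabulary, whitespace tokenization and a zero-vector fallback for unknown words (none of which the patent specifies):

```python
import numpy as np

def sentence_matrix(entity_corpus_pair, word_vectors, dim=8):
    """Stack the word vectors of the words forming both entity corpora in a
    relationship pair into one statement matrix vector
    (rows = words, columns = embedding dimensions)."""
    words = entity_corpus_pair[0].split() + entity_corpus_pair[1].split()
    return np.stack([word_vectors.get(w, np.zeros(dim)) for w in words])
```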
15. The apparatus according to claim 13 or 14, wherein the fourth obtaining unit specifically includes:
the second operation subunit is used for performing a convolution operation on each test statement matrix vector to obtain the test feature information corresponding to each test entity corpus relationship pair;
the second sampling subunit is used for sampling each piece of test feature information to obtain a plurality of test optimal features of each test entity corpus relationship pair;
the second merging subunit is used for merging the test optimal features to obtain the test local optimal feature of each test entity corpus relationship pair;
and the second classification subunit is used for processing each test local optimal feature by using a Softmax model to obtain the test classification probability of each test entity corpus relationship pair.
16. The apparatus of claim 9, 10, 13 or 14, wherein the preset entity corpus is a preset taxation entity corpus.
17. An apparatus for determining similarity between entity corpora, comprising:
at least one processor, and a memory coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, and the at least one processor performs the method of any one of claims 1-8 by executing the instructions stored in the memory.
18. A computer-readable storage medium, comprising:
the computer-readable storage medium having stored thereon computer instructions which, when executed by at least one processor of the apparatus for determining similarity between entity corpora, implement the method according to any one of claims 1-8.
CN201811151935.4A 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora Active CN110969005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811151935.4A CN110969005B (en) 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora

Publications (2)

Publication Number Publication Date
CN110969005A true CN110969005A (en) 2020-04-07
CN110969005B CN110969005B (en) 2023-10-31

Family

ID=70027498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811151935.4A Active CN110969005B (en) 2018-09-29 2018-09-29 Method and device for determining similarity between entity corpora

Country Status (1)

Country Link
CN (1) CN110969005B (en)


Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120158687A1 (en) * 2010-12-17 2012-06-21 Yahoo! Inc. Display entity relationship
US20160148116A1 (en) * 2014-11-21 2016-05-26 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
CN106484675A (en) * 2016-09-29 2017-03-08 北京理工大学 Fusion distributed semantic and the character relation abstracting method of sentence justice feature
CN107038480A (en) * 2017-05-12 2017-08-11 东华大学 A kind of text sentiment classification method based on convolutional neural networks
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 A kind of method of business entity's Relation extraction based on convolutional neural networks
CN108021555A (en) * 2017-11-21 2018-05-11 浪潮金融信息技术有限公司 A kind of Question sentence parsing measure based on depth convolutional neural networks
US20180157643A1 (en) * 2016-12-06 2018-06-07 Siemens Aktiengesellschaft Device and method for natural language processing
CN108292310A (en) * 2015-11-05 2018-07-17 微软技术许可有限责任公司 For the relevant technology of digital entities

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MATTHEW FRANCIS-LANDAU ET AL: "Capturing Semantic Similarity for Entity Linking with Convolutional Neural Networks", ARXIV, pages 1 - 7 *
LIU KAI; FU HAIDONG; ZOU YUWEI; GU JINGUANG: "Weakly supervised relation extraction for Chinese medical text based on convolutional neural networks", Computer Science, no. 10, pages 254 - 258 *
WEI YONG: "Text classification method combining associative semantics with convolutional neural networks", Control Engineering of China, no. 02, pages 187 - 190 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101041A (en) * 2020-09-08 2020-12-18 平安科技(深圳)有限公司 Entity relationship extraction method, device, equipment and medium based on semantic similarity
WO2021121198A1 (en) * 2020-09-08 2021-06-24 平安科技(深圳)有限公司 Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN112487201A (en) * 2020-11-26 2021-03-12 西北工业大学 Knowledge graph representation method using shared parameter convolutional neural network
CN113051900A (en) * 2021-04-30 2021-06-29 中国平安人寿保险股份有限公司 Synonym recognition method and device, computer equipment and storage medium
CN113051900B (en) * 2021-04-30 2023-08-22 中国平安人寿保险股份有限公司 Synonym recognition method, synonym recognition device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110969005B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
US11403680B2 (en) Method, apparatus for evaluating review, device and storage medium
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN111274394A (en) Method, device and equipment for extracting entity relationship and storage medium
CN111177326A (en) Key information extraction method and device based on fine labeling text and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
KR20200127020A (en) Computer-readable storage medium storing method, apparatus and instructions for matching semantic text data with tags
CN107247751B (en) LDA topic model-based content recommendation method
CN110969005B (en) Method and device for determining similarity between entity corpora
CN111125295A (en) Method and system for obtaining food safety question answers based on LSTM
CN112101042A (en) Text emotion recognition method and device, terminal device and storage medium
CN111339772B (en) Russian text emotion analysis method, electronic device and storage medium
CN110852071B (en) Knowledge point detection method, device, equipment and readable storage medium
CN107797981B (en) Target text recognition method and device
CN113255331B (en) Text error correction method, device and storage medium
CN114443846A (en) Classification method and device based on multi-level text abnormal composition and electronic equipment
CN107783958B (en) Target statement identification method and device
CN110334204B (en) Exercise similarity calculation recommendation method based on user records
CN111813941A (en) Text classification method, device, equipment and medium combining RPA and AI
CN115292492A (en) Method, device and equipment for training intention classification model and storage medium
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN113886521A (en) Text relation automatic labeling method based on similar vocabulary
CN116186529A (en) Training method and device for semantic understanding model
CN116842168B (en) Cross-domain problem processing method and device, electronic equipment and storage medium
CN117453895B (en) Intelligent customer service response method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant