CN112115723B - Weak supervision semantic analysis method based on false positive sample detection - Google Patents

Weak supervision semantic analysis method based on false positive sample detection

Info

Publication number
CN112115723B
CN112115723B
Authority
CN
China
Prior art keywords
corpus
training
false positive
formal
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010961187.7A
Other languages
Chinese (zh)
Other versions
CN112115723A (en)
Inventor
刘俊涛
张毅
王振杰
王军伟
高子文
王军利
周莹
杨向广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
709th Research Institute of CSIC
Original Assignee
709th Research Institute of CSIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 709th Research Institute of CSIC
Priority to CN202010961187.7A
Publication of CN112115723A
Application granted
Publication of CN112115723B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

To address the weakly supervised semantic parsing problem, the invention provides a weakly supervised semantic parsing method based on false positive sample detection. The method converts an operation described in natural language into a formal semantic description, for example converting a data retrieval requirement described in natural language into an SQL statement to be executed in a database management system and returning the retrieval result. The method detects the presence of false positive samples by measuring the return consistency of similar training samples, defines an objective function containing a return-consistency measure, and optimizes this objective function by stochastic gradient descent to obtain a semantic parsing model. The method improves the accuracy of semantic parsing and has better generalization capability.

Description

Weak supervision semantic analysis method based on false positive sample detection
Technical Field
The invention belongs to the technical field of semantic parsing, and particularly relates to a weakly supervised semantic parsing method based on false positive sample detection.
Background
The semantic parsing problem transforms a given natural language utterance into some formal description. The formal description may be an expression based on a certain syntax, such as a λ-calculus expression; a tree or graph structure, such as a semantic parse tree; or a programming language, such as SQL or Python.
A general semantic parsing model is obtained by learning from training samples in a supervised manner, where the training samples are {(x_i, z_i)}, in which x_i and z_i are, respectively, a natural language utterance and its corresponding formal semantic description. With the recent rise of natural language processing methods based on deep neural networks, the supervised semantic parsing problem is generally cast as a sequence-to-sequence translation problem.
The training samples of the weakly supervised semantic parsing problem are different: each sample is a natural language corpus together with its result in a certain execution environment, i.e. {(x_i, c_i, y_i)}, in which x_i is a natural language corpus and c_i and y_i are the corresponding execution environment and result. That is, assuming the formal representation of x_i is z_i, executing z_i in the environment c_i yields y_i, i.e. c_i(z_i) = y_i. Note that the semantic representation z_i of x_i is not present in the training samples; it is implicit. The weakly supervised semantic parsing problem is to learn, from the training samples {(x_i, c_i, y_i)}, the formal description z of a given natural language utterance x.
The training samples for weakly supervised semantic parsing thus do not contain the correct formal description z of the natural language x. More than one semantic representation can usually yield the same result; for example, for the same data table, multiple SQL statements may produce the same retrieval result, yet only a very small fraction of these representations correctly express the true meaning of the given natural language corpus.
A typical application scenario of semantic parsing is to convert data retrieval requirements described by natural language into SQL statements for execution in a database management system and return the retrieval results.
Formal representations (programs) that yield the desired execution result but do not correctly represent the semantic meaning of the corpus are called false positive (spurious) samples. The presence of false positive samples, especially when they appear early in learning, steers learning in the wrong direction and makes the learned model generalize poorly. One challenge of weakly supervised semantic parsing is therefore how to identify false positive samples, and it is necessary to provide a weakly supervised semantic parsing method based on false positive sample detection.
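As a concrete illustration of a false positive sample in the text-to-SQL setting, the following toy sketch (constructed for this description rather than taken from the patent) uses Python's sqlite3 module to show two different SQL programs, only one of which reflects the meaning of the question, returning identical execution results on a particular database:

```python
# Toy illustration: two different SQL programs that return the same result on
# one particular database, so execution feedback alone cannot tell the correct
# parse from the spurious (false positive) one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Wuhan", "China", 11000000),
     ("Paris", "France", 2100000),
     ("Lyon", "France", 500000)],
)

# Corpus x: "Which city in China has the largest population?"
correct_sql = ("SELECT name FROM city WHERE country = 'China' "
               "ORDER BY population DESC LIMIT 1")
# Spurious program: drops the country condition but happens to give the same answer here.
spurious_sql = "SELECT name FROM city ORDER BY population DESC LIMIT 1"

y_correct = conn.execute(correct_sql).fetchall()
y_spurious = conn.execute(spurious_sql).fetchall()
print(y_correct, y_spurious, y_correct == y_spurious)  # both [('Wuhan',)] -> True
```

Because the execution results alone cannot distinguish the two programs, an additional signal, such as the return consistency of similar corpora exploited by the present method, is needed to detect such samples.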
Disclosure of Invention
The invention provides a weakly supervised semantic parsing technique that accounts for false positive samples. Since the formal representation of a corpus is generated from the corpus itself, similar corpora will generate similar formal representations. If false positive samples exist among the representations generated from a corpus x, then the formal representations generated in the same way for a corpus x' similar to x should contain a large number of errors. Thus, measuring the consistency of the execution results of representations generated from similar corpora can indicate the presence of false positive samples. Based on this idea, the invention provides a weakly supervised semantic parsing method that accounts for false positive samples, which improves the accuracy of semantic parsing and increases its generalization capability.
In order to achieve the above object, the present invention provides a weakly supervised semantic parsing method based on false positive sample detection, which comprises:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ, including:
(1.1) randomly taking one training sample from the training sample set S, denoted (x_i, c_i, y_i), wherein x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i;
(1.2) sampling multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
(1.3) computing the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i) of the formal representations of x_i;
(1.4) computing the return-consistency metric k_i of the corpus x_i;
(1.5) computing the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i;
(1.6) updating the parameter θ using the objective-function gradient;
(1.7) if the objective function has converged, ending the training; otherwise, returning to step (1.1);
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
In one embodiment of the present invention, the objective function in step (1.7) combines the expected return G_θ(x_i, y_i) of each training sample with its return-consistency metric k_i [formula shown as an image in the original].
In one embodiment of the present invention, G_θ(x_i, y_i) in step (1.3) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
where R(z) is the return obtained by executing z in the environment c, P is the set of all possible formal representations, and z ∈ P is one of the possible formal representations.
In one embodiment of the invention, a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
where c(z) denotes the result of executing the formal representation z of the corpus in the environment c.
In one embodiment of the present invention, k_i in step (1.4) is the return-consistency metric of the corpus x_i, defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original], where s(x_i, x_j) is the similarity of the two corpora x_i, x_j.
In one embodiment of the present invention, the similarity s(x_i, x_j) of two corpora x_i, x_j is:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k},
where the corpora x_i, x_j contain n_i and n_j words respectively, and v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) are the pre-trained word vectors of these words.
In an embodiment of the present invention, step (1.5) computes the gradient of the objective function with respect to θ from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original].
In an embodiment of the present invention, step (1.6) updates the parameter θ by a stochastic gradient step using the objective-function gradient computed in step (1.5) [update formula shown as an image in the original], where λ is the learning rate.
In one embodiment of the invention, the corpus similarity is calculated in the preprocessing, and is not changed in the training process.
In one embodiment of the present invention, c_i is a database, x_i is a retrieval requirement expressed in natural language, z_i is an SQL statement satisfying x_i, and y_i is the result of executing the SQL statement z_i on the database c_i; the training sample does not contain the correct semantic representation z_i of x_i.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects: the method mitigates the influence of false positive samples during training, so the semantic parsing model obtained by training produces more accurate parsing results; comparative experiments against existing methods on public data sets show that the accuracy of the semantic parsing results of the method is improved by about 6%.
Drawings
FIG. 1 is a flowchart of the weakly supervised semantic parsing method based on false positive sample detection according to an embodiment of the present invention;
FIG. 2 illustrates the method for training the formal representation model of the corpus by stochastic gradient descent according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the present invention provides a weakly supervised semantic parsing method based on false positive sample detection, which includes:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ;
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
The key technical points of the technical scheme of the invention are described as follows:
Compute the similarity s(x_i, x_j) of every pair of corpora x_i, x_j in the training sample set S = {(x_i, c_i, y_i)}. Suppose the corpora x_i, x_j contain n_i and n_j words respectively, and let v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) be the pre-trained word vectors (e.g., GloVe) of these words. The similarity of the corpora x_i, x_j is defined as:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k};
the corpus similarities are computed once during preprocessing and are not changed during training.
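A minimal sketch of this preprocessing step is given below, assuming each corpus has already been tokenized and that a dictionary of pre-trained word vectors (e.g., GloVe) is available; representing a corpus by the mean of its word vectors follows the definition above, while the function and variable names are illustrative.

```python
# Sketch of the corpus-similarity preprocessing: each corpus is represented by
# the mean of its pre-trained word vectors, and s(x_i, x_j) is the L2 distance
# between these corpus vectors, computed once before training starts.
import numpy as np

def corpus_vector(tokens, word_vectors, dim=300):
    """Mean of the pre-trained vectors of the words in one corpus."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:                      # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def pairwise_similarity(corpora, word_vectors, dim=300):
    """s(x_i, x_j) = ||v_i - v_j||_2 for every pair of corpora."""
    vs = np.stack([corpus_vector(toks, word_vectors, dim) for toks in corpora])
    diff = vs[:, None, :] - vs[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # (N, N) matrix, fixed during training
```

Note that s(x_i, x_j) as defined is an L2 distance, so smaller values correspond to more similar corpora.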
In the training sample set S = {(x_i, c_i, y_i)}, x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i. Typically, c_i may be a database, x_i a retrieval requirement expressed in natural language, z_i an SQL statement satisfying x_i, and y_i the result of executing the SQL statement z_i on the database c_i; the training samples do not contain the correct semantic representation z_i of x_i.
The objective function of the weakly supervised semantic parsing problem based on false positive sample detection combines, for each training sample, the expected return G_θ(x_i, y_i) with the return-consistency metric k_i [formula shown as an image in the original].
Here G_θ(x_i, y_i) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
where P is the set of all possible formal representations and z ∈ P is one of the possible formal representations.
The return function R(z) is the return obtained by executing z in the environment c: a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
where c(z) denotes the result of executing the formal representation z of the corpus in the environment c.
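In the NL-to-SQL setting used throughout this description, this 0/1 return can be computed by executing the candidate program and comparing its output with the annotated result. The sketch below assumes the environment c is a database connection and the formal representation z is an SQL string; the helper name is illustrative.

```python
# 0/1 return: execute the candidate formal representation z (an SQL string)
# in the environment c (a database connection) and compare with the expected
# result y from the training sample.
def reward(z_sql, c_conn, y_expected):
    try:
        result = c_conn.execute(z_sql).fetchall()  # c(z)
    except Exception:                              # ill-formed programs earn no return
        return 0.0
    return 1.0 if result == y_expected else 0.0
```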
k_i is the return-consistency metric of the corpus x_i, defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original].
formalized representation model p for training corpus by adopting stochastic gradient descent method θ (z|x i ) Determining a parameter theta, wherein the training process is as follows:
(1) randomly taking one training sample in the training sample set S and recording as (x) i ,c i ,y i ),
(2) According to p θ (z|x i ) Sampling multiple possible formalized representations z i,k },
(3) Calculating sampled x i Is expressed in a formal representation z i,k Mean value of the returns of x i Is expressed in a formal manner θ (x i ,y i )
Figure BDA0002680613040000063
(4) Calculating corpus x i Is reported consistent metric k i
(5) According to the above (x) i ,c i ,y i )、{z i,k }、G θ (x i ,y i ) Calculating the objective function gradient:
Figure BDA0002680613040000064
(6) the parameters theta are updated in such a way that,
Figure BDA0002680613040000065
(7) if the objective function converges and ends the training, otherwise, go back to (1)
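For illustration, the sketch below assembles steps (1) to (7) into a single training loop. The model interface (sample_k, log_prob), the use of a standard torch optimizer, the concrete form of the return-consistency term, and the REINFORCE-style surrogate loss are assumptions (the patent gives the objective function, its gradient, and k_i only as equation images), so the code is an interpretation of the procedure rather than a verbatim implementation.

```python
# Illustrative training loop for steps (1)-(7). Assumed: a seq2seq policy
# `model` exposing sample_k(x, K) and log_prob(z, x), a torch optimizer, the
# reward() helper sketched earlier, and a similarity-based down-weighting as
# one possible realization of the return-consistency term k_i.
import random
import torch

def train(model, optimizer, samples, dist, num_candidates=8, alpha=1.0, max_steps=100000):
    """samples: list of (x_i, c_i, y_i); dist: precomputed matrix s(x_i, x_j)."""
    n = len(samples)
    G = [0.0] * n                                   # expected returns, initialized to 0
    for step in range(max_steps):
        i = random.randrange(n)                     # step (1): draw one training sample
        x_i, c_i, y_i = samples[i]

        z_list = model.sample_k(x_i, num_candidates)        # step (2): sample {z_{i,k}}
        rewards = [reward(z, c_i, y_i) for z in z_list]     # R(z_{i,k})
        G[i] = sum(rewards) / len(rewards)                  # step (3): Monte Carlo estimate

        # step (4), assumed form: k_i penalizes disagreement in return with
        # similar corpora (small distance -> large weight).
        k_i = sum(abs(G[i] - G[j]) / (1.0 + dist[i][j]) for j in range(n) if j != i)

        # steps (5)-(6), assumed REINFORCE-style surrogate: scale the policy-
        # gradient update down when the return-consistency penalty k_i is large.
        scale = max(0.0, 1.0 - alpha * k_i)
        loss = torch.zeros(())
        for z, r in zip(z_list, rewards):
            loss = loss - scale * (r - G[i]) * model.log_prob(z, x_i)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()          # gradient step with the optimizer's learning rate
        # step (7): a convergence check on the objective would terminate the loop here.
```

Scaling the update down when similar corpora disagree in their returns is one way to realize the return-consistency idea described above; the patent's exact objective may differ.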
As shown in FIG. 2, the present invention provides a method for training the formal representation model of the corpus by stochastic gradient descent, comprising:
Step 1: initialize the return expectations
The expected return of every training sample is set to 0: G_θ(x_i, y_i) ← 0 (i = 1, 2, ...).
Step 2: compute the similarity of the training samples
Compute the similarity s(x_i, x_j) of every pair of corpora x_i, x_j in the training sample set S = {(x_i, c_i, y_i)}. Suppose the corpora x_i, x_j contain n_i and n_j words respectively, and let v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) be the pre-trained word vectors (e.g., GloVe) of these words. The similarity of the corpora x_i, x_j is then:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k}.
Step 3: train the formal representation model p_θ(z|x_i)
The formal representation model p_θ(z|x_i) of the corpus is trained by stochastic gradient descent to determine its parameter θ; the training process is as follows:
Step 3-1: randomly take one training sample from the training sample set S, denoted (x_i, c_i, y_i);
Step 3-2: sample multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
Step 3-3: compute the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i):
G_θ(x_i, y_i) ≈ (1/K) Σ_{k=1}^{K} R(z_{i,k}), where K is the number of sampled representations;
Step 3-4: compute the return-consistency metric k_i of the corpus x_i [formula shown as an image in the original];
Step 3-5: compute the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original];
Step 3-6: update the parameter θ by a stochastic gradient step with learning rate λ using the objective-function gradient;
Step 3-7: if the objective function has converged, end the training; otherwise, return to Step 3-1.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A weakly supervised semantic parsing method based on false positive sample detection, characterized by comprising the following steps:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ, including:
(1.1) randomly taking one training sample from the training sample set S, denoted (x_i, c_i, y_i), wherein x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i;
(1.2) sampling multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
(1.3) computing the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i) of the formal representations of x_i, wherein G_θ(x_i, y_i) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
wherein R(z) is the return obtained by executing z in the environment c, P is the set of all possible formal representations, and z ∈ P is one of the possible formal representations; a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
wherein c(z) denotes the result of executing the formal representation z of the corpus in the environment c;
(1.4) computing the return-consistency metric k_i of the corpus x_i, wherein k_i is defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original], and s(x_i, x_j) is the similarity of the two corpora x_i, x_j;
(1.5) computing the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original];
(1.6) updating the parameter θ using the objective-function gradient;
(1.7) if the objective function has converged, ending the training; otherwise, returning to step (1.1), wherein the objective function combines the expected return G_θ(x_i, y_i) of each training sample with its return-consistency metric k_i [formula shown as an image in the original];
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
2. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein the similarity s(x_i, x_j) of two corpora x_i, x_j is:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k},
wherein the corpora x_i, x_j contain n_i and n_j words respectively, and v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) are the pre-trained word vectors of these words.
3. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein step (1.6) specifically updates the parameter θ by a stochastic gradient step using the objective-function gradient [update formula shown as an image in the original], where λ is the learning rate.
4. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein the corpus similarities are calculated during preprocessing and are not changed during training.
5. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein c_i is a database, x_i is a retrieval requirement expressed in natural language, z_i is an SQL statement satisfying x_i, and y_i is the result of executing the SQL statement z_i on the database c_i; the training sample does not contain the correct semantic representation z_i of x_i.
CN202010961187.7A 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection Active CN112115723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010961187.7A CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010961187.7A CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Publications (2)

Publication Number Publication Date
CN112115723A CN112115723A (en) 2020-12-22
CN112115723B true CN112115723B (en) 2022-08-12

Family

ID=73803017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961187.7A Active CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Country Status (1)

Country Link
CN (1) CN112115723B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108283809A (en) * 2018-02-11 2018-07-17 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN109740588A (en) * 2018-12-24 2019-05-10 中国科学院大学 The X-ray picture contraband localization method reassigned based on the response of Weakly supervised and depth
CN110598609A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Weak supervision target detection method based on significance guidance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3769264A1 (en) * 2018-05-18 2021-01-27 Deepmind Technologies Limited Meta-gradient updates for training return functions for reinforcement learning systems
CN111382253B (en) * 2020-03-02 2022-07-15 思必驰科技股份有限公司 Semantic parsing method and semantic parser

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108283809A (en) * 2018-02-11 2018-07-17 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN109740588A (en) * 2018-12-24 2019-05-10 中国科学院大学 The X-ray picture contraband localization method reassigned based on the response of Weakly supervised and depth
CN110598609A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Weak supervision target detection method based on significance guidance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
William Duffy et al. User-trained activity recognition using smartphones and weak supervision. 2019 30th Irish Signals and Systems Conference (ISSC), 2019, pp. 1-5. *
付治 et al. A weakly supervised learning framework based on k labeled samples. 软件学报 (Journal of Software), 2020, vol. 31, no. 4, pp. 981-990. *

Also Published As

Publication number Publication date
CN112115723A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
Chen et al. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing
US20210117625A1 (en) Semantic parsing of natural language query
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
US20220129450A1 (en) System and method for transferable natural language interface
CN111782807B (en) Self-bearing technology debt detection classification method based on multiparty integrated learning
CN114168740A (en) Transformer concurrent fault diagnosis method based on graph convolution neural network and knowledge graph
Sharath et al. Question answering over knowledge base using language model embeddings
CN116861269A (en) Multi-source heterogeneous data fusion and analysis method in engineering field
Balaji et al. Text summarization using NLP technique
CN106448660A (en) Natural language fuzzy boundary determining method with introduction of big data analysis
CN111581365B (en) Predicate extraction method
CN112115723B (en) Weak supervision semantic analysis method based on false positive sample detection
WO2021160822A1 (en) A method for linking a cve with at least one synthetic cpe
CN116720498A (en) Training method and device for text similarity detection model and related medium thereof
Zhang et al. Dependency-aware form understanding
CN111767388B (en) Candidate pool generation method
Avram et al. UPB at SemEval-2021 task 8: extracting semantic information on measurements as multi-turn question answering
CN114579763A (en) Character-level confrontation sample generation method for Chinese text classification task
CN116881471B (en) Knowledge graph-based large language model fine tuning method and device
CN115295134B (en) Medical model evaluation method and device and electronic equipment
CN117520567B (en) Knowledge graph-based large language model training method
JP7384354B2 (en) Information processing device, information processing method and program
Haque et al. Label Smoothing Improves Neural Source Code Summarization
Fan et al. Zero-Shot Event Detection Based on Prompt and Deep Prototype Clustering
CN116628199A (en) Weak supervision text classification method and system with enhanced label semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant