CN112115723B - Weak supervision semantic analysis method based on false positive sample detection - Google Patents

Weak supervision semantic analysis method based on false positive sample detection

Info

Publication number
CN112115723B
CN112115723B
Authority
CN
China
Prior art keywords
corpus
training
false positive
formal
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010961187.7A
Other languages
Chinese (zh)
Other versions
CN112115723A (en)
Inventor
刘俊涛
张毅
王振杰
王军伟
高子文
王军利
周莹
杨向广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
709th Research Institute of CSIC
Original Assignee
709th Research Institute of CSIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 709th Research Institute of CSIC
Priority to CN202010961187.7A
Publication of CN112115723A
Application granted
Publication of CN112115723B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

To address the weakly supervised semantic parsing problem, the invention provides a weakly supervised semantic parsing method based on false positive sample detection. The method converts an operation described in natural language into a formal semantic description, for example converting a data retrieval requirement described in natural language into an SQL statement to be executed in a database management system and returning the retrieval result. The method detects the presence of false positive samples by measuring the return consistency of similar training samples, defines an objective function containing a return-consistency measure, and optimizes this objective function by stochastic gradient descent to obtain a semantic parsing model. The method improves the accuracy of semantic parsing and has better generalization capability.

Description

Weak supervision semantic analysis method based on false positive sample detection
Technical Field
The invention belongs to the technical field of semantic parsing, and particularly relates to a weakly supervised semantic parsing method based on false positive sample detection.
Background
The semantic parsing problem transforms a given natural language utterance into some formal description. The formal description may be an expression based on a certain syntax, such as a λ-calculus expression; a tree or graph structure, such as a semantic parse tree; or a programming language, such as SQL or Python.
A general semantic parsing model is obtained by learning from training samples in a supervised manner, where the training samples are {(x_i, z_i)}, in which x_i and z_i are, respectively, a natural language utterance and its corresponding formal semantic description. With the recent rise of natural language processing methods based on deep neural networks, the supervised semantic parsing problem is generally cast as a sequence-to-sequence translation problem.
The training samples of the weakly supervised semantic parsing problem are different: each sample is a natural language corpus together with its result in a certain execution environment, i.e. {(x_i, c_i, y_i)}, in which x_i is a natural language corpus and c_i and y_i are the corresponding execution environment and result. That is, assuming the formal representation of x_i is z_i, executing z_i in the environment c_i yields y_i, i.e. c_i(z_i) = y_i. Note that the semantic representation z_i of x_i is not present in the training samples; it is implicit. The weakly supervised semantic parsing problem is to learn, from the training samples {(x_i, c_i, y_i)}, the formal description z of a given natural language utterance x.
The training samples for weakly supervised semantic parsing thus do not contain the correct formal description z of the natural language x. More than one semantic representation can usually yield the same result; for example, for the same data table, multiple SQL statements may produce the same retrieval result, yet only a very small fraction of these representations correctly express the true meaning of the given natural language corpus.
A typical application scenario of semantic parsing is to convert data retrieval requirements described by natural language into SQL statements for execution in a database management system and return the retrieval results.
Formal representations (programs) that yield the desired execution result but do not correctly represent the semantic meaning of the corpus are called false positive (spurious) samples. The presence of false positive samples, especially when they appear early in learning, steers learning in the wrong direction and makes the learned model generalize poorly. One challenge of weakly supervised semantic parsing is therefore how to identify false positive samples, and it is necessary to provide a weakly supervised semantic parsing method based on false positive sample detection.
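As a concrete illustration of a false positive sample in the text-to-SQL setting, the following toy sketch (constructed for this description rather than taken from the patent) uses Python's sqlite3 module to show two different SQL programs, only one of which reflects the meaning of the question, returning identical execution results on a particular database:

```python
# Toy illustration: two different SQL programs that return the same result on
# one particular database, so execution feedback alone cannot tell the correct
# parse from the spurious (false positive) one.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO city VALUES (?, ?, ?)",
    [("Wuhan", "China", 11000000),
     ("Paris", "France", 2100000),
     ("Lyon", "France", 500000)],
)

# Corpus x: "Which city in China has the largest population?"
correct_sql = ("SELECT name FROM city WHERE country = 'China' "
               "ORDER BY population DESC LIMIT 1")
# Spurious program: drops the country condition but happens to give the same answer here.
spurious_sql = "SELECT name FROM city ORDER BY population DESC LIMIT 1"

y_correct = conn.execute(correct_sql).fetchall()
y_spurious = conn.execute(spurious_sql).fetchall()
print(y_correct, y_spurious, y_correct == y_spurious)  # both [('Wuhan',)] -> True
```

Because the execution results alone cannot distinguish the two programs, an additional signal, such as the return consistency of similar corpora exploited by the present method, is needed to detect such samples.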
Disclosure of Invention
The invention provides a weakly supervised semantic parsing technique that accounts for false positive samples. Since the formal representation of a corpus is generated from the corpus itself, similar corpora will generate similar formal representations. If false positive samples exist among the representations generated from a corpus x, then the formal representations generated in the same way for a corpus x' similar to x should contain a large number of errors. Thus, measuring the consistency of the execution results of representations generated from similar corpora can indicate the presence of false positive samples. Based on this idea, the invention provides a weakly supervised semantic parsing method that accounts for false positive samples, which improves the accuracy of semantic parsing and increases its generalization capability.
In order to achieve the above object, the present invention provides a weakly supervised semantic parsing method based on false positive sample detection, which comprises:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ, including:
(1.1) randomly taking one training sample from the training sample set S, denoted (x_i, c_i, y_i), wherein x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i;
(1.2) sampling multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
(1.3) computing the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i) of the formal representations of x_i;
(1.4) computing the return-consistency metric k_i of the corpus x_i;
(1.5) computing the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i;
(1.6) updating the parameter θ using the objective-function gradient;
(1.7) if the objective function has converged, ending the training; otherwise, returning to step (1.1);
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
In one embodiment of the present invention, the objective function in step (1.7) combines the expected return G_θ(x_i, y_i) of each training sample with its return-consistency metric k_i [formula shown as an image in the original].
In one embodiment of the present invention, G_θ(x_i, y_i) in step (1.3) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
where R(z) is the return obtained by executing z in the environment c, P is the set of all possible formal representations, and z ∈ P is one of the possible formal representations.
In one embodiment of the invention, a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
where c(z) denotes the result of executing the formal representation z of the corpus in the environment c.
In one embodiment of the present invention, k_i in step (1.4) is the return-consistency metric of the corpus x_i, defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original], where s(x_i, x_j) is the similarity of the two corpora x_i, x_j.
In one embodiment of the present invention, the similarity s(x_i, x_j) of two corpora x_i, x_j is:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k},
where the corpora x_i, x_j contain n_i and n_j words respectively, and v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) are the pre-trained word vectors of these words.
In an embodiment of the present invention, step (1.5) computes the gradient of the objective function with respect to θ from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original].
In an embodiment of the present invention, step (1.6) updates the parameter θ by a stochastic gradient step using the objective-function gradient computed in step (1.5) [update formula shown as an image in the original], where λ is the learning rate.
In one embodiment of the invention, the corpus similarity is calculated in the preprocessing, and is not changed in the training process.
In one embodiment of the present invention, c_i is a database, x_i is a retrieval requirement expressed in natural language, z_i is an SQL statement satisfying x_i, and y_i is the result of executing the SQL statement z_i on the database c_i; the training sample does not contain the correct semantic representation z_i of x_i.
Generally, compared with the prior art, the technical scheme of the invention has the following beneficial effects: the method mitigates the influence of false positive samples during training, so the semantic parsing model obtained by training produces more accurate parsing results; comparative experiments against existing methods on public data sets show that the accuracy of the semantic parsing results of the method is improved by about 6%.
Drawings
FIG. 1 is a flowchart of the weakly supervised semantic parsing method based on false positive sample detection according to an embodiment of the present invention;
FIG. 2 illustrates the method for training the formal representation model of the corpus by stochastic gradient descent according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
As shown in FIG. 1, the present invention provides a weakly supervised semantic parsing method based on false positive sample detection, which includes:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ;
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
The key technical points of the technical scheme of the invention are described as follows:
Compute the similarity s(x_i, x_j) of every pair of corpora x_i, x_j in the training sample set S = {(x_i, c_i, y_i)}. Suppose the corpora x_i, x_j contain n_i and n_j words respectively, and let v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) be the pre-trained word vectors (e.g., GloVe) of these words. The similarity of the corpora x_i, x_j is defined as:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k};
the corpus similarities are computed once during preprocessing and are not changed during training.
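A minimal sketch of this preprocessing step is given below, assuming each corpus has already been tokenized and that a dictionary of pre-trained word vectors (e.g., GloVe) is available; representing a corpus by the mean of its word vectors follows the definition above, while the function and variable names are illustrative.

```python
# Sketch of the corpus-similarity preprocessing: each corpus is represented by
# the mean of its pre-trained word vectors, and s(x_i, x_j) is the L2 distance
# between these corpus vectors, computed once before training starts.
import numpy as np

def corpus_vector(tokens, word_vectors, dim=300):
    """Mean of the pre-trained vectors of the words in one corpus."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:                      # no known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def pairwise_similarity(corpora, word_vectors, dim=300):
    """s(x_i, x_j) = ||v_i - v_j||_2 for every pair of corpora."""
    vs = np.stack([corpus_vector(toks, word_vectors, dim) for toks in corpora])
    diff = vs[:, None, :] - vs[None, :, :]
    return np.linalg.norm(diff, axis=-1)  # (N, N) matrix, fixed during training
```

Note that s(x_i, x_j) as defined is an L2 distance, so smaller values correspond to more similar corpora.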
In the training sample set S = {(x_i, c_i, y_i)}, x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i. Typically, c_i may be a database, x_i a retrieval requirement expressed in natural language, z_i an SQL statement satisfying x_i, and y_i the result of executing the SQL statement z_i on the database c_i; the training samples do not contain the correct semantic representation z_i of x_i.
The objective function of the weakly supervised semantic parsing problem based on false positive sample detection combines, for each training sample, the expected return G_θ(x_i, y_i) with the return-consistency metric k_i [formula shown as an image in the original].
Here G_θ(x_i, y_i) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
where P is the set of all possible formal representations and z ∈ P is one of the possible formal representations.
The return function R(z) is the return obtained by executing z in the environment c: a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
where c(z) denotes the result of executing the formal representation z of the corpus in the environment c.
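In the NL-to-SQL setting used throughout this description, this 0/1 return can be computed by executing the candidate program and comparing its output with the annotated result. The sketch below assumes the environment c is a database connection and the formal representation z is an SQL string; the helper name is illustrative.

```python
# 0/1 return: execute the candidate formal representation z (an SQL string)
# in the environment c (a database connection) and compare with the expected
# result y from the training sample.
def reward(z_sql, c_conn, y_expected):
    try:
        result = c_conn.execute(z_sql).fetchall()  # c(z)
    except Exception:                              # ill-formed programs earn no return
        return 0.0
    return 1.0 if result == y_expected else 0.0
```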
k_i is the return-consistency metric of the corpus x_i, defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original].
formalized representation model p for training corpus by adopting stochastic gradient descent method θ (z|x i ) Determining a parameter theta, wherein the training process is as follows:
(1) randomly taking one training sample in the training sample set S and recording as (x) i ,c i ,y i ),
(2) According to p θ (z|x i ) Sampling multiple possible formalized representations z i,k },
(3) Calculating sampled x i Is expressed in a formal representation z i,k Mean value of the returns of x i Is expressed in a formal manner θ (x i ,y i )
Figure BDA0002680613040000063
(4) Calculating corpus x i Is reported consistent metric k i
(5) According to the above (x) i ,c i ,y i )、{z i,k }、G θ (x i ,y i ) Calculating the objective function gradient:
Figure BDA0002680613040000064
(6) the parameters theta are updated in such a way that,
Figure BDA0002680613040000065
(7) if the objective function converges and ends the training, otherwise, go back to (1)
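For illustration, the sketch below assembles steps (1) to (7) into a single training loop. The model interface (sample_k, log_prob), the use of a standard torch optimizer, the concrete form of the return-consistency term, and the REINFORCE-style surrogate loss are assumptions (the patent gives the objective function, its gradient, and k_i only as equation images), so the code is an interpretation of the procedure rather than a verbatim implementation.

```python
# Illustrative training loop for steps (1)-(7). Assumed: a seq2seq policy
# `model` exposing sample_k(x, K) and log_prob(z, x), a torch optimizer, the
# reward() helper sketched earlier, and a similarity-based down-weighting as
# one possible realization of the return-consistency term k_i.
import random
import torch

def train(model, optimizer, samples, dist, num_candidates=8, alpha=1.0, max_steps=100000):
    """samples: list of (x_i, c_i, y_i); dist: precomputed matrix s(x_i, x_j)."""
    n = len(samples)
    G = [0.0] * n                                   # expected returns, initialized to 0
    for step in range(max_steps):
        i = random.randrange(n)                     # step (1): draw one training sample
        x_i, c_i, y_i = samples[i]

        z_list = model.sample_k(x_i, num_candidates)        # step (2): sample {z_{i,k}}
        rewards = [reward(z, c_i, y_i) for z in z_list]     # R(z_{i,k})
        G[i] = sum(rewards) / len(rewards)                  # step (3): Monte Carlo estimate

        # step (4), assumed form: k_i penalizes disagreement in return with
        # similar corpora (small distance -> large weight).
        k_i = sum(abs(G[i] - G[j]) / (1.0 + dist[i][j]) for j in range(n) if j != i)

        # steps (5)-(6), assumed REINFORCE-style surrogate: scale the policy-
        # gradient update down when the return-consistency penalty k_i is large.
        scale = max(0.0, 1.0 - alpha * k_i)
        loss = torch.zeros(())
        for z, r in zip(z_list, rewards):
            loss = loss - scale * (r - G[i]) * model.log_prob(z, x_i)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()          # gradient step with the optimizer's learning rate
        # step (7): a convergence check on the objective would terminate the loop here.
```

Scaling the update down when similar corpora disagree in their returns is one way to realize the return-consistency idea described above; the patent's exact objective may differ.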
As shown in FIG. 2, the present invention provides a method for training the formal representation model of the corpus by stochastic gradient descent, comprising:
Step 1: initialize the return expectations
The expected return of every training sample is set to 0: G_θ(x_i, y_i) ← 0 (i = 1, 2, ...).
Step 2: compute the similarity of the training samples
Compute the similarity s(x_i, x_j) of every pair of corpora x_i, x_j in the training sample set S = {(x_i, c_i, y_i)}. Suppose the corpora x_i, x_j contain n_i and n_j words respectively, and let v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) be the pre-trained word vectors (e.g., GloVe) of these words. The similarity of the corpora x_i, x_j is then:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k}.
Step 3: train the formal representation model p_θ(z|x_i)
The formal representation model p_θ(z|x_i) of the corpus is trained by stochastic gradient descent to determine its parameter θ; the training process is as follows:
Step 3-1: randomly take one training sample from the training sample set S, denoted (x_i, c_i, y_i);
Step 3-2: sample multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
Step 3-3: compute the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i):
G_θ(x_i, y_i) ≈ (1/K) Σ_{k=1}^{K} R(z_{i,k}), where K is the number of sampled representations;
Step 3-4: compute the return-consistency metric k_i of the corpus x_i [formula shown as an image in the original];
Step 3-5: compute the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original];
Step 3-6: update the parameter θ by a stochastic gradient step with learning rate λ using the objective-function gradient;
Step 3-7: if the objective function has converged, end the training; otherwise, return to Step 3-1.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (5)

1. A weakly supervised semantic parsing method based on false positive sample detection, characterized by comprising the following steps:
(1) training the formal representation model p_θ(z|x_i) of the corpus by stochastic gradient descent to determine its parameter θ, including:
(1.1) randomly taking one training sample from the training sample set S, denoted (x_i, c_i, y_i), wherein x_i is a corpus expressed in natural language, c_i is the execution environment of its formal representation z_i, and y_i is the result of executing z_i in c_i;
(1.2) sampling multiple possible formal representations {z_{i,k}} according to p_θ(z|x_i);
(1.3) computing the average of the returns of the sampled formal representations {z_{i,k}} of x_i as the estimate of the expected return G_θ(x_i, y_i) of the formal representations of x_i, wherein G_θ(x_i, y_i) is the expected return of the formal representations of x_i generated by the stochastic policy p_θ(z|x_i) with parameter θ:
G_θ(x_i, y_i) = Σ_{z ∈ P} p_θ(z|x_i) · R(z),
wherein R(z) is the return obtained by executing z in the environment c, P is the set of all possible formal representations, and z ∈ P is one of the possible formal representations; a return of 1 is obtained if the execution result is the same as the result y in the training sample, and the return is 0 otherwise, i.e.:
R(z) = 1 if c(z) = y, and R(z) = 0 otherwise,
wherein c(z) denotes the result of executing the formal representation z of the corpus in the environment c;
(1.4) computing the return-consistency metric k_i of the corpus x_i, wherein k_i is defined in terms of the similarities s(x_i, x_j) between x_i and the other corpora and their returns [formula shown as an image in the original], and s(x_i, x_j) is the similarity of the two corpora x_i, x_j;
(1.5) computing the gradient of the objective function from (x_i, c_i, y_i), {z_{i,k}}, G_θ(x_i, y_i) and k_i [gradient formula shown as an image in the original];
(1.6) updating the parameter θ using the objective-function gradient;
(1.7) if the objective function has converged, ending the training; otherwise, returning to step (1.1), wherein the objective function combines the expected return G_θ(x_i, y_i) of each training sample with its return-consistency metric k_i [formula shown as an image in the original];
(2) using the trained formal representation model p_θ(z|x_i) to produce the formal semantic description of the natural language to be formally represented.
2. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein the similarity s(x_i, x_j) of two corpora x_i, x_j is:
s(x_i, x_j) = ||v_i - v_j||_2, with v_i = (1/n_i) Σ_{k=1}^{n_i} v_{i,k} and v_j = (1/n_j) Σ_{k=1}^{n_j} v_{j,k},
wherein the corpora x_i, x_j contain n_i and n_j words respectively, and v_{i,k} (k = 1, ..., n_i) and v_{j,k} (k = 1, ..., n_j) are the pre-trained word vectors of these words.
3. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein step (1.6) specifically updates the parameter θ by a stochastic gradient step using the objective-function gradient [update formula shown as an image in the original], where λ is the learning rate.
4. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein the corpus similarities are calculated during preprocessing and are not changed during training.
5. The weakly supervised semantic parsing method based on false positive sample detection according to claim 1, wherein c_i is a database, x_i is a retrieval requirement expressed in natural language, z_i is an SQL statement satisfying x_i, and y_i is the result of executing the SQL statement z_i on the database c_i; the training sample does not contain the correct semantic representation z_i of x_i.
CN202010961187.7A 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection Active CN112115723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010961187.7A CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010961187.7A CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Publications (2)

Publication Number Publication Date
CN112115723A CN112115723A (en) 2020-12-22
CN112115723B true CN112115723B (en) 2022-08-12

Family

ID=73803017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010961187.7A Active CN112115723B (en) 2020-09-14 2020-09-14 Weak supervision semantic analysis method based on false positive sample detection

Country Status (1)

Country Link
CN (1) CN112115723B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108283809A (en) * 2018-02-11 2018-07-17 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN109740588A (en) * 2018-12-24 2019-05-10 中国科学院大学 The X-ray picture contraband localization method reassigned based on the response of Weakly supervised and depth
CN110598609A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Weak supervision target detection method based on significance guidance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3769264A1 (en) * 2018-05-18 2021-01-27 Deepmind Technologies Limited Meta-gradient updates for training return functions for reinforcement learning systems
CN111382253B (en) * 2020-03-02 2022-07-15 思必驰科技股份有限公司 Semantic parsing method and semantic parser

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108283809A (en) * 2018-02-11 2018-07-17 腾讯科技(深圳)有限公司 Data processing method, device, computer equipment and storage medium
CN109740588A (en) * 2018-12-24 2019-05-10 中国科学院大学 The X-ray picture contraband localization method reassigned based on the response of Weakly supervised and depth
CN110598609A (en) * 2019-09-02 2019-12-20 北京航空航天大学 Weak supervision target detection method based on significance guidance

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
William Duffy et al. User-trained activity recognition using smartphones and weak supervision. 2019 30th Irish Signals and Systems Conference (ISSC), 2019, pp. 1-5. *
付治 et al. A weakly supervised learning framework based on k labeled samples. 软件学报 (Journal of Software), 2020, vol. 31, no. 4, pp. 981-990. *

Also Published As

Publication number Publication date
CN112115723A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
Chen et al. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing
US20210117625A1 (en) Semantic parsing of natural language query
CN111931506A (en) Entity relationship extraction method based on graph information enhancement
US20220129450A1 (en) System and method for transferable natural language interface
CN111782807B (en) Self-bearing technology debt detection classification method based on multiparty integrated learning
CN114168740A (en) Transformer concurrent fault diagnosis method based on graph convolution neural network and knowledge graph
Sharath et al. Question answering over knowledge base using language model embeddings
CN116861269A (en) Multi-source heterogeneous data fusion and analysis method in engineering field
Balaji et al. Text summarization using NLP technique
CN106448660A (en) Natural language fuzzy boundary determining method with introduction of big data analysis
CN111581365B (en) Predicate extraction method
CN112115723B (en) Weak supervision semantic analysis method based on false positive sample detection
WO2021160822A1 (en) A method for linking a cve with at least one synthetic cpe
CN116720498A (en) Training method and device for text similarity detection model and related medium thereof
Zhang et al. Dependency-aware form understanding
CN111767388B (en) Candidate pool generation method
Avram et al. UPB at SemEval-2021 task 8: extracting semantic information on measurements as multi-turn question answering
CN114579763A (en) Character-level confrontation sample generation method for Chinese text classification task
CN116881471B (en) Knowledge graph-based large language model fine tuning method and device
CN115295134B (en) Medical model evaluation method and device and electronic equipment
CN117520567B (en) Knowledge graph-based large language model training method
JP7384354B2 (en) Information processing device, information processing method and program
Haque et al. Label Smoothing Improves Neural Source Code Summarization
Fan et al. Zero-Shot Event Detection Based on Prompt and Deep Prototype Clustering
CN116628199A (en) Weak supervision text classification method and system with enhanced label semantics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant