CN110609995B

CN110609995B - Method and device for constructing Tibetan language question-answer corpus

Info

Publication number: CN110609995B
Application number: CN201810617055.5A
Authority: CN
Inventors: 孙媛; 夏天赐
Original assignee: Minzu University of China
Current assignee: Minzu University of China
Priority date: 2018-06-15
Filing date: 2018-06-15
Publication date: 2023-06-27
Anticipated expiration: 2038-06-15
Also published as: CN110609995A

Abstract

The invention provides a method and a device for constructing a Tibetan language question-answer corpus, belonging to the field of big data processing. The method comprises the following steps: selecting a Tibetan triplet entity as a central word entity, and acquiring all triples related to the central word entity; mapping all entities in those triples into a correspondence between entities and labels; and constructing a Tibetan language question-answer corpus according to the correspondence and the central word entity. By finding all triples related to a Tibetan triplet entity and mapping them into the correspondence between entities and labels, the scheme constructs a Tibetan question-answer corpus and avoids the time and labor costs of manual participation.

Description

Method and device for constructing Tibetan language question-answer corpus

Technical Field

The invention relates to the technical field of big data processing, in particular to a method and equipment for constructing a Tibetan language question-answer corpus.

Background

The question-answering system is a very important research hotspot in the field of natural language processing in recent years, which allows users to ask questions in natural language and then return a relatively accurate, satisfactory answer to the user.

Compared with the resource-rich Chinese and English question-answering systems, the Tibetan language question-answering system suffers from extremely scarce corpus data of a single type, and Chinese-Tibetan and English-Tibetan translation technologies are deficient, so Chinese and English question-answer corpora are difficult to apply directly to Tibetan. As a result, no large-scale Tibetan language question-answer corpus has been established so far.

Disclosure of Invention

The embodiment of the invention provides a method and equipment for constructing a Tibetan language question-answer corpus, which aims to provide a scheme for constructing the Tibetan language question-answer corpus by utilizing the existing triplet entity, optimize natural question sentences in the constructed Tibetan language question-answer corpus and realize expansion of the Tibetan language question-answer corpus according to a Tibetan language knowledge base and the optimized natural question sentences.

In a first aspect, an embodiment of the present invention provides a method for constructing a Tibetan language question-answer corpus, where the method includes:

taking one triplet entity as a central word entity, and acquiring all triples related to the central word entity;

mapping all entities in all triples into corresponding relations between the entities and labels;

and constructing a Tibetan language question-answering corpus according to the corresponding relation and the central word entity.

In another aspect, an embodiment of the present invention provides a device for constructing a Tibetan language question-answer corpus, where the device includes: a module for constructing the Tibetan language question-answer corpus and a module for optimizing the Tibetan language question-answer corpus;

the constructing module is used for selecting a triplet entity as a central word entity and acquiring all triples related to the central word entity; mapping all entities in all triples into corresponding relations between the entities and labels, and constructing a Tibetan language question-answer corpus according to the corresponding relations and the central word entity;

the optimizing module calculates the vector of the template question and the vector of the real question; obtains the probability distribution of the template question by using a neural network according to the vector of the template question; and detects whether the template question is valid according to the vector of the real question and the probability distribution of the template question.

The beneficial effects are as follows:

the invention mainly utilizes entity-relation-entity triples in an existing Tibetan language knowledge base to construct a Tibetan language question-answer corpus and thereby generate natural question sentences. In addition, the grammar and semantic structure of the natural questions in the question-answer corpus are corrected and optimized by an adversarial neural network, and an end-to-end neural network model is then trained on the knowledge base together with the natural questions, so that the Tibetan language question-answer corpus is expanded automatically.

Drawings

Specific embodiments of the invention will be described below with reference to the accompanying drawings, in which:

FIG. 1 is a schematic flow chart of a method for constructing a Tibetan question-answer corpus in accordance with an embodiment of the present invention;

FIG. 2 is a schematic logic diagram of constructing a Tibetan question-answer corpus in a second embodiment of the present invention;

FIG. 3 is a schematic diagram of optimizing a constructed Tibetan question-answer corpus in accordance with a second embodiment of the present invention;

FIG. 4 shows a flowchart of a method for optimizing the constructed Tibetan question-answer corpus in a second embodiment of the invention;

FIG. 5 shows a logic schematic diagram for expanding a Tibetan language question-answer corpus in a third embodiment of the invention.

Detailed Description

In order to make the technical solution and advantages of the present invention more apparent, the following detailed description of exemplary embodiments of the present invention is provided in conjunction with the accompanying drawings, it being apparent that the described embodiments are only some, but not all embodiments of the present invention. And the embodiments and features of the embodiments in this description may be combined with each other without conflict.

The inventors noted during the course of the invention that, at present, no scheme obtains a Tibetan language corpus by training a deep learning model; constructing a Tibetan language question-answer corpus in this way can effectively improve question-answering efficiency.

Example 1

FIG. 1 shows a flowchart of a method for constructing a Tibetan question-answer corpus in an embodiment of the invention, the method comprising:

step 101, taking a Tibetan triplet entity as a central word entity, and acquiring all triples related to the central word entity;

step 102, mapping all entities in all triples into corresponding relations between the entities and labels;

step 103, constructing a Tibetan question-answering corpus according to the corresponding relation and the central word entity.

In step 101, a Tibetan triplet entity may be randomly selected as the central word entity. As shown in FIG. 2, the selected triplet is <[Tibetan text not reproduced], Father, jersey>.

The labels in step 102 include shallow labels, which are not related to the triplet attributes (typically person, place, organization, etc.), and deep labels, which are related to the triplet attributes. For example, from a triplet of the form <[Tibetan entity not reproduced], death time, 1895>, it can be judged that the entity is not only a person but also a deceased person.
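As a non-limiting illustration, steps 101 and 102 can be sketched as follows. The entity names, the `SHALLOW` table, and the `DEEP_RULES` table are hypothetical English stand-ins; the source does not specify how labels are stored or derived.

```python
from collections import defaultdict

# Hypothetical toy knowledge base. Names are English stand-ins for the
# Tibetan strings, which are not reproduced in the source text.
TRIPLES = [
    ("person_A", "father", "person_B"),
    ("person_A", "death_time", "1895"),
    ("person_A", "birth_place", "place_C"),
    ("person_D", "birth_place", "place_C"),
]

# Assumed shallow labels: independent of any triple attribute.
SHALLOW = {
    "person_A": "person", "person_B": "person", "person_D": "person",
    "place_C": "place", "1895": "time",
}

# Assumed rule table: a relation that implies a deep label for its subject.
DEEP_RULES = {"death_time": "deceased person"}


def related_triples(center, triples):
    """Step 101: keep every triple in which the central entity appears."""
    return [t for t in triples if center in (t[0], t[2])]


def entity_labels(triples):
    """Step 102: map each entity to its shallow and deep labels."""
    labels = defaultdict(set)
    for subj, rel, obj in triples:
        for entity in (subj, obj):
            if entity in SHALLOW:
                labels[entity].add(SHALLOW[entity])
        if rel in DEEP_RULES:
            labels[subj].add(DEEP_RULES[rel])
    return dict(labels)


center_triples = related_triples("person_A", TRIPLES)
mapping = entity_labels(center_triples)
```

Here `mapping["person_A"]` carries both the shallow label and the deep label implied by the death-time triple, mirroring the example in the text.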

Step 103 specifically includes: constructing a center graph according to the corresponding relation and the central word entity, wherein the center graph comprises nodes and edges; and performing graph queries according to the center graph and the central word entity to construct the Tibetan language question-answer corpus.

The nodes in the center graph comprise center word nodes and associated nodes, edges represent the relation between two entities, in practical application, the center word nodes can be represented by double rectangles, and the associated nodes can be represented by single rectangles;

specifically, constructing the center graph according to the corresponding relation and the central word entity includes:

selecting the first entity of the central triplet as the central word and representing it by a double rectangle; sequentially adding the entities in the entity-mapping table to the center graph, each represented by a single rectangle; and establishing the relationships between entities according to the selected triples, represented by arrows.
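The construction just described can be illustrated with a minimal sketch. The `CenterGraph` class, its `query` method, and the sample data are assumptions made for illustration; the source only states that the graph has nodes, edges, and supports graph queries.

```python
from dataclasses import dataclass, field


@dataclass
class CenterGraph:
    center: str                                  # central word node ("double rectangle")
    nodes: set = field(default_factory=set)      # central word plus associated nodes
    edges: list = field(default_factory=list)    # directed (source, relation, target)

    def query(self, relation):
        """Return the targets reached from the central word via `relation`."""
        return [t for s, r, t in self.edges if s == self.center and r == relation]


def build_center_graph(center_triple, entity_table, triples):
    """Build the graph from the central triple, the entity table, and the triples."""
    center = center_triple[0]            # first entity of the central triple
    graph = CenterGraph(center=center)
    graph.nodes.add(center)
    for entity in entity_table:          # associated nodes ("single rectangles")
        graph.nodes.add(entity)
    for subj, rel, obj in triples:       # arrows: relations between entities
        if subj in graph.nodes and obj in graph.nodes:
            graph.edges.append((subj, rel, obj))
    return graph


# Hypothetical sample data with English stand-in names.
triples = [
    ("person_A", "father", "person_B"),
    ("person_A", "death_time", "1895"),
]
entity_table = {"person_B": {"person"}, "1895": {"time"}}
graph = build_center_graph(triples[0], entity_table, triples)
```

A call such as `graph.query("death_time")` then plays the role of the graph query used to fill a question template.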

In practical application, the constructed Tibetan language question-answer corpus contains natural question sentences. The general correspondence rules between labels and interrogative words are: person → [Tibetan interrogative not reproduced] (who), place → [Tibetan interrogative not reproduced] (where), time → [Tibetan interrogative not reproduced] (when), etc.

As shown in FIG. 2, the label of the selected central word is person, which corresponds to the interrogative word for "who". From the center graph, the cause of death of the central word ([Tibetan text not reproduced]) and the death time of the central word (1895) are known, thus generating the question template: [Tibetan text not reproduced].
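As a rough, non-limiting sketch of how a question template might be produced from the central word's label and the facts read off the center graph: the English interrogatives and the English sentence pattern below are placeholders for the Tibetan ones, which are not available in the source, and the function name `make_template` is hypothetical.

```python
# Label -> interrogative word. English placeholders stand in for the
# Tibetan interrogatives (who / where / when).
INTERROGATIVE = {"person": "who", "place": "where", "time": "when"}


def make_template(center, center_label, edges):
    """Return (question, answer): the question asks for the central word,
    constrained by the facts attached to it in the center graph."""
    wh = INTERROGATIVE[center_label]
    facts = [
        f"{rel.replace('_', ' ')} is {obj}"
        for subj, rel, obj in edges
        if subj == center
    ]
    if not facts:
        raise ValueError("no facts about the central word to build a question from")
    question = f"{wh} is the one whose " + " and whose ".join(facts) + "?"
    return question, center


# Hypothetical edges of a center graph (stand-in names).
edges = [
    ("person_A", "death_cause", "illness_X"),
    ("person_A", "death_time", "1895"),
]
question, answer = make_template("person_A", "person", edges)
```

The answer to the generated question is the central word itself, so each template yields one question-answer pair for the corpus.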

According to the method provided by the invention, the Tibetan question-answer corpus is constructed by finding all triples related to Tibetan triple entities and mapping the triples into the corresponding relation between the entities and the labels, so that the defects of time consumption and labor consumption of manual participation are overcome.

Example two

In the second embodiment of the present invention, on the basis of constructing the Tibetan language question-answer corpus in the above embodiment, a scheme for optimizing natural sentences in the constructed Tibetan language question-answer corpus is added.

FIG. 3 shows a logic schematic diagram of optimizing natural sentences in a Tibetan language question-answer corpus in an embodiment of the present invention, where the natural sentences in the Tibetan language question-answer corpus include template question sentences and real question sentences. The specific optimization steps are shown in FIG. 4:

step 201, calculating vectors of template questions in a Tibetan language question-answering corpus and vectors of real questions in the Tibetan language question-answering corpus;

specifically, a vector is obtained for each word with the word2vec tool, and the word vectors of a sentence are added dimension by dimension to obtain the sentence vector expression. The vector expression of the template question is denoted as Z, and the vector expression of the real question is denoted as X.

Step 202, obtaining probability distribution of the template question by using a neural network according to the vector of the template question;

specifically, G in FIG. 3 represents the neural network of the generative model. Its input is the vector Z of the template question, and its output is the probability distribution G(Z) of the template question.

Step 203, detecting whether the template question is valid or not according to the vector of the real question and the probability distribution of the template question.

Specifically, D in FIG. 3 represents the neural network of the discriminative model. Its inputs are the vector X of the real question and the probability distribution G(Z) output by the generative model, and its output is a scalar Y. The template question is generally considered valid when the value of Y is 0.5.
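The sentence-vector computation of step 201 can be illustrated with a minimal sketch. The small hand-written `embeddings` table is a hypothetical stand-in for vectors that would in practice come from a trained word2vec model, and skipping unknown words is an assumption not stated in the source.

```python
def sentence_vector(tokens, embeddings):
    """Add the word vectors of a sentence dimension by dimension."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        total = [a + b for a, b in zip(total, vec)]
    return total


# Hypothetical 3-dimensional word vectors (stand-ins for word2vec output).
embeddings = {
    "who": [1.0, 0.0, 0.5],
    "died": [0.0, 2.0, 0.5],
    "1895": [0.5, 0.5, 0.0],
}

Z = sentence_vector(["who", "died", "1895"], embeddings)  # template question
```

The same function would be applied to a real question to obtain X.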

According to the embodiment of the invention, the natural sentences in the constructed Tibetan language question-answering corpus are optimized by an adversarial neural network. By comparison with real Tibetan questions, the semantics and grammatical structure of the template-generated natural questions are adjusted continuously, so that the natural questions become more natural and accurate and manual intervention is reduced.
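The source does not state the training objective of G and D. Assuming the standard generative adversarial formulation, the role of the value 0.5 can be seen as follows, where $p_{\mathrm{real}}$ (distribution of real question vectors X), $p_{\mathrm{template}}$ (distribution of template question vectors Z), and $p_{G}$ (distribution of G(Z)) are notation introduced here for illustration:

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{X \sim p_{\mathrm{real}}}\big[\log D(X)\big]
  + \mathbb{E}_{Z \sim p_{\mathrm{template}}}\big[\log\big(1 - D(G(Z))\big)\big],
\qquad
D^{*}(x) = \frac{p_{\mathrm{real}}(x)}{p_{\mathrm{real}}(x) + p_{G}(x)}
```

Under this assumption, when the generator output is indistinguishable from real questions, $p_{G} = p_{\mathrm{real}}$ and the optimal discriminator outputs $D^{*}(x) = 1/2$, which is consistent with treating Y = 0.5 as the indication that a template question is valid.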

Example III

In the third embodiment of the invention, on the basis of the Tibetan language question-answer corpus optimized in the second embodiment, a scheme for expanding the Tibetan language question-answer corpus is added, in which an end-to-end neural network is trained.

As shown in fig. 5, the specific scheme includes:

Corpus construction: according to the Tibetan language question-answering corpus constructed in the first embodiment and the valid template questions obtained in the second embodiment, quadruples are constructed, wherein the order of each quadruple is subject, relation, object and question;

Encoding: vector expressions of the entities and relations in the Tibetan language question-answer corpus are obtained by using the TransE algorithm, yielding a subject vector expression, a relation vector expression and an object vector expression, from which a triplet word vector expression is formed;

For example, $\mathrm{Enc}(F) = \{\mathrm{Enc}(F)_s,\ \mathrm{Enc}(F)_r,\ \mathrm{Enc}(F)_o\}$, where $\mathrm{Enc}(F)_s$ represents the subject vector expression, $\mathrm{Enc}(F)_r$ represents the relation vector expression, and $\mathrm{Enc}(F)_o$ represents the object vector expression. The three vectors are then spliced (concatenated) to form the triplet word vector expression, which is input to the decoding stage.

Decoding: using an LSTM neural network and an attention mechanism, the triplet word vector expression is mapped to the corresponding natural question sentence, and an end-to-end neural network is thereby trained.

In the quadruple, the subject and the object respectively represent the first entity and the second entity of a triplet in the Tibetan language question-answer corpus; the relation represents the relationship between the two entities; and the question represents the natural question generated after validity has been detected, that is, the natural question produced by the adversarial neural network in the above embodiment.
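As a non-limiting illustration of the quadruple and of the encoding stage, the sketch below uses hand-picked vectors in place of trained TransE embeddings. The names `Quadruple`, `transe_distance` and `encode` are hypothetical; the only TransE property relied on is that a plausible triple satisfies subject + relation ≈ object.

```python
import math
from typing import NamedTuple


class Quadruple(NamedTuple):
    subject: str
    relation: str
    obj: str
    question: str


def transe_distance(s, r, o):
    """TransE models a true triple as s + r ≈ o; a smaller distance means
    the triple is more plausible under the embedding."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(s, r, o)))


def encode(quad, entity_vecs, relation_vecs):
    """Concatenate Enc(F)_s, Enc(F)_r and Enc(F)_o into one input vector."""
    return (
        entity_vecs[quad.subject]
        + relation_vecs[quad.relation]
        + entity_vecs[quad.obj]
    )


# Hand-picked 2-dimensional vectors (assumption: not trained embeddings).
entity_vecs = {"person_A": [1.0, 0.0], "1895": [1.0, 1.0]}
relation_vecs = {"death_time": [0.0, 1.0]}

quad = Quadruple("person_A", "death_time", "1895", "when did person_A die?")
enc = encode(quad, entity_vecs, relation_vecs)
dist = transe_distance(
    entity_vecs["person_A"], relation_vecs["death_time"], entity_vecs["1895"]
)
```

The concatenated vector `enc` corresponds to the triplet word vector expression that is passed to the decoding stage.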

According to the scheme provided by the embodiment of the invention, the obtained Tibetan language question-answering training corpus is used to train an end-to-end neural network model whose network parameters are continuously adjusted and optimized, so that the Tibetan language question-answer corpus is expanded automatically and manual intervention is reduced.
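The source names an attention mechanism for the decoding stage without specifying its form. Purely as an assumption, a dot-product attention step over the three encoder vectors (subject, relation, object) could look like the sketch below; the LSTM itself is omitted, and `decoder_state` stands in for its hidden state at one decoding step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_vecs):
    """Dot-product attention: weight each encoder vector by its similarity
    to the decoder state and return the weights and the context vector."""
    scores = [
        sum(d * e for d, e in zip(decoder_state, vec)) for vec in encoder_vecs
    ]
    weights = softmax(scores)
    dim = len(encoder_vecs[0])
    context = [
        sum(w * vec[i] for w, vec in zip(weights, encoder_vecs))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical 2-dimensional encoder vectors for subject, relation, object.
encoder_vecs = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
weights, context = attend([1.0, 0.0], encoder_vecs)
```

In a full decoder the context vector would be combined with the LSTM state to predict the next word of the natural question.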

Example IV

Based on the same inventive concept, the embodiments of the present invention further provide a device for constructing a Tibetan language question-answering corpus. Since the principle by which the device solves the problem is similar to that of the method for constructing the Tibetan language question-answering corpus described in the above embodiments, the implementation of the device may refer to the implementation of the method, and repeated details are omitted.

In this embodiment, the device for constructing the Tibetan language question-answer corpus includes: a module for constructing the Tibetan language question-answer corpus and a module for optimizing the Tibetan language question-answer corpus;

the constructing module selects a triplet entity as a central word entity and acquires all triples related to the central word entity; maps all entities in all triples into corresponding relations between the entities and labels; and constructs a Tibetan language question-answer corpus according to the corresponding relations and the central word entity;

the optimizing module calculates the vector of the template question sentence and the vector of the real question sentence; obtains the probability distribution of the template question by using a neural network according to the vector of the template question; and detects whether the template question is valid according to the vector of the real question and the probability distribution of the template question.

Further, the device also comprises a module for expanding the Tibetan language question-answer corpus, which is specifically used for:

constructing a quadruple according to the Tibetan language question-answer corpus generated by the constructing module and the valid template question generated by the optimizing module, wherein the order of the quadruple is subject, relation, object and question;

obtaining the vectors of the entities and the relations in the Tibetan language question-answer corpus by using the TransE algorithm, obtaining a subject vector expression, a relation vector expression and an object vector expression, and forming a triplet word vector expression according to the subject vector expression, the relation vector expression and the object vector expression;

and mapping the triplet word vector expression into the corresponding natural question sentence according to the LSTM neural network and the attention mechanism.

The device for constructing the Tibetan language question-answering corpus provided by the invention constructs the corpus by finding all triples related to a Tibetan triplet entity and mapping them into the corresponding relation between the entities and the labels, thereby avoiding the time and labor costs of manual participation. The natural sentences in the constructed Tibetan language question-answering corpus are optimized by an adversarial neural network: by comparison with real Tibetan questions, the semantics and grammatical structure of the template-generated natural questions are adjusted continuously, so that the natural questions become more natural and accurate and manual intervention is reduced.

Moreover, the network parameters of an end-to-end neural network model are continuously adjusted and optimized, so that the Tibetan language question-answer corpus is expanded automatically by the model and manual intervention is reduced.

For convenience of description, the above apparatus are each described as functionally divided into various modules or units, respectively. Of course, the functions of each module or unit may be implemented in the same piece or pieces of software or hardware when implementing the present invention.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

Claims

1. A method of constructing a Tibetan language question-answering corpus, the method comprising:

taking a Tibetan triplet entity as a central word entity, and acquiring all triples related to the central word entity;

mapping all entities in all triples into corresponding relations between the entities and labels; the label comprises a shallow label and a deep label, wherein the shallow label is irrelevant to the triplet attribute, and the deep label is relevant to the triplet attribute;

constructing a Tibetan language question-answer corpus according to the corresponding relation and the central word entity; the Tibetan language question-answer corpus comprises natural sentences, wherein the natural sentences comprise template question sentences and real question sentences, and the corresponding rules of the natural sentences are: person → [Tibetan interrogative not reproduced] (who), place → [Tibetan interrogative not reproduced] (where), time → [Tibetan interrogative not reproduced] (when);

wherein constructing the Tibetan question-answering corpus according to the corresponding relation and the central word entity comprises the following steps:

constructing a center graph according to the corresponding relation and the center word entity, wherein the center graph comprises nodes and edges; the nodes in the center graph comprise center word nodes and associated nodes, and the edges represent the relationship between two entities;

according to the center graph and the center word entity, graph query is carried out, and a Tibetan language question-answer corpus is constructed;

the method further comprises the steps of:

calculating the vector of the template question and the vector of the real question;

obtaining probability distribution of the template question by using a neural network according to the vector of the template question;

detecting whether the template question is valid or not according to the vector of the real question and the probability distribution of the template question;

when the template question is detected to be valid, the method further comprises:

constructing a quadruple according to the Tibetan language question-answering corpus and the valid template question, wherein the order of the quadruple is subject, relation, object and question; the subject and the object respectively represent a first entity and a second entity of the triplet in the Tibetan language question-answer corpus; the relation represents the relationship between the two entities; the question represents a natural question sentence generated after the validity is detected;

obtaining vectors of entities and relations in the Tibetan language question-answer corpus by using a TransE algorithm, obtaining a subject vector expression, a relation vector expression and an object vector expression, and forming a triplet word vector expression according to the subject vector expression, the relation vector expression and the object vector expression;

2. An apparatus for constructing a Tibetan language question-answering corpus, the apparatus comprising: a module for constructing the Tibetan language question-answer corpus and a module for optimizing the Tibetan language question-answer corpus;

the optimizing module calculates the vector of the template question sentence and the vector of the real question sentence; obtains the probability distribution of the template question by using a neural network according to the vector of the template question; and detects whether the template question is valid according to the vector of the real question and the probability distribution of the template question;

the device also comprises a module for expanding the Tibetan language question-answer corpus, which is specifically used for:

constructing a quadruple according to the Tibetan language question-answer corpus generated by the constructing module and the valid template question sentence generated by the optimizing module, wherein the order of the quadruple is subject, relation, object and question;