CN113139043A - Question and answer sample generation method and device, electronic equipment and storage medium

Info

Publication number: CN113139043A
Application number: CN202110476855.1A
Authority: CN
Inventors: 张文君; 宋丹丹; 张玉东; 庞海龙
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-04-29
Filing date: 2021-04-29
Publication date: 2021-07-20
Anticipated expiration: 2041-04-29
Also published as: CN113139043B

Abstract

The application discloses a question and answer sample generation method and device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, deep learning and big data. The specific implementation scheme is as follows: acquiring a target question text and an auxiliary answer text set; and selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set so as to obtain a question-answer sample comprising the target question text and the target answer text. The method and the device for generating the negative question-answer sample can improve the generation efficiency of the negative question-answer sample.

Description

Question and answer sample generation method and device, electronic equipment and storage medium

Technical Field

The application relates to the technical field of data processing, in particular to an artificial intelligence technology, a big data technology and a deep learning technology, and specifically relates to a question and answer sample generation method and device, electronic equipment and a storage medium.

Background

With the development of science and technology and the continuous progress of internet technology, search-based interactive community question-answering platforms have become an important channel through which people acquire and share knowledge in life and work. Community Question Answering (CQA) is a kind of open knowledge-sharing website that relies on user participation and provides direct answers to questions by drawing on the collective wisdom of network users.

However, because CQA is open, the quality of its answers varies widely. Some answers help the questioner obtain information, while others fail to meet the questioner's needs: they do not address the question asked, or even contain irrelevant, low-quality, or malicious information. This variation in content quality is a major problem to be solved in question-and-answer communities.

Disclosure of Invention

The application provides a question and answer sample generation method and device, electronic equipment and a storage medium.

According to an aspect of the present application, there is provided a question and answer sample generation method, including:

acquiring a target question text and an auxiliary answer text set;

and selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set so as to obtain a question-answer sample comprising the target question text and the target answer text.

According to another aspect of the present application, there is provided a question and answer sample generation apparatus including:

the question-answering source data acquisition module is used for acquiring a target question text and an auxiliary answer text set;

and the target answer text screening module is used for selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set so as to obtain a negative question-answer sample comprising the target question text and the target answer text.

According to another aspect of the present application, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, and the instructions are executable by the at least one processor to enable the at least one processor to perform the question and answer sample generation method according to any embodiment of the present application.

According to another aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the question and answer sample generation method according to any one of the embodiments of the present application.

According to another aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the question and answer sample generation method according to any one of the embodiments of the present application.

The method and the device for generating the question and answer samples can improve the generation efficiency of the question and answer samples.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:

FIG. 1 is a schematic diagram of a question and answer sample generation method according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a question and answer sample generation method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a question and answer sample generation method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a question and answer sample generation method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a question and answer sample generation method according to an embodiment of the present application;

FIG. 6 is a schematic diagram of a method for training a question-answer correlation detection model according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a question-answer correlation detection model training according to an embodiment of the present application;

FIG. 8 is a schematic diagram of a question and answer sample generation apparatus according to an embodiment of the present application;

FIG. 9 is a block diagram of an electronic device for implementing a question and answer sample generation method according to an embodiment of the present application.

Detailed Description

The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Fig. 1 is a flowchart of a question-answer sample generation method disclosed in an embodiment of the present application, which may be applied to a case of generating a negative question-answer sample for training a question-answer correlation detection model. The method of this embodiment may be executed by a question and answer sample generating device, which may be implemented in software and/or hardware, and is specifically configured in an electronic device with a certain data operation capability, where the electronic device may be a client device, a mobile phone, a tablet computer, a vehicle-mounted terminal, a desktop computer, or a server-side device.

S101, obtaining a target question text and an auxiliary answer text set.

The target question text may refer to text containing a question, for example, "Are apples tasty?". The auxiliary answer text set comprises at least two auxiliary answer texts. An auxiliary answer text may be a text containing an answer, and it differs from the standard answer text of the target question text. The standard answer text of the target question text is the accurate answer corresponding to the target question text. Illustratively, the target question text is "Are apples tasty?", the standard answer text is "Tasty", and an auxiliary answer text is "Pineapples are tasty". At least one auxiliary answer text is selected from the auxiliary answer text set, and each selected auxiliary answer text forms a question-answer pair with the target question text to serve as a negative question-answer sample.

In fact, there is no correlation between the target question text and each auxiliary answer text in the auxiliary answer text set. The target question text can be obtained from question data collected from community question-answering platforms on the network. The question data is obtained by extracting, from interactive text having a question-answer relationship, any sentence whose semantics is a question. The auxiliary answer text set is obtained from answer data collected from community question-answering platforms on the network. The answer data is obtained by extracting, from interactive text having a question-answer relationship, any sentence whose semantics is an answer. Illustratively, the community question-answering platform may be an open community question-answering platform. The question and answer data may be in Chinese or in another language such as English. Both the question data and the answer data are collected randomly, so the target question text and the auxiliary answer texts in the auxiliary answer text set have no correlation.

S102, selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set so as to obtain a question-answer sample comprising the target question text and the target answer text.

A question-answer sample is a question-answer pair consisting of a question text and an answer text. A positive question-answer sample is a pair in which the answer text is relevant to the question text; a negative question-answer sample is a pair in which the answer text is irrelevant to the question text, that is, a pair whose answer does not address the question. For example, the pair formed by the question text "Are apples tasty?" and the answer text "Pineapples are tasty" is a negative question-answer sample; the pair formed by the question text "Are apples tasty?" and the answer text "Tasty" is a positive question-answer sample.

The similarity value describes the degree of similarity between the target question text and each auxiliary answer text. For example, the auxiliary answer text with the highest similarity value may be selected as the target answer text, or the i auxiliary answer texts with the highest similarity values may be selected as target answer texts. The similarity value may be calculated by at least one of Term Frequency-Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA), deep learning, and the like. Screening the target answer text according to the similarity between the target question text and each auxiliary answer text makes it possible to retrieve auxiliary answer texts having one or more specified similarities. Because the question text and the auxiliary answer text set are acquired randomly and have low relevance to each other, the screening yields answers that are similar to the question but not relevant to it. The screened target answer text and the low-relevance target question text then form a negative question-answer sample.
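Illustratively, the selection in S102 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the whitespace tokenizer, the smoothed IDF formula, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `select_target_answers` are all hypothetical, and TF-IDF with cosine similarity is only one of the similarity measures the description permits.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens with surrounding punctuation stripped.
    stripped = (w.strip(".,?!\"'").lower() for w in text.split())
    return [w for w in stripped if w]


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict) per input text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption) so terms present in every text keep a weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in docs]


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_target_answers(question, auxiliary_answers, top_i=1):
    """Return the top_i (similarity, answer) pairs, highest similarity first."""
    vectors = tfidf_vectors([question] + list(auxiliary_answers))
    q_vec, a_vecs = vectors[0], vectors[1:]
    scored = [(cosine(q_vec, v), a) for v, a in zip(a_vecs, auxiliary_answers)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_i]


question = "Are apples tasty?"
pool = ["Pineapples are tasty", "The train leaves at noon", "I like music"]
best_score, best_answer = select_target_answers(question, pool)[0]
# Label 0 marks a negative sample in this sketch (a hypothetical convention).
negative_sample = {"question": question, "answer": best_answer, "label": 0}
```

In this toy pool, only "Pineapples are tasty" shares words with the question, so it is selected as the target answer text even though it does not answer the question.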

In the prior art, irrelevant answers are generally identified either by relevance or by low-quality word lists. Relevance-based identification measures the degree of matching between a question and an answer by a relevance technique, and is used to address answers that do not address the question. Word-list-based identification manually mines a low-quality word list and uses word-list matching to identify low-quality answers, and is used to address question-answer content containing the characteristic words. However, consider the question text "Are apples tasty or not?", answer text 1 "An apple is a fruit", and answer text 2 "Pears are delicious". Both answer text 1 and answer text 2 are irrelevant to the question text. The relevance-based method has a limited scope: it suits only the identification of question-answer content with low relevance, and answers with a certain degree of relevance are difficult to identify accurately. The word-list-based method suits only low-quality content that hits the characteristic words, is prone to detection errors for low-quality content that does not hit them, and imposes a heavy manual workload for compiling the low-quality word list.

In view of the above, screening the target answer text according to similarity finds a target answer text that has a specified similarity to the target question text but is not relevant to it, and forms a question-answer sample. This reduces the labor cost of generating question-answer samples and improves generation efficiency, and a question-answer correlation detection model trained on these samples achieves higher detection accuracy. Moreover, multiple target answer texts can be selected by similarity screening, and different similarities can be specified to generate multiple question-answer samples, producing question-answer pairs with different similarities. This widens the similarity range covered by the negative question-answer samples and increases their diversity, and therefore their representativeness. A model trained on negative question-answer samples with different similarities can accurately detect negative question-answer samples with different similarities, which improves the accuracy of question-answer correlation detection.

According to the technical scheme of this embodiment, a target answer text is selected for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set, so as to generate the question-answer sample. This realizes automatic generation of question-answer samples, reduces the labor cost of mining them, improves their generation efficiency, increases their representativeness, and improves the accuracy of question-answer correlation detection.

Fig. 2 is a flowchart of another method for generating a question and answer sample disclosed in an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. After the target answer text has been selected for the target question text, the method further includes: acquiring a new auxiliary answer text from the auxiliary unlabeled answer set, and replacing the selected target answer text with the new auxiliary answer text to update the auxiliary answer text set for generating a new question-answer sample.

S201, obtaining a target question text and an auxiliary answer text set.

S202, selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set to obtain a question-answer sample comprising the target question text and the target answer text.

S203, after the target answer text has been selected for the target question text, acquiring a new auxiliary answer text from an auxiliary unlabeled answer set, and replacing the selected target answer text with the new auxiliary answer text to update the auxiliary answer text set for generating a new question-answer sample.

The auxiliary unlabeled answer set may refer to a set of answer texts without labeled data. It is used to form and update the auxiliary answer text set, from which a target answer text unrelated to the target question text is screened out. The new auxiliary answer text replaces the selected target answer text and is added to the auxiliary answer text set for the next round of target answer text screening. In fact, screening the target answer text from unlabeled answer texts reduces the probability that a strongly correlated question text and answer text form a question-answer sample, which lowers the correlation between the target question text and the target answer text in the question-answer sample and improves the quality of the question-answer sample.

A new auxiliary answer text is acquired from the auxiliary unlabeled answer set to replace the target answer text in the auxiliary answer text set, so that the auxiliary answer text set is refilled after each negative question-answer sample is formed. Updating the auxiliary answer texts prevents multiple target question texts from selecting the same auxiliary answer text as the target answer text, which would produce negative question-answer samples with too much repeated content, reduce their representativeness, and thus lower the accuracy of the trained model.

K auxiliary answer texts can be screened from the auxiliary unlabeled answer set to form the auxiliary answer text set. Each time a question-answer sample is generated, the target answer text in that sample is popped from the auxiliary answer text set, and an answer text is selected from the auxiliary unlabeled answer set and added to the auxiliary answer text set in its place. The updated auxiliary answer text set still contains K auxiliary answers. K is a hyperparameter that can be configured by the user. The value of K determines the correlation between the target question text and the target answer text in the negative question-answer sample: generally, the larger K, the greater the correlation; the smaller K, the smaller the correlation. This can be understood as follows. Only a small number of auxiliary answer texts are related to a given target question text. The more auxiliary answers the auxiliary answer text set contains, the higher the overall correlation between the target question text and the set, and the higher the probability of retrieving an auxiliary answer text related to the target question text, so the higher the correlation between the target question text and the screened target answer text. Correspondingly, the fewer auxiliary answers the set contains, the lower the overall correlation and the lower the probability of retrieving a related auxiliary answer text, so the lower the correlation between the target question text and the screened target answer text.

Optionally, the auxiliary unlabeled answer set is obtained from an original unlabeled answer set.

The original unlabeled answer set may be a set of answer texts collected from at least one community question-answering platform. The auxiliary unlabeled answer set is obtained from the original unlabeled answer set. In fact, the number of answer texts obtained from multiple community question-answering platforms is very large. If answer texts were selected directly from the original unlabeled answer set to form the auxiliary answer text set, random screening would have to be performed over a large amount of data, at a very high screening cost. By acquiring the auxiliary unlabeled answer set from the original unlabeled answer set, and generating and updating the auxiliary answer text set from the auxiliary unlabeled answer set, the amount of source data to be screened is reduced, which lowers the screening cost and improves the generation efficiency of question-answer samples.

In addition, the target question text is obtained from the original unlabeled question set. The original unlabeled question set may be a set of question texts collected from at least one community question-answering platform. Question texts can be selected sequentially from the original unlabeled question set and determined as target question texts. The original unlabeled question set includes fewer question texts than the original unlabeled answer set includes answer texts.

In a specific example, as shown in FIG. 3, the negative question-answer samples are formed by the following steps:

(1) Randomly extract m answer texts and n question texts from a database as the answer data source (namely, the auxiliary unlabeled answer set) and the question data source (namely, the original unlabeled question set), respectively, where m is larger than n so that at least one answer text can ultimately be screened out for each question text to form a negative question-answer sample. The database is formed by collecting data from community question-answering platforms and includes the original unlabeled answer set and the original unlabeled question set.

(2) Randomly pop K answer texts from the answer data source (the auxiliary unlabeled answer set) and put them into an answer matching pool, that is, generate the auxiliary answer text set.

(3) Randomly pop one question text A from the question data source, and calculate its similarity with every answer text in the answer matching pool.

(4) Determine a target answer text B according to the similarity, pop answer B from the answer matching pool, and form a question-answer pair from question A and answer B.

(5) Output the question-answer pair AB into the generated dataset.

(6) Randomly pop one answer from the answer data source (the auxiliary unlabeled answer set) to replenish the answer matching pool (the auxiliary answer text set), so that the pool again holds K answers.

Steps (3) to (6) are repeated until enough generated data is obtained.
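Illustratively, steps (2) to (6) above can be sketched as a loop over an answer matching pool. This is a minimal sketch under stated assumptions: the function names `generate_negative_samples` and `word_overlap`, the fixed random seed, the stopping rule, and the toy data are hypothetical, and the word-overlap similarity is only a placeholder for whichever similarity measure is actually used. The highest-similarity answer is chosen here, which is one of the options the description mentions.

```python
import random


def generate_negative_samples(questions, answers, k, similarity, num_samples, seed=0):
    """Pair each popped question with the most similar answer in a pool of size k."""
    rng = random.Random(seed)
    # Copy and shuffle so that list.pop() is a random draw without replacement.
    question_source = list(questions)
    answer_source = list(answers)
    rng.shuffle(question_source)
    rng.shuffle(answer_source)

    # Step (2): fill the answer matching pool with K answers.
    pool = [answer_source.pop() for _ in range(min(k, len(answer_source)))]
    samples = []
    while len(samples) < num_samples and question_source and pool:
        question = question_source.pop()  # Step (3): pop one question A.
        best = max(range(len(pool)), key=lambda i: similarity(question, pool[i]))
        answer = pool.pop(best)  # Step (4): pop target answer B from the pool.
        samples.append((question, answer))  # Step (5): output the pair AB.
        if answer_source:  # Step (6): replenish the pool back to K answers.
            pool.append(answer_source.pop())
    return samples


def word_overlap(a, b):
    # Placeholder similarity (an assumption): Jaccard overlap of lowercase words.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


questions = ["are apples tasty", "is the museum open today"]
answers = [
    "pineapples are tasty",
    "the museum shop sells postcards",
    "i like music",
    "trains leave at noon",
    "pears are delicious",
]
samples = generate_negative_samples(questions, answers, k=3,
                                    similarity=word_overlap, num_samples=2)
```

Because every draw removes the item from its source, no answer is paired with more than one question, and the pool size stays at K as long as the answer data source is not exhausted.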

The answer text with the highest similarity can be popped as the target answer text B, which strengthens the ability to discriminate literally similar texts and improves the recognition of deep semantics. Meanwhile, the degree of similarity between question and answer can be controlled through the size of K; when K is too large, there is a small probability of matching the correct answer. Data popped from the answer data source, the question data source, and the answer matching pool is not put back.

According to the technical scheme of this embodiment, a new auxiliary answer text is obtained from the auxiliary unlabeled answer set to replace the target answer text and update the auxiliary answer text set. This reduces the probability that a strongly correlated question text and answer text form a question-answer sample, lowers the correlation between the target question text and the target answer text in the question-answer sample, and improves the quality of the question-answer sample. It also prevents multiple target question texts from selecting the same auxiliary answer text as the target answer text, which improves the representativeness of the question-answer samples and the accuracy of the trained model.

Fig. 4 is a flowchart of another method for generating a question and answer sample according to an embodiment of the present application, which is further optimized and expanded based on the above technical solution, and can be combined with the above optional embodiments. Selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set, which is embodied as follows: respectively calculating similarity values between the target question text and at least two auxiliary answer texts in the auxiliary answer text set; wherein the similarity value comprises a literal similarity value and/or a grammatical structure similarity value; and selecting a target answer text for the target question text according to the similarity value between the target question text and each auxiliary answer text.

S301, acquiring a target question text and an auxiliary answer text set.

S302, respectively calculating similarity values between the target question text and at least two auxiliary answer texts in the auxiliary answer text set; wherein the similarity value comprises a literal similarity value and/or a grammatical structure similarity value.

Literal similarity is used to evaluate whether texts are similar on the surface. Generally, literal similarity refers to similarity of the text itself and is independent of semantics: two texts can be considered similar as long as their wording is nearly the same, even if their semantics differ. Grammar structure similarity may refer to the similarity of the grammatical structures of the texts. When the similarity value includes both a literal similarity value and a grammar structure similarity value, the similarity value may equal a weighted sum of the two, or may equal the product of a power of the literal similarity value and a power of the grammar structure similarity value, where the weights in the weighted sum and the exponents can be set as required.
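Illustratively, the two ways of combining the literal and grammar structure similarity values can be sketched as follows. This is a minimal sketch, assuming both values lie in [0, 1]; the function names and default weights are hypothetical, and reading the product form as a product of powers is an interpretation of the description rather than a confirmed formula.

```python
def weighted_sum_similarity(literal, grammar, w_literal=0.5, w_grammar=0.5):
    # Weighted sum; the weights are configurable and 0.5/0.5 is only a placeholder.
    return w_literal * literal + w_grammar * grammar


def power_product_similarity(literal, grammar, p_literal=1.0, p_grammar=1.0):
    # Product of powers; larger exponents penalize low values of that component more.
    return (literal ** p_literal) * (grammar ** p_grammar)


combined_sum = weighted_sum_similarity(0.8, 0.4)
combined_product = power_product_similarity(0.5, 0.5, p_literal=2.0, p_grammar=1.0)
```

With a weighted sum, a high literal similarity can compensate for a low grammar structure similarity; with a product, a near-zero value in either component pulls the combined value toward zero.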

Calculating the grammar structure similarity value between the target question text and the auxiliary answer text may include: performing syntax parsing on the target question text and the auxiliary answer text respectively, to generate a grammar structure character sequence corresponding to the target question text and a grammar structure character sequence corresponding to the auxiliary answer text; and calculating the similarity between the grammar structure character sequence corresponding to the target question text and the grammar structure character sequence corresponding to the auxiliary answer text.

Syntax parsing generates the grammar structure character sequence; the parsing may be performed, for example, with a syntax parser generator developed in Java. For two grammar structure character sequences, the similarity value can be calculated by a method such as TF-IDF, or a word embedding network for grammar structures can be trained with an encoder-based method and the cosine similarity then calculated. The syntax parsing may include analysis of parts of speech, dependency relationships, semantic roles, and the like, whose results are fused to obtain the grammar structure character sequence. For example, the part-of-speech character string, the dependency character string, and the semantic role character string are concatenated to form the grammar structure character sequence, where concatenation may use a preset spacer such as a space, a hyphen, or another designated character. Illustratively, the question text is "Who is Xiaoming's wife", segmented as "who", "is", "Xiaoming", "of" (the possessive particle), and "wife". Here "who" is a pronoun, denoted by r; "is" is a verb, denoted by v; "Xiaoming" is a person's name, denoted by nr; "of" is a particle that usually follows a noun, denoted by u; and "wife" is a noun, denoted by n. The part-of-speech character string may be rvnrun. The dependency between "who" and "is" is a subject-predicate relationship, denoted by SBV; the dependency between "is" and "wife" is a verb-object relationship, denoted by VOB; the dependency between "Xiaoming" and "of" is a "de" (possessive particle) structure, denoted by DE; and the dependency between "of" and "wife" is likewise a "de" structure, denoted by DE. The corresponding dependency character string is sbvvobded. The grammar structure character sequence may be the part-of-speech character string and the dependency character string joined by a hyphen, such as rvnrun-sbvvobded. Other representations are also possible and can be set as required.
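Illustratively, building a grammar structure character sequence and comparing two such sequences can be sketched as follows. This is a minimal sketch under stated assumptions: the tags are taken as already produced by some parser (no parser is invoked), the function names `grammar_sequence` and `bigram_cosine` are hypothetical, the tags for the answer text are invented for illustration, and cosine similarity over character-bigram counts is a simple stand-in for the TF-IDF or embedding-based similarity mentioned in the description.

```python
import math
from collections import Counter


def grammar_sequence(pos_tags, dependency_labels, spacer="-"):
    """Concatenate lowercase tag strings, joined by a preset spacer."""
    pos_string = "".join(tag.lower() for tag in pos_tags)
    dep_string = "".join(label.lower() for label in dependency_labels)
    return pos_string + spacer + dep_string


def bigram_cosine(seq_a, seq_b):
    """Cosine similarity between character-bigram count vectors."""
    def bigrams(s):
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    a, b = bigrams(seq_a), bigrams(seq_b)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Tags for the question "Who is Xiaoming's wife", as listed in the description.
question_seq = grammar_sequence(["r", "v", "nr", "u", "n"], ["SBV", "VOB", "DE", "DE"])
# Hypothetical tags for some auxiliary answer text (not from the description).
answer_seq = grammar_sequence(["nr", "v", "n"], ["SBV", "VOB"])
grammar_similarity = bigram_cosine(question_seq, answer_seq)
```

The spacer keeps the part-of-speech portion and the dependency portion distinguishable within one sequence, so a single string comparison covers both kinds of structure.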

S303, selecting a target answer text for the target question text according to the similarity value between the target question text and each auxiliary answer text to obtain a negative question-answer sample comprising the target question text and the target answer text.

Optionally, the selecting a target answer text for the target question text according to the similarity value between the target question text and each of the auxiliary answer texts includes: dividing the similarity value between the target question text and each auxiliary answer text into at least two similarity intervals; the number of the similarity intervals is the same as the number of target answer texts to be selected; and selecting at least two target answer texts for the target question texts according to the at least two similarity intervals.

The similarity intervals are used to partition the similarity values so that multiple target answer texts can be screened. The number of similarity intervals is the same as the number of target answer texts to be selected, which means that the similarity values are partitioned according to the number of target answer texts to be selected, yielding that many similarity intervals.

For example, one auxiliary answer text may be selected in each similarity interval, yielding as many target answer texts as there are similarity intervals. Alternatively, according to the distribution of the similarity values, at least two auxiliary answer texts may be selected in a similarity interval where the similarity values are densely distributed, and none may be selected in a similarity interval where they are sparsely distributed; this can be set as required.

In a specific example, the similarity value ranges from 0 to 1, and the similarity intervals may be divided as 0-0.1, 0.1-0.2, …, 0.9-1.0, or divided evenly at any other granularity, which can be set according to the application requirement. The number of target answer texts selected from each similarity interval may be none, one, or more than one.

By dividing the similarity values into a plurality of similarity intervals and selecting a plurality of target answer texts from those intervals, question-answer samples with different similarities can be generated, which improves the diversity and representativeness of the question-answer samples and thereby the accuracy of question-answer correlation detection.
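As a non-limiting illustration, the interval-based selection described above might be sketched as follows. The sketch assumes equal-width intervals over [0, 1] and the "one answer per interval" strategy; the function names `split_into_intervals` and `select_one_per_interval`, and the use of random choice within an interval, are hypothetical and are not specified by the application.

```python
import random
from typing import Dict, List, Tuple

Scored = Tuple[str, float]  # (auxiliary answer text, similarity value in [0, 1])


def split_into_intervals(scored: List[Scored], num_intervals: int) -> Dict[int, List[Scored]]:
    """Group scored answers into equal-width similarity intervals over [0, 1]."""
    buckets: Dict[int, List[Scored]] = {i: [] for i in range(num_intervals)}
    for answer, sim in scored:
        # A similarity of exactly 1.0 is clamped into the last interval.
        idx = min(int(sim * num_intervals), num_intervals - 1)
        buckets[idx].append((answer, sim))
    return buckets


def select_one_per_interval(
    question: str, scored: List[Scored], num_targets: int, rng: random.Random
) -> List[Tuple[str, str]]:
    """Pick one answer from each non-empty interval to form negative samples."""
    buckets = split_into_intervals(scored, num_targets)
    samples = []
    for idx in sorted(buckets):
        if buckets[idx]:  # empty intervals contribute no sample (assumption)
            answer, _ = rng.choice(buckets[idx])
            samples.append((question, answer))
    return samples
```

With the number of intervals set equal to the number of target answer texts to be selected, each non-empty interval yields one negative question-answer sample, so the samples span low to high similarity.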

According to this technical scheme, the literal similarity and/or the grammatical structure similarity between each auxiliary answer text and the target question text is calculated. Combining literal and grammatical measures allows the similarity between answers and questions to be evaluated along multiple dimensions, so that question-answer samples with different similarities are generated, which improves the diversity and representativeness of the question-answer samples and thereby the accuracy of question-answer correlation detection.
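The application does not fix a particular formula for either similarity. Purely as an assumed example, the literal similarity might be approximated by the Jaccard overlap of character bigrams, and the grammatical structure similarity by a sequence-matching ratio over part-of-speech tag sequences supplied by some external tagger. The weighted average used to combine the two values, and all function names below, are likewise hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Set


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of a text."""
    text = text.strip()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def literal_similarity(question: str, answer: str, n: int = 2) -> float:
    """Jaccard overlap of character n-grams (an assumed literal measure)."""
    gq, ga = char_ngrams(question, n), char_ngrams(answer, n)
    if not gq or not ga:
        return 0.0
    return len(gq & ga) / len(gq | ga)


def structure_similarity(q_tags: Sequence[str], a_tags: Sequence[str]) -> float:
    """Matching ratio of two part-of-speech tag sequences (an assumed measure)."""
    if not q_tags or not a_tags:
        return 0.0
    return SequenceMatcher(None, list(q_tags), list(a_tags)).ratio()


def combined_similarity(
    question: str,
    answer: str,
    q_tags: Optional[Sequence[str]] = None,
    a_tags: Optional[Sequence[str]] = None,
    literal_weight: float = 0.5,
) -> float:
    """Use the literal value alone, or a weighted average when tags are given."""
    lit = literal_similarity(question, answer)
    if q_tags is None or a_tags is None:
        return lit
    syn = structure_similarity(q_tags, a_tags)
    return literal_weight * lit + (1.0 - literal_weight) * syn
```

Both measures return values in [0, 1], so the combined value can be fed directly into a division into similarity intervals.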

Fig. 5 is a flowchart of another question-answer sample generation method disclosed in an embodiment of the present application, which is further optimized and expanded on the basis of the above technical solution and can be combined with the above optional embodiments. After the negative question-answer sample comprising the target question text and the target answer text is obtained, the method further includes: obtaining a positive question-answer sample; and training a question-answer correlation detection model using the generated negative question-answer sample together with the positive question-answer sample.

S401, obtaining a target question text and an auxiliary answer text set.

S402, selecting a target answer text for the target question text according to the similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set to obtain a question-answer sample comprising the target question text and the target answer text.

S403, obtaining a positive question-answer sample.

A positive question-answer sample is the counterpart of a negative question-answer sample: it is a question-answer pair formed by a question text and an answer text that are relevant to each other.

The positive question-answer sample can be collected through manual labeling, or can be obtained by screening a community question-answering platform according to posterior data. For example, the posterior data may include at least one of: authority information of the user who wrote the answer text, answers given to the asker's follow-up questions, whether the answer text is original content, and interaction statistics of the answer text. The interaction statistics of the answer text may include statistics on at least one of comment data, like data, and the like. An answer text judged reliable according to the posterior data is paired with the question text to which it belongs, and the resulting question-answer pair is used as a positive question-answer sample.
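By way of a hedged example only, screening positive samples from posterior data might look like the sketch below. The record fields, the thresholds `min_likes` and `min_comments`, and the reliability rule itself are all assumptions introduced for illustration; the application states only which kinds of posterior data may be considered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnswerRecord:
    """Hypothetical record of one answer and its posterior data."""
    question: str
    answer: str
    author_is_authoritative: bool
    is_original: bool
    like_count: int
    comment_count: int


def is_reliable(record: AnswerRecord, min_likes: int = 10, min_comments: int = 3) -> bool:
    """Assumed rule: original content from an authoritative or well-received author."""
    engaged = record.like_count >= min_likes or record.comment_count >= min_comments
    return record.is_original and (record.author_is_authoritative or engaged)


def screen_positive_samples(
    records: List[AnswerRecord], min_likes: int = 10, min_comments: int = 3
) -> List[Tuple[str, str]]:
    """Pair each reliable answer with the question it belongs to."""
    return [
        (r.question, r.answer)
        for r in records
        if is_reliable(r, min_likes, min_comments)
    ]
```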

S404, training a question-answer correlation detection model by adopting the generated negative question-answer sample and the positive question-answer sample together.
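One possible way to prepare the joint training data is sketched below, assuming the label 1 marks a relevant (positive) pair and the label 0 marks an irrelevant (negative) pair. The function name `build_training_set` and the shuffling step are illustrative assumptions, not part of the described method.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]  # (question text, answer text)


def build_training_set(
    positives: List[Pair], negatives: List[Pair], seed: int = 0
) -> List[Tuple[str, str, int]]:
    """Label positive pairs 1 and negative pairs 0, then shuffle them together."""
    labeled = [(q, a, 1) for q, a in positives] + [(q, a, 0) for q, a in negatives]
    random.Random(seed).shuffle(labeled)  # mix classes before training (assumption)
    return labeled
```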

The question-answer correlation detection model is used for detecting the correlation between a question and an answer. The question-answer correlation detection model may include semantic and syntactic parsing layers and a convolutional neural network, where the convolutional neural network may be replaced with a recurrent neural network or other neural network.

Optionally, training the question-answer correlation detection model includes: parsing question-answer samples to form semantic information and grammatical information, wherein the grammatical information comprises part-of-speech information and/or dependency relationship information, and the question-answer samples comprise the positive question-answer sample and the negative question-answer sample; and training the question-answer correlation detection model according to the semantic information and the grammatical information.

As shown in fig. 6, the question-answer correlation detection model may include a parsing layer (parse), an encoding layer, an embedding layer, a convolutional layer (which may include a plurality of convolution operations), a pooling layer, a splicing layer, a neural network (fully connected layers), and the like, wherein the neural network may adopt the aforementioned convolutional neural network, a recurrent neural network, or the like. The question-answer sample is parsed by the parsing layer to obtain semantic information and grammatical information, wherein the grammatical information comprises part-of-speech information and/or dependency relationship information. The three kinds of information are each encoded, embedded, convolved and pooled to obtain three corresponding vectors, which are spliced into a single vector; this vector is then processed by the fully connected layers to obtain a correlation detection result.

Accordingly, as shown in fig. 7, the question-answer pair is parsed into semantic information, part-of-speech information, and dependency relationship information, and each kind of information is compressed. The information compression is as shown in fig. 6: the information is mapped into vectors through operations such as the parsing layer, the encoding layer, the embedding layer, the convolutional layer and the pooling layer; the three vectors obtained by compressing the three paths of information are spliced into one vector; and a correlation detection result is obtained from this vector by the neural network. The target question text and the target answer text are parsed separately, for example into a part-of-speech information list (verbs, nouns, adjectives and the like), a dependency relationship list (verb-object relationship, subject-predicate relationship, association structure and the like) and a semantic information list. The information compression can adopt a convolutional neural network. The semantic information list, the part-of-speech information list, and the dependency relationship list are encoded separately, for which one-hot encoding may be used. An embedding operation is then performed on the encoding result, that is, the sparse matrix formed by the one-hot encoding is compressed into a dense matrix, which amounts to data dimension reduction. The convolution and pooling may compress the embedded information into a semantic vector, a part-of-speech vector and a dependency relationship vector through a plurality of one-dimensional convolutions and one-dimensional pooling operations. After information compression, the semantic vector, the part-of-speech vector and the dependency relationship vector are spliced to form a new vector. The spliced vector then passes through a plurality of fully connected layers to obtain a 0/1 classification, that is, a correlation detection result indicating whether the question and the answer are correlated, where 0 may mean uncorrelated and 1 may mean correlated.
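The data flow of figs. 6 and 7 can be illustrated with a deliberately small forward pass in plain Python. This is a sketch under several assumptions, not the disclosed model: weights are random rather than trained, each channel treats the parsed question-answer pair as a single token sequence, a single sigmoid unit stands in for the plurality of fully connected layers, and every function name is hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Channel = Tuple[Dict[str, int], Matrix, List[Matrix]]  # vocab, embedding, kernels


def one_hot(tokens: Sequence[str], vocab: Dict[str, int]) -> Matrix:
    """Encode each token as a sparse one-hot row over the vocabulary."""
    return [[1.0 if vocab[t] == j else 0.0 for j in range(len(vocab))] for t in tokens]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv1d_relu(seq: Matrix, kernel: Matrix) -> List[float]:
    """Valid one-dimensional convolution along the sequence, followed by ReLU."""
    width = len(kernel)
    return [
        max(0.0, sum(seq[i + k][d] * kernel[k][d]
                     for k in range(width) for d in range(len(kernel[k]))))
        for i in range(len(seq) - width + 1)
    ]


def compress(tokens: Sequence[str], channel: Channel) -> List[float]:
    """One-hot -> embedding (dense matrix) -> convolutions -> global max pooling."""
    vocab, embedding, kernels = channel
    dense = matmul(one_hot(tokens, vocab), embedding)
    pooled = []
    for kernel in kernels:
        feature_map = conv1d_relu(dense, kernel)
        pooled.append(max(feature_map) if feature_map else 0.0)
    return pooled


def build_channel(symbols: Sequence[str], embed_dim: int, num_kernels: int,
                  kernel_width: int, rng: random.Random) -> Channel:
    """Create a vocabulary with randomly initialised embedding and kernels."""
    def rand(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    vocab = {s: i for i, s in enumerate(symbols)}
    return vocab, rand(len(vocab), embed_dim), [rand(kernel_width, embed_dim)
                                                for _ in range(num_kernels)]


def splice(parsed: Dict[str, Sequence[str]], channels: Dict[str, Channel]) -> List[float]:
    """Concatenate the semantic, part-of-speech and dependency vectors."""
    vector: List[float] = []
    for name in ("semantic", "pos", "dep"):
        vector.extend(compress(parsed[name], channels[name]))
    return vector


def classify(features: Sequence[float], weights: Sequence[float], bias: float) -> int:
    """Single sigmoid unit: 1 means correlated, 0 means uncorrelated."""
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1 if 1.0 / (1.0 + math.exp(-logit)) >= 0.5 else 0
```

With three kernels per channel, `splice` returns a nine-element vector; in a trained model the embeddings, kernels and classifier weights would be learned from the positive and negative question-answer samples.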

By combining the grammatical features and the semantic features to detect the relevance of the question and answer, the feature information can be enriched, and the accuracy rate of relevance detection is improved.

According to this technical scheme, positive question-answer samples are obtained and combined with the generated negative question-answer samples, which are highly representative and efficiently generated, to train the model. This improves the training efficiency of the question-answer correlation detection model, reduces the labor cost of model training, and improves the accuracy of question-answer correlation detection.

Fig. 8 is a block diagram of a question-answer sample generation device according to an embodiment of the present application, which is applicable to the case of generating question-answer samples. The device is implemented by software and/or hardware and is specifically configured in an electronic device with a certain data computing capability.

A question-answer sample generation apparatus 500 shown in fig. 8 includes a question-answer source data obtaining module 501 and a target answer text screening module 502, wherein:

a question-answer source data obtaining module 501, configured to obtain a target question text and an auxiliary answer text set;

a target answer text screening module 502, configured to select a target answer text for the target question text according to a similarity between the target question text and at least two auxiliary answer texts in the auxiliary answer text set, so as to obtain a negative question-answer sample including the target question text and the target answer text.

Further, the question-answer sample generation device also includes: an auxiliary answer text updating module, configured to, when a target answer text has been selected for the target question text, acquire a new auxiliary answer text from an auxiliary unlabeled answer set and replace the selected target answer text with the new auxiliary answer text, so as to update the auxiliary answer text set for generating a new question-answer sample.

Further, the auxiliary unlabeled answer set is obtained from the original unlabeled answer set.
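The updating behaviour of the auxiliary answer text updating module might be sketched as follows. The function name `refresh_auxiliary_set`, the use of a queue for the auxiliary unlabeled answer set, and the choice to shrink the set when the queue is exhausted are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


def refresh_auxiliary_set(
    auxiliary: List[str], selected: List[str], unlabeled_pool: Deque[str]
) -> List[str]:
    """Replace each selected target answer with a fresh unlabeled answer."""
    updated = list(auxiliary)  # leave the caller's list unchanged
    for answer in selected:
        if answer not in updated:
            continue
        idx = updated.index(answer)
        if unlabeled_pool:
            # Swap the used answer for the next answer from the unlabeled set.
            updated[idx] = unlabeled_pool.popleft()
        else:
            # Assumption: with no replacements left, the set simply shrinks.
            del updated[idx]
    return updated
```

Because used answers leave the auxiliary answer text set, the next target question text is matched against answers that have not yet been paired, which avoids producing duplicate negative samples.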

Further, the target answer text screening module 502 includes: a similarity value obtaining unit, configured to calculate similarity values between the target question text and at least two auxiliary answer texts in the auxiliary answer text set respectively, wherein the similarity value comprises a literal similarity value and/or a grammatical structure similarity value; and a target answer text determining unit, configured to select a target answer text for the target question text according to the similarity value between the target question text and each auxiliary answer text.

Further, the similarity value obtaining unit includes: a similarity interval dividing subunit, configured to divide similarity values between the target question text and each of the auxiliary answer texts into at least two similarity intervals; the number of the similarity intervals is the same as the number of target answer texts to be selected; and the target answer text partition obtaining subunit is used for selecting at least two target answer texts for the target question text according to the at least two similarity intervals.

Further, the question-answer sample generation device further includes: a positive question-answer sample obtaining module, configured to obtain a positive question-answer sample when a negative question-answer sample including the target question text and the target answer text is obtained; and the model training module is used for training the question-answer correlation detection model by adopting the generated negative question-answer sample and the positive question-answer sample together.

Further, the model training module includes: the question-answer sample analyzing unit is used for analyzing question-answer samples to form semantic information and grammar information, wherein the grammar information comprises part-of-speech information and/or dependency relationship information, and the question-answer samples comprise the positive question-answer samples and the negative question-answer samples; and the semantic grammar fusion training unit is used for training a question-answer correlation detection model according to the semantic information and the grammar information.

The question-answer sample generation device can execute the question-answer sample generation method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing that method.

There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.

FIG. 9 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.

As shown in fig. 9, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the question-answer sample generation method. For example, in some embodiments, the question-answer sample generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the question-answer sample generation method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the question-answer sample generation method in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this application, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical host and VPS service are overcome.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims

1. A question-answer sample generation method comprises the following steps:

acquiring a target question text and an auxiliary answer text set;

2. The method of claim 1, in a case where a target answer text is selected for the target question text, further comprising:

acquiring a new auxiliary answer text from an auxiliary unlabeled answer set, and replacing the selected target answer text with the new auxiliary answer text to update the auxiliary answer text set for generating a new question-answer sample.

3. The method of claim 1, wherein the auxiliary unlabeled answer set is obtained from an original unlabeled answer set.

4. The method of claim 1, wherein the selecting a target answer text for the target question text according to a similarity between the target question text and at least two auxiliary answer texts in the set of auxiliary answer texts comprises:

respectively calculating similarity values between the target question text and at least two auxiliary answer texts in the auxiliary answer text set; wherein the similarity value comprises a literal similarity value and/or a grammatical structure similarity value;

and selecting a target answer text for the target question text according to the similarity value between the target question text and each auxiliary answer text.

5. The method according to claim 4, wherein selecting a target answer text for the target question text according to the similarity value between the target question text and each of the auxiliary answer texts comprises:

dividing the similarity value between the target question text and each auxiliary answer text into at least two similarity intervals; the number of the similarity intervals is the same as the number of target answer texts to be selected;

and selecting at least two target answer texts for the target question text according to the at least two similarity intervals.

6. The method according to claim 1, in case of obtaining a negative question-and-answer sample including the target question text and the target answer text, further comprising:

obtaining a positive question-answer sample;

and training a question-answer correlation detection model by adopting the generated negative question-answer sample and the positive question-answer sample together.

7. The method of claim 6, wherein the training of the question-answer correlation detection model comprises:

analyzing a question-answer sample to form semantic information and grammatical information, wherein the grammatical information comprises part-of-speech information and/or dependency relationship information, and the question-answer sample comprises the positive question-answer sample and the negative question-answer sample;

and training a question-answer correlation detection model according to the semantic information and the grammatical information.

8. A question-answer sample generation apparatus comprising:

9. The apparatus of claim 8, further comprising:

and the auxiliary answer text updating module is used for acquiring a new auxiliary answer text from an auxiliary unlabeled answer set under the condition that a target answer text is selected for the target question text, and replacing the selected target answer text with the new auxiliary answer text to update the auxiliary answer text set so as to generate a new question-answer sample.

10. The apparatus according to claim 8, wherein the auxiliary unlabeled answer set is obtained from an original unlabeled answer set.

11. The apparatus of claim 8, wherein the target answer text screening module comprises:

a similarity value obtaining unit, configured to calculate similarity values between the target question text and at least two auxiliary answer texts in the auxiliary answer text set respectively; wherein the similarity value comprises a literal similarity value and/or a grammatical structure similarity value;

and the target answer text determining unit is used for selecting a target answer text for the target question text according to the similarity value between the target question text and each auxiliary answer text.

12. The apparatus according to claim 11, wherein the similarity value obtaining unit includes:

a similarity interval dividing subunit, configured to divide similarity values between the target question text and each of the auxiliary answer texts into at least two similarity intervals; the number of the similarity intervals is the same as the number of target answer texts to be selected;

and the target answer text partition obtaining subunit is used for selecting at least two target answer texts for the target question text according to the at least two similarity intervals.

13. The apparatus of claim 8, further comprising:

a positive question-answer sample obtaining module, configured to obtain a positive question-answer sample when a negative question-answer sample including the target question text and the target answer text is obtained;

and the model training module is used for training the question-answer correlation detection model by adopting the generated negative question-answer sample and the positive question-answer sample together.

14. The apparatus of claim 13, wherein the model training module comprises:

the question-answer sample analyzing unit is used for analyzing question-answer samples to form semantic information and grammar information, wherein the grammar information comprises part-of-speech information and/or dependency relationship information, and the question-answer samples comprise the positive question-answer samples and the negative question-answer samples;

and the semantic grammar fusion training unit is used for training a question-answer correlation detection model according to the semantic information and the grammar information.

15. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the question-answer sample generation method of any one of claims 1-7.

16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the question and answer sample generation method according to any one of claims 1 to 7.

17. A computer program product comprising a computer program which, when executed by a processor, implements the question-answer sample generation method according to any one of claims 1 to 7.