CN114861625A - Method for obtaining target training sample, electronic device and medium - Google Patents


Info

Publication number
CN114861625A
Authority
CN
China
Prior art keywords: sentence, statement, original, sentences, initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210586079.5A
Other languages
Chinese (zh)
Inventor
韩佳
杜新凯
吕超
谷姗姗
张晗
史辉
李文灏
孙垚锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sunshine Insurance Group Co Ltd
Original Assignee
Sunshine Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sunshine Insurance Group Co Ltd
Priority to CN202210586079.5A
Publication of CN114861625A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis


Abstract

Embodiments of the present application provide a method, an electronic device, and a medium for obtaining a target training sample. The method includes: acquiring a plurality of original sentence sets, where each original sentence set includes an original sentence, a first sentence set whose similarity to the original sentence is greater than a threshold, and a second sentence set whose similarity to the original sentence is less than the threshold; calculating a difference value between every two sentences in each original sentence set, where the difference value represents the degree of difference between the two sentences; and determining, based on the difference values, whether to delete at least part of the plurality of original sentence sets, to obtain a target training sample. Through some embodiments of the present application, target training samples that are more valuable for model training can be obtained from the plurality of original sentence sets.

Description

Method for obtaining target training sample, electronic device and medium
Technical Field
The embodiments of the present application relate to the field of sample data selection, and in particular to a method, an electronic device, and a medium for obtaining a target training sample.
Background
In the field of semantic recognition, both classification tasks and ranking tasks need to complete training of a neural network model by means of sample data (or training data).
Sample data in the prior art is generally divided into positive samples and negative samples. If inappropriate negative samples (for example, overly simple ones) are selected to train a model, the model cannot learn to distinguish sample data accurately. In the related art, negative samples are typically used to generate difficult negative samples, and the neural network model is trained with those difficult negative samples. However, the difficult negative samples generated in this way often do not conform to conventional expression habits, so the accuracy of the neural network model trained on them is not high.
Therefore, how to improve the quality of sample data so that the trained neural network model produces more accurate task results has become a problem to be solved.
Disclosure of Invention
Embodiments of the present application provide a method, an electronic device, and a medium for obtaining a target training sample. By calculating a difference value between every two sentences, some embodiments of the present application can obtain, from a plurality of original sentence sets, a target training sample that is more valuable for model training.
In a first aspect, the present application provides a method of obtaining a target training sample, the method comprising: acquiring a plurality of original sentence sets, wherein each original sentence set in the plurality of original sentence sets comprises an original sentence, a first sentence set and a second sentence set, the similarity between the first sentence set and the original sentence is greater than a threshold value, and the similarity between the second sentence set and the original sentence is less than the threshold value; calculating a difference value between every two sentences in each original sentence set, wherein the difference value represents the degree of difference between the two sentences, and in each pair either one sentence is the original sentence and the other is any sentence in the first sentence set, or one sentence is the original sentence and the other is any sentence in the second sentence set, or one sentence is any sentence in the first sentence set and the other is any sentence in the second sentence set; and determining whether to delete at least part of the plurality of original sentence sets based on the difference value, to obtain a target training sample.
Therefore, unlike the related-art approach of directly generating difficult negative samples, the embodiment of the present application screens a target training sample set out of a plurality of original sentence sets by calculating the degree of difference between sentences; this yields training data that is more valuable for model training and can improve the semantic recognition capability of the model.
With reference to the first aspect, in an embodiment of the present application, calculating the difference value between every two sentences in each original sentence set includes: calculating a first edit distance between the original sentence and each sentence in the second sentence set to obtain a plurality of first difference values, wherein the first edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into the original sentence; and calculating a second edit distance between each sentence in the first sentence set and each sentence in the second sentence set to obtain a plurality of second difference values, wherein the second edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into each sentence in the first sentence set.
Therefore, by calculating edit distances, the embodiment of the present application can quantify the differences between the original sentence and the second sentence set and between the first sentence set and the second sentence set, so that target training data valuable to the model can be screened out according to these differences.
With reference to the first aspect, in one implementation of the present application, the first sentence set includes an ith sentence, the second sentence set includes a jth sentence, and i and j are integers greater than or equal to 1; determining whether to delete at least part of the plurality of original sentence sets based on the difference value, to obtain a target training sample, includes: selecting the smaller value between the first difference value corresponding to the ith sentence and the second difference value corresponding to the jth sentence; and determining whether to delete at least part of the plurality of original sentence sets based on the smaller value, to obtain a target training sample.
With reference to the first aspect, in an embodiment of the present application, calculating the difference value between every two sentences in each original sentence set includes: calculating a third edit distance between the original sentence and each sentence in the first sentence set to obtain a plurality of third difference values, wherein the third edit distance is determined by the number of edit operations required to rewrite each sentence in the first sentence set into the original sentence; and determining whether to delete at least part of the plurality of original sentence sets based on the smaller value includes: retaining the ith sentence, the jth sentence and the original sentence if the third difference value corresponding to the ith sentence is greater than or equal to the smaller value; and deleting the ith sentence and the jth sentence if the third difference value corresponding to the ith sentence is less than the smaller value.
Therefore, by comparing the third difference value with the smaller value, the embodiment of the present application can accurately identify the sentence groups with smaller difference values, so as to obtain valuable difficult negative samples.
With reference to the first aspect, in an embodiment of the present application, before acquiring the plurality of original sentence sets, the method further includes: acquiring an initial sentence set, wherein the initial sentence set includes a plurality of initial sentences and sentences to be screened, and the plurality of initial sentences are a plurality of standard question sentences corresponding to one answer; and classifying the initial sentence set according to the similarity between the sentences to be screened and the plurality of initial sentences, to obtain the plurality of original sentence sets.
Therefore, by classifying the initial sentence set, the embodiment of the present application can divide it into a first sentence set (i.e., a set of sentences similar to the original sentence) and a second sentence set (i.e., a set of sentences dissimilar to the original sentence), so that more valuable sample data can be selected from the two sets.
With reference to the first aspect, in an embodiment of the present application, after acquiring the initial sentence set, the method further includes: screening, from the initial sentence set, a plurality of recall sentences whose similarity to a question input by a user is higher than a similarity threshold; and classifying the initial sentence set according to the similarity between the sentences to be screened and the initial sentences, to obtain the plurality of original sentence sets, includes: classifying the plurality of recall sentences according to the similarity between each recall sentence and the question input by the user, to obtain the plurality of original sentence sets.
Therefore, by classifying the initial sentence set through recalling the sentences to be screened, the initial sentence set can be divided into the first sentence set and the second sentence set.
With reference to the first aspect, in an embodiment of the present application, after determining whether to delete at least part of the plurality of original sentence sets based on the difference value to obtain a target training sample, the method further includes: inputting the target training sample into an initial query statement model for retraining, to obtain a target query statement model.
Therefore, by retraining the initial query statement model with the target training sample, the embodiment of the present application can improve the model's semantic recognition capability, so that a target sentence better matching the question input by the user is obtained.
With reference to the first aspect, in an embodiment of the present application, after determining whether to delete at least part of the plurality of original sentence sets based on the difference value to obtain a target training sample, the method further includes: training a neural network model to be trained based on the target training sample, to obtain a target neural network model.
In a second aspect, the present application provides an apparatus for obtaining a target training sample, the apparatus comprising: an obtaining module configured to obtain a plurality of original sentence sets, wherein each original sentence set in the plurality of original sentence sets includes an original sentence, a first sentence set and a second sentence set, the similarity between the first sentence set and the original sentence is greater than a threshold, and the similarity between the second sentence set and the original sentence is less than the threshold; a calculating module configured to calculate a difference value between every two sentences in each original sentence set, wherein the difference value represents the degree of difference between the two sentences, and in each pair either one sentence is the original sentence and the other is any sentence in the first sentence set, or one sentence is the original sentence and the other is any sentence in the second sentence set, or one sentence is any sentence in the first sentence set and the other is any sentence in the second sentence set; and a screening module configured to determine whether to delete at least part of the plurality of original sentence sets based on the difference value, to obtain a target training sample.
With reference to the second aspect, in one embodiment of the present application, the calculating module is further configured to: calculate a first edit distance between the original sentence and each sentence in the second sentence set to obtain a plurality of first difference values, wherein the first edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into the original sentence; and calculate a second edit distance between each sentence in the first sentence set and each sentence in the second sentence set to obtain a plurality of second difference values, wherein the second edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into each sentence in the first sentence set.
With reference to the second aspect, in one embodiment of the present application, the first sentence set includes an ith sentence, the second sentence set includes a jth sentence, and i and j are integers greater than or equal to 1; the screening module is further configured to: select the smaller value between the first difference value corresponding to the ith sentence and the second difference value corresponding to the jth sentence; and determine whether to delete at least part of the plurality of original sentence sets based on the smaller value, to obtain a target training sample.
With reference to the second aspect, in one embodiment of the present application, the calculating module is further configured to: calculate a third edit distance between the original sentence and each sentence in the first sentence set to obtain a plurality of third difference values, wherein the third edit distance is determined by the number of edit operations required to rewrite each sentence in the first sentence set into the original sentence; and the screening module is further configured to: retain the ith sentence, the jth sentence and the original sentence if the third difference value corresponding to the ith sentence is greater than or equal to the smaller value; and delete the ith sentence and the jth sentence if the third difference value corresponding to the ith sentence is less than the smaller value.
With reference to the second aspect, in one embodiment of the present application, the obtaining module is further configured to: obtain an initial sentence set, wherein the initial sentence set includes a plurality of initial sentences and sentences to be screened, and the plurality of initial sentences are a plurality of standard question sentences corresponding to one answer; and classify the initial sentence set according to the similarity between the sentences to be screened and the plurality of initial sentences, to obtain the plurality of original sentence sets.
With reference to the second aspect, in one embodiment of the present application, the obtaining module is configured to: screen, from the initial sentence set, a plurality of recall sentences whose similarity to the question input by the user is higher than a similarity threshold; and classify the plurality of recall sentences according to the similarity between each recall sentence and the question input by the user, to obtain the plurality of original sentence sets.
With reference to the second aspect, in one embodiment of the present application, the screening module is further configured to: input the target training sample into an initial query statement model for retraining, to obtain a target query statement model.
With reference to the second aspect, in one embodiment of the present application, the screening module is further configured to: train the neural network model to be trained based on the target training sample, to obtain the target neural network model.
In a third aspect, the present application provides an electronic device, comprising: a processor, a memory, and a bus. The processor is connected to the memory via the bus, and the memory stores computer-readable instructions that, when executed by the processor, implement the method according to any one of the embodiments of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed, implements a method as in any of the embodiments of the first aspect.
Drawings
Fig. 1 is an application scenario diagram of a target training sample according to an embodiment of the present application;
Fig. 2 is a flow chart of a method for obtaining a target training sample according to an embodiment of the present application;
Fig. 3 is a diagram of an apparatus for obtaining a target training sample according to an embodiment of the present application;
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The following explains some terms that may appear in the embodiments of the present application:
Positive sample: a sample whose label is consistent with the true sample label.
Negative sample: a sample whose label is inconsistent with the true sample label.
Simple sample: a sample whose prediction error relative to the true label is small; it can be understood that the model very easily makes a correct judgment on a simple sample.
Difficult sample: a sample whose prediction error relative to the true label is large; it can be understood that the model has difficulty making a correct judgment on a difficult sample, which often contributes a large loss during training. Difficult samples include difficult positive samples and difficult negative samples; that is, the model has difficulty recognizing that a difficult positive sample is consistent with the true sample label, and that a difficult negative sample is inconsistent with it.
In the related art, positive samples are essentially fixed, but negative samples leave a very large selection space. For example, in a classification task, sentences with similar semantics are positive samples and sentences with dissimilar semantics are negative samples. When overly simple negative samples are selected for training, the model can classify the input sample data too easily, so it never learns an accurate semantic representation. In practice, errors are most often caused by samples that look similar on the surface but differ in semantics, so it is clearly not enough to train only on simple negative samples; selecting training samples that are valuable to the model (for example, valuable difficult negative samples) becomes very important.
The embodiments of the present application may be applied to a scenario in which a target training sample is obtained before a query statement model is trained (it can be understood that the query statement model is merely an example; the present application does not limit the function of the model trained with the target training sample). For example, in some embodiments of the present application, a difference value between every two sentences in each of a plurality of original sentence sets is first calculated, and whether to delete at least part of the plurality of original sentence sets is then determined according to the difference values to obtain a target training sample, so that a target training sample that is more valuable for model training can be obtained from the plurality of original sentence sets.
The method steps in the embodiments of the present application are described in detail below with reference to the accompanying drawings.
Fig. 1 provides a scenario diagram of how the target training sample obtained by some embodiments of the present application may be used; the scenario includes the target training sample 110, the initial query statement model 120, and the target query statement model 130. Specifically, after the target training sample is obtained, the initial query statement model 120 is trained with it, and the target query statement model 130 is obtained once training is complete.
It can be understood that the application scenario of the target query statement model is as follows: a question to be queried input by a user is obtained and input into the target query statement model, and the target query statement model matches and outputs the corresponding answer according to that question.
It should be noted that the initial query statement model 120 is obtained after initial training of the query statement model to be trained; it has basic semantic recognition capability and can query a target sentence according to a question, but its precision does not yet meet higher requirements. The target query statement model 130 is obtained by further optimizing the parameters of the initial query statement model 120 (i.e., retraining it with the target training sample); it has a complete semantic recognition function and achieves higher accuracy.
Unlike the embodiment of the present application, the related art generally generates a difficult negative sample from a negative sample and trains the query statement model to be trained with that difficult negative sample. However, because difficult negative samples generated by the related art do not conform to conventional expression habits, the accuracy of the resulting target query statement model is not high. In the embodiment of the present application, a plurality of original sentence sets are screened by calculating the degree of difference between sentences to obtain the target training sample (which, it can be understood, includes difficult negative samples), so the present application can obtain a target training sample that is more valuable for model training.
The following exemplifies a method for obtaining a target training sample provided by the embodiment of the present application. It is understood that the method for obtaining the target training sample in the embodiment of the present application may be applied to any electronic device, for example, a client or a server.
To address at least the problems in the background art, as shown in Fig. 2, some embodiments of the present application provide a method for obtaining a target training sample, the method comprising:
s210, a plurality of original sentence sets are obtained.
It should be noted that, before S210, data processing needs to be performed on the initial sentence set to obtain a plurality of original sentence sets.
For example, a data set of frequently asked questions is collated from questions input by users over a historical period to obtain the initial sentence set. The initial sentence set includes standard questions, each corresponding to an answer, and extended questions corresponding to each standard question; it can be understood that an extended question is obtained by paraphrasing the expression of the standard question, so a standard question and its extended questions are semantically similar. The initial sentence set is then classified according to the similarity between its sentences. The classification of the initial sentence set includes two embodiments:
In an embodiment of the present application, an initial sentence set is first obtained, and the initial sentence set is then classified according to the similarities between the sentences to be screened and the plurality of initial sentences, to obtain a plurality of original sentence sets.
It can be understood that the initial sentence set includes a plurality of initial sentences and sentences to be screened. The plurality of initial sentences are a plurality of standard question sentences corresponding to one answer, and the sentences to be screened are all question sentences in the initial sentence set other than the standard questions; that is, the sentences to be screened include both sentences similar to the initial sentences and sentences dissimilar to them.
Specifically, in step one, each initial sentence (i.e., a standard question) and the corresponding sentences to be screened with a higher similarity to it (i.e., similar questions) are placed in one cluster; that is, the initial sentence set is divided into multiple clusters, each containing one initial sentence and at least one sentence to be screened with a higher similarity to that initial sentence. In step two, each sentence pair formed by one standard question and any similar question within the same cluster is marked with a first symbol, and each sentence pair formed by one standard question and any similar question from different clusters is marked with a second symbol. In step three, for any sentence in a sentence pair (i.e., the original sentence), at least one similar sentence (i.e., the first sentence set) and at least one dissimilar sentence (i.e., the second sentence set) are collected according to the marked symbols to form an original sentence set.
As a specific example of step two above, the similar questions corresponding to standard question A (similar question A1 and similar question A2) are grouped in a first cluster, and the similar questions corresponding to standard question B (similar question B1 and similar question B2) are grouped in a second cluster. Then the sentence pair formed by standard question A and similar question A1 is labeled 1, as are the pairs formed by A and A2, B and B1, and B and B2. The sentence pair formed by standard question A and similar question B1 is labeled 0, as are the pairs formed by A and B2, B and A1, and B and A2.
As a specific example of step three, suppose three sentence pairs formed in step two are a first sentence pair (sentence A1 and sentence B1, marked with the second symbol), a second sentence pair (sentence A2 and sentence B2, marked with the second symbol), and a third sentence pair (sentence A3 and sentence B3, marked with the second symbol). For each of the sentences A1, B1, A2, B2, A3 and B3, the corresponding similar sentences (i.e., those in pairs marked with the first symbol) and dissimilar sentences (i.e., those in pairs marked with the second symbol) are extracted to form an original sentence set. It can be understood that the original sentence of each such original sentence set is sentence A1, B1, A2, B2, A3 or B3.
It can be understood that the first symbol indicates that the similarity between the two sentences in a sentence pair is greater than the threshold, and the second symbol indicates that it is less than the threshold. The first symbol may be, for example, 1 or a, and the second symbol 0 or b; the present application does not limit how the first and second symbols are represented.
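The intra-cluster and cross-cluster labeling described above can be sketched as follows. This is a minimal illustration under assumed data structures (the cluster dictionary and the name label_pairs are not from the patent):

```python
from itertools import combinations

# Each cluster: one standard question followed by its similar questions.
# The contents are illustrative placeholders, not the patent's data.
clusters = {
    "A": ["standard question A", "similar question A1", "similar question A2"],
    "B": ["standard question B", "similar question B1", "similar question B2"],
}

def label_pairs(clusters, first_symbol=1, second_symbol=0):
    """Mark intra-cluster (standard, similar) pairs with the first symbol
    and cross-cluster pairs with the second symbol."""
    pairs = []
    # First symbol: each standard question paired with every similar
    # question from its own cluster.
    for sentences in clusters.values():
        standard, similars = sentences[0], sentences[1:]
        for s in similars:
            pairs.append((standard, s, first_symbol))
    # Second symbol: each standard question paired with the similar
    # questions of every other cluster.
    for key_a, key_b in combinations(clusters, 2):
        for standard, others in ((clusters[key_a][0], clusters[key_b][1:]),
                                 (clusters[key_b][0], clusters[key_a][1:])):
            for s in others:
                pairs.append((standard, s, second_symbol))
    return pairs

for pair in label_pairs(clusters):
    print(pair)  # reproduces the eight labeled pairs of the example above
```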
Therefore, by classifying the initial sentence set, the embodiment of the present application can divide it into a first sentence set (i.e., a set of sentences with a higher similarity to the original sentence) and a second sentence set (i.e., a set of sentences with a lower similarity to the original sentence), so that more valuable sample data can be selected from the two sets.
In another embodiment of the present application, after the initial sentence set is obtained, a plurality of recall sentences whose similarity to the question input by the user is higher than a similarity threshold are first screened from the initial sentence set. The plurality of recall sentences are then classified according to the similarity between each recall sentence and the question input by the user, to obtain a plurality of original sentence sets.
For example, in step one, a question input by the user is obtained, and a plurality of recall sentences whose similarity to that question is higher than the similarity threshold are recalled from the initial sentence set using the Best Matching 25 (BM25) algorithm. In step two, a first similarity threshold larger than the similarity threshold is set; the recall sentences whose similarity is greater than the first similarity threshold form a first recall sentence set, and each sentence pair formed by the question input by the user and any sentence in the first recall sentence set (i.e., one sentence pair comprising the user's question and one recall sentence) is marked with the first symbol; the recall sentences whose similarity is less than or equal to the first similarity threshold form a second recall sentence set, and each sentence pair formed by the user's question and any sentence in the second recall sentence set is marked with the second symbol. In step three, for any sentence in a sentence pair (i.e., the original sentence), at least one similar sentence (i.e., the first sentence set) and at least one dissimilar sentence (i.e., the second sentence set) are collected according to the marked symbols to form an original sentence set.
For example, the plurality of recall sentences whose similarity to the question A input by the user is higher than the similarity threshold include recall sentence A1, recall sentence A2 and recall sentence A3. The recall sentences whose similarity to question A is higher than the first similarity threshold are A2 and A3, and the recall sentence whose similarity is below the first similarity threshold is A1. The sentence pair formed by question A and recall sentence A2 is marked 1, the pair formed by question A and recall sentence A3 is marked 1, and the pair formed by question A and recall sentence A1 is marked 0.
In another embodiment of the present application, the question input by the user is obtained, and a plurality of recall sentences whose similarity to that question is higher than the similarity threshold are recalled from the initial sentence set using BM25; that is, the sentences corresponding to all questions input by the user, such as sentence A, sentence B, sentence C and sentence E, are obtained. If the recall sentence corresponding to the current question is sentence B, then sentence B is marked 1 and sentences A, C and E are marked 0.
It can be understood that BM25 is an unsupervised similarity-scoring mechanism, used, for example, to evaluate similarity in the Elasticsearch search engine.
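The recall-and-split step can be sketched as follows, assuming the open-source rank_bm25 package (the patent names only the BM25 scoring scheme, not a library) and whitespace tokenization; the threshold values are illustrative, since BM25 scores are unnormalized and real thresholds would be tuned on the data:

```python
# pip install rank-bm25  (a third-party BM25 implementation, assumed here)
from rank_bm25 import BM25Okapi

# Illustrative corpus; Chinese text would need a word segmenter, not split().
initial_sentences = [
    "how do I reset my password",
    "I forgot the original password",
    "what documents do I need for a claim",
    "how is the paper policy mailed to me",
]
query = "forgot the password"

tokenize = str.split
bm25 = BM25Okapi([tokenize(s) for s in initial_sentences])
scores = bm25.get_scores(tokenize(query))

SIM_THRESHOLD = 1.0        # similarity threshold (illustrative value)
FIRST_SIM_THRESHOLD = 2.0  # first similarity threshold, larger than the above

recalls = [(s, sc) for s, sc in zip(initial_sentences, scores) if sc > SIM_THRESHOLD]
first_set = [s for s, sc in recalls if sc > FIRST_SIM_THRESHOLD]    # pairs marked 1
second_set = [s for s, sc in recalls if sc <= FIRST_SIM_THRESHOLD]  # pairs marked 0
print(first_set, second_set)
```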
The question input by the user may be a question previously entered by the user and recorded in the system.
As a specific embodiment of the present application, the classification of the initial sentence set is shown in Table 1 below, where a pair marked 1 indicates that the two sentences are similar and a pair marked 0 indicates that they are not.
Table 1. Initial sentence set classification table
(Table 1 is provided as an image in the original publication and is not reproduced here.)
After the initial sentence set is classified, the question A input by the user (or a recall sentence B), together with at least one corresponding similar sentence (i.e., the first sentence set) and at least one dissimilar sentence (i.e., the second sentence set), forms an original sentence set. The plurality of original sentence sets are then deduplicated, so that each question A input by the user (or recall sentence B) corresponds to one original sentence set. The deduplicated original sentence sets are then preprocessed: invalid characters such as line breaks and spaces are removed, common punctuation is normalized, and content with Chinese readings, such as digits, percent signs, and the signs for addition, subtraction, multiplication and division, is converted into its reading form.
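A sketch of this deduplication and preprocessing step follows; the set representation, function names, and exact character mappings are assumptions for illustration, since the patent names only the categories of content to normalize:

```python
import re

# Illustrative reading table: the patent lists the categories (digits,
# percent signs, arithmetic signs) but not the exact mappings.
READINGS = {"1": "一", "2": "二", "%": "百分之",
            "+": "加", "-": "减", "*": "乘", "/": "除"}

def preprocess(sentence: str) -> str:
    # Remove invalid characters such as line breaks and spaces.
    sentence = re.sub(r"\s+", "", sentence)
    # Normalize common punctuation (full-width to half-width here).
    sentence = sentence.replace("？", "?").replace("，", ",")
    # Convert content with Chinese readings into reading form.
    return "".join(READINGS.get(ch, ch) for ch in sentence)

def dedupe(original_sentence_sets):
    """Keep one original sentence set per original sentence; each set is
    assumed to be a dict with 'original', 'first_set' and 'second_set' keys."""
    seen, kept = set(), []
    for s in original_sentence_sets:
        if s["original"] not in seen:
            seen.add(s["original"])
            kept.append(s)
    return kept
```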
It can be understood that, in this embodiment, the original sentence is the question A input by the user or any recall sentence B.
Therefore, by classifying the initial sentence set in this recall-based manner, the initial sentence set can be accurately divided into the first sentence set and the second sentence set.
It can be understood that each original sentence set in the plurality of original sentence sets comprises an original sentence, a first sentence set and a second sentence set, the similarity between each sentence in the first sentence set and the original sentence is greater than a threshold value, and the similarity between each sentence in the second sentence set and the original sentence is less than the threshold value.
For example, if the original sentence in one original sentence set is "how do I get a paper policy mailed for insurance bought on the internet", the first sentence set may include "how can the policy for insurance bought from your company be mailed to me" and "can a paper policy be mailed for insurance bought on the internet", and the second sentence set may include "is insurance bought on the internet the same as insurance bought in a store" and "how is the policy sent to me, by mail or by express".
S220, calculating the difference value between every two sentences in each original sentence set.
It can be understood that the difference value represents the degree of difference between two sentences, where, in each pair, either one sentence is the original sentence and the other is any sentence in the first sentence set, or one sentence is the original sentence and the other is any sentence in the second sentence set, or one sentence is any sentence in the first sentence set and the other is any sentence in the second sentence set.
The difference value between every two sentences is expressed as an edit distance. It can be understood that the edit distance is a quantitative measure of the degree of difference between two sentences. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to convert one sentence into the other; the larger the distance, the more different the two sentences are. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character. For example, the edit distance between sentence a, "I forgot the original password", and sentence b, "forgot the password", is 5.
In one embodiment of the present application, S220 includes: calculating a first edit distance between the original sentence and each sentence in the second sentence set to obtain a plurality of first difference values; and calculating a second edit distance between each sentence in the first sentence set and each sentence in the second sentence set to obtain a plurality of second difference values.
It can be understood that the first edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into the original sentence, and the second edit distance by the number of edit operations required to rewrite each sentence in the second sentence set into each sentence in the first sentence set.
For example, the first sentence set includes sentence A1 and sentence A2, the second sentence set includes sentence B1 and sentence B2, and the original sentence is sentence C. First difference values are calculated between C and B1 and between C and B2, and second difference values between A1 and B1, between A1 and B2, between A2 and B1, and between A2 and B2.
The specific edit distance calculation proceeds as follows. First, an initialization matrix is constructed. A sentence marker, i.e., a symbol placed in front of a sentence (e.g., "#"), is prepended to each sentence of a pair (e.g., sentence a and sentence b), and the first row and first column of the matrix hold the edit distances between the marker (the empty prefix) of one sentence and each prefix of the other.
Then, starting from the first character of sentence a, the characters of sentence b are compared in sequence to obtain the minimum number of conversion steps for each cell. Specifically, if the two characters are the same (for example, "I" in sentence a and "I" in sentence b), the value in the upper-left (diagonal) cell is copied into the current cell; if they differ (for example, "I" in sentence a and "forgot" in sentence b), the minimum of the values in the cells to the left, above, and upper-left is taken and 1 is added before storing.
As a specific example of the present application, consider Table 2:
Table 2. Edit distance calculation table
(Table 2 is provided as an image in the original publication and is not reproduced here.)
As shown in Table 2 above, the sentence pair consists of the original sentence S1 and a sentence S2 from the first sentence set, where S1 is "I forgot the original password" and S2 is "forgot the password".
First, the first character "I" of S1 is compared with the first character "forgot" of S2. The two are different, so the minimum value, 0, is taken from the positions to the left, above, and upper-left, 1 is added to give an edit distance of 1, and the result is stored at position T[1][1].
Then "I" is compared in turn with each remaining character of S2, giving the edit distances between the one-character prefix "I" of S1 and the six prefixes of S2, from the single character "forgot" up to the full sentence.
Next, the characters of S1 are appended one at a time, and the edit distance between each prefix of S1 and each of the six prefixes of S2 is obtained in the same way; for example, the edit distances between the prefix "I forgot" of S1 and the six prefixes of S2.
Finally, the value at position T[7][6] is read; as shown in Table 2, T[7][6] = 5, and this value is taken as the edit distance between S1 and S2.
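The walkthrough above is the standard Levenshtein dynamic program. A minimal sketch follows; the classic "kitten"/"sitting" pair is used as a self-check because the patent's example sentences are Chinese, and applying the function to the original S1 and S2 yields the value 5 shown in Table 2:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of insertions, deletions
    and substitutions needed to turn a into b."""
    # T has len(a)+1 rows and len(b)+1 columns; row 0 and column 0 hold the
    # distances from the empty prefix (the sentence marker of the patent).
    T = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        T[i][0] = i
    for j in range(len(b) + 1):
        T[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1]           # characters match: copy diagonal
            else:
                T[i][j] = 1 + min(T[i - 1][j],      # deletion
                                  T[i][j - 1],      # insertion
                                  T[i - 1][j - 1])  # substitution
    return T[len(a)][len(b)]

assert edit_distance("kitten", "sitting") == 3
```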
Therefore, by calculating edit distances, the embodiment of the present application can quantify the differences between the original sentence and the second sentence set and between the first sentence set and the second sentence set, and thus screen out target training data that is valuable to the model according to these differences.
In an embodiment of the present application, S220 further includes calculating a third edit distance between the original sentence and each sentence in the first sentence set, to obtain a plurality of third difference values.
It can be understood that the third edit distance is determined by the number of edit operations required to rewrite each sentence in the first sentence set into the original sentence.
For example, the first sentence set includes sentence A1 and sentence A2, and the original sentence is sentence C. Third difference values are calculated between sentence C and sentence A1 and between sentence C and sentence A2.
S230, determining whether to delete at least part of the plurality of original sentence sets based on the difference values, to obtain a target training sample.
In one embodiment of the present application, the first sentence set includes an ith sentence and the second sentence set includes a jth sentence. First, the smaller value between the first difference value corresponding to the ith sentence and the second difference value corresponding to the jth sentence is selected; then, based on that smaller value, whether to delete at least part of the plurality of original sentence sets is determined, to obtain the target training sample.
In one embodiment of the present application, if the third difference value corresponding to the ith sentence is greater than or equal to the smaller value, the ith sentence, the jth sentence and the original sentence are retained; if the third difference value corresponding to the ith sentence is less than the smaller value, the ith sentence and the jth sentence are deleted.
That is to say, the original sentence, the ith sentence and the jth sentence form a sentence group. The smaller of the first and second difference values corresponding to the group is taken, along with the third difference value between the original sentence and the ith sentence. If the third difference value is greater than or equal to the smaller value, the group is retained; if it is less, the group is deleted. In other words, sentences whose difference value is greater than or equal to the smaller value are judged to be difficult negative samples.
It can be understood that, in the embodiment of the present application, determining whether to delete at least part of the original sentence sets means determining whether to delete the sentence group composed of the original sentence, the ith sentence and the jth sentence.
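Combining the difference values of S220 with the keep/delete rule of S230 gives the following sketch. It reuses the edit_distance function above; the nested-loop pairing and the flat list of kept groups are assumptions for illustration:

```python
def screen(original, first_set, second_set):
    """Keep the (original, ith, jth) sentence groups whose third difference
    value is at least the smaller of the corresponding first and second
    difference values; such groups contain valuable difficult negatives."""
    kept = []
    for i_sent in first_set:                          # ith sentence (similar)
        third = edit_distance(original, i_sent)       # third difference value
        for j_sent in second_set:                     # jth sentence (dissimilar)
            first = edit_distance(original, j_sent)   # first difference value
            second = edit_distance(i_sent, j_sent)    # second difference value
            smaller = min(first, second)
            if third >= smaller:
                kept.append((original, i_sent, j_sent))
            # otherwise the ith and jth sentences are deleted for this group
    return kept
```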
Therefore, by comparing the third difference value with the smaller value, the embodiment of the present application can accurately identify the sentence groups with smaller difference values, so as to obtain valuable difficult negative samples.
In one embodiment of the present application, after S230, the method further includes: inputting the target training sample into the initial query statement model for retraining, to obtain a target query statement model.
That is, the purpose of obtaining the target training sample in the present application is to retrain a model with the difficult negative samples it contains, so as to obtain a model with higher accuracy and stronger semantic recognition capability. It can be understood that target training samples can be applied to many classification tasks, and both difficult samples and simple samples are needed when training a model.
As an example, target training samples obtained by the method of the embodiment of the present application are shown in Table 3:
Table 3. Target training sample example table
(Table 3 is provided as an image in the original publication and is not reproduced here.)
In Table 3, the difficult negative samples are similar to the positive samples in expression but different in semantics; retraining the initial query statement model with such samples yields a more accurate target query statement model.
It can be understood that retraining the initial query statement model is merely an example: other models may be retrained, or a model to be trained may be trained directly with the target training sample.
Therefore, by retraining the initial query statement model with the target training sample, the embodiment of the present application can improve the model's semantic recognition capability, so that the model can distinguish sentences that are similar in expression but different in semantics, and thus return a target sentence that better matches the question input by the user.
In one embodiment of the present application, after S230, the method further includes: training a neural network model to be trained based on the target training sample, to obtain a target neural network model.
It can be understood that the neural network model to be trained may be a semantic recognition model having a semantic recognition function, a query statement model having a sentence retrieval function, or the like.
That is to say, as a specific embodiment of the present application, a semantic recognition model to be trained is trained based on a target training sample, so as to obtain a target semantic recognition model. As another specific embodiment of the present application, a query statement model to be trained is trained based on a target training sample, so as to obtain a target query statement model.
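A minimal stand-in for this training step is sketched below, assuming scikit-learn; the TF-IDF character-n-gram pair classifier is purely a placeholder for the patent's neural query statement model, and the sample pairs are invented for illustration:

```python
# scikit-learn is an assumption; the patent does not prescribe a framework.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Target training sample: (sentence_1, sentence_2, label), with 1 for a
# similar (positive) pair and 0 for a difficult negative kept by screening.
pairs = [
    ("I forgot the original password", "forgot the password", 1),
    ("I forgot the original password", "how do I change my password", 0),
]

vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2))
X = vectorizer.fit_transform(s1 + " || " + s2 for s1, s2, _ in pairs)
y = [label for _, _, label in pairs]

model = LogisticRegression().fit(X, y)
print(model.predict(vectorizer.transform(
    ["I forgot the original password || forgot the password"])))
```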
A method of obtaining a target training sample in the present application is described above, and an apparatus of obtaining a target training sample in the present application is described below.
As shown in fig. 3, an apparatus 300 for obtaining a target training sample includes: an acquisition module 310, a calculation module 320, and a screening module 330.
An obtaining module 310, configured to obtain a plurality of original sentence sets, where each of the original sentence sets includes an original sentence, a first sentence set, and a second sentence set, the similarity between the first sentence set and the original sentence is greater than a threshold, and the similarity between the second sentence set and the original sentence is less than the threshold.
A calculating module 320, configured to calculate a difference value between every two sentences in each original sentence set, where the difference value represents the degree of difference between the two sentences; in each pair, either one sentence is the original sentence and the other is any sentence in the first sentence set, or one sentence is the original sentence and the other is any sentence in the second sentence set, or one sentence is any sentence in the first sentence set and the other is any sentence in the second sentence set.
A screening module 330 configured to determine whether to delete at least a part of the plurality of original sentence sets based on the difference value, to obtain a target training sample.
In one embodiment of the present application, the calculating module 320 is further configured to: calculate a first edit distance between the original sentence and each sentence in the second sentence set to obtain a plurality of first difference values, where the first edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into the original sentence; and calculate a second edit distance between each sentence in the first sentence set and each sentence in the second sentence set to obtain a plurality of second difference values, where the second edit distance is determined by the number of edit operations required to rewrite each sentence in the second sentence set into each sentence in the first sentence set.
In one embodiment of the present application, the first sentence set includes an ith sentence, the second sentence set includes a jth sentence, and i and j are integers greater than or equal to 1; the screening module 330 is further configured to: select the smaller value between the first difference value corresponding to the ith sentence and the second difference value corresponding to the jth sentence; and determine whether to delete at least part of the plurality of original sentence sets based on the smaller value, to obtain the target training sample.
In one embodiment of the present application, the calculating module 320 is further configured to: calculate a third edit distance between the original sentence and each sentence in the first sentence set to obtain a plurality of third difference values, where the third edit distance is determined by the number of edit operations required to rewrite each sentence in the first sentence set into the original sentence; and the screening module 330 is further configured to: retain the ith sentence, the jth sentence and the original sentence if the third difference value corresponding to the ith sentence is greater than or equal to the smaller value; and delete the ith sentence and the jth sentence if the third difference value corresponding to the ith sentence is less than the smaller value.
In one embodiment of the present application, the obtaining module 310 is further configured to: obtain an initial sentence set, where the initial sentence set includes a plurality of initial sentences and sentences to be screened, and the plurality of initial sentences are a plurality of standard question sentences corresponding to one answer; and classify the initial sentence set according to the similarity between the sentences to be screened and the plurality of initial sentences, to obtain the plurality of original sentence sets.
In one embodiment of the present application, the obtaining module 310 is configured to: screen, from the initial sentence set, a plurality of recall sentences whose similarity to the question input by the user is higher than the similarity threshold; and classify the plurality of recall sentences according to the similarity between each recall sentence and the question input by the user, to obtain the plurality of original sentence sets.
In one embodiment of the present application, the screening module 330 is further configured to: and inputting the target training sample into an initial query statement model for retraining to obtain a target query statement model.
In one embodiment of the present application, the screening module 330 is further configured to: and training the neural network model to be trained based on the target training sample to obtain the target neural network model.
In the embodiment of the present application, the modules shown in Fig. 3 can implement the processes in the method embodiments of Figs. 1 and 2; the operations and/or functions of the respective modules in Fig. 3 implement the corresponding flows in those method embodiments. Reference may be made to the description of the above method embodiments; a detailed description is omitted here to avoid redundancy.
As shown in Fig. 4, an embodiment of the present application provides an electronic device 400, including a processor 410, a memory 420 and a bus 430. The processor is connected to the memory through the bus, and the memory stores computer-readable instructions that, when executed by the processor, implement the method of any one of the above embodiments. Reference may be made to the description of the above method embodiments; a detailed description is omitted here to avoid repetition.
The bus is used to realize direct connection and communication among these components. The processor in the embodiments of the present application may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, any of which can implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), or an Electrically Erasable Programmable Read-Only Memory (EEPROM). The memory stores computer-readable instructions that, when executed by the processor, perform the methods described in the above embodiments.
It will be appreciated that the configuration shown in fig. 4 is merely illustrative; the electronic device 400 may include more or fewer components than shown, or have a different configuration. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof.
Embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a server, the method in any of the above embodiments is implemented. For details, reference may be made to the description of the above method embodiments; a detailed description is omitted here to avoid repetition.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the protection scope of the present application. It should be noted that like reference numbers and letters refer to like items in the figures; once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of obtaining a target training sample, the method comprising:
acquiring a plurality of original sentence sets, wherein each original sentence set in the plurality of original sentence sets comprises an original sentence, a first sentence set, and a second sentence set, the similarity between the first sentence set and the original sentence is greater than a threshold value, and the similarity between the second sentence set and the original sentence is less than the threshold value;
calculating a difference value between every two sentences in each original sentence set, wherein the difference value is used for representing the degree of difference between the two sentences, and wherein, in each pair of sentences, one sentence is the original sentence and the other is any sentence in the first sentence set; or one sentence is the original sentence and the other is any sentence in the second sentence set; or one sentence is any sentence in the first sentence set and the other is any sentence in the second sentence set;
and determining, based on the difference values, whether to delete at least part of the plurality of original sentence sets, to obtain a target training sample.
2. The method of claim 1, wherein the calculating the difference value between every two sentences in each original sentence set comprises:
calculating a first edit distance between the original sentence and each sentence in the second sentence set to obtain a plurality of first difference values, wherein the first edit distance is determined by the number of rewriting operations required to rewrite each sentence in the second sentence set into the original sentence;
and calculating a second edit distance between each sentence in the first sentence set and each sentence in the second sentence set to obtain a plurality of second difference values, wherein the second edit distance is determined by the number of rewriting operations required to rewrite each sentence in the second sentence set into each sentence in the first sentence set.
3. The method of claim 2, wherein the first sentence set comprises an ith sentence, the second sentence set comprises a jth sentence, and i and j are integers greater than or equal to 1;
the determining, based on the difference values, whether to delete at least part of the plurality of original sentence sets to obtain a target training sample comprises:
selecting the smaller value between the first difference value corresponding to the ith sentence and the second difference value corresponding to the jth sentence;
and determining, based on the smaller value, whether to delete at least part of the plurality of original sentence sets, to obtain the target training sample.
4. The method of claim 3, wherein the calculating the difference value between every two sentences in each original sentence set comprises:
calculating a third edit distance between the original sentence and each sentence in the first sentence set to obtain a plurality of third difference values, wherein the third edit distance is determined by the number of rewriting operations required to rewrite each sentence in the first sentence set into the original sentence;
the determining, based on the smaller value, whether to delete at least part of the plurality of original sentence sets comprises:
if the third difference value corresponding to the ith sentence is greater than or equal to the smaller value, retaining the ith sentence, the jth sentence, and the original sentence;
and if the third difference value corresponding to the ith sentence is smaller than the smaller value, deleting the ith sentence and the jth sentence.
5. The method according to any of claims 1-4, wherein prior to said obtaining a plurality of original sentence sets, the method further comprises:
acquiring an initial sentence set, wherein the initial sentence set comprises a plurality of initial sentences and sentences to be screened, and the plurality of initial sentences are a plurality of standard question sentences corresponding to one answer;
and classifying the initial sentence set according to the similarity between the sentences to be screened and the initial sentences, to obtain the plurality of original sentence sets.
6. The method of claim 5, wherein after the obtaining the initial set of statements, the method further comprises:
screening, from the initial sentence set, a plurality of recall sentences whose similarity to the question input by the user is higher than a similarity threshold;
the classifying the initial sentence set according to the similarity between the sentences to be screened and the initial sentences to obtain the plurality of original sentence sets comprises:
classifying the plurality of recall sentences according to the similarity between each of the plurality of recall sentences and the question input by the user, to obtain the plurality of original sentence sets.
7. The method of any of claims 1-4, wherein after said determining whether to delete at least a portion of the plurality of original sentence sets based on the difference values to obtain a target training sample, the method further comprises:
inputting the target training sample into an initial query sentence model for retraining, to obtain a target query sentence model.
8. The method of any of claims 1-4, wherein after said determining whether to delete at least a portion of the plurality of original sentence sets based on the difference values to obtain a target training sample, the method further comprises:
training a neural network model to be trained based on the target training sample, to obtain a target neural network model.
9. An electronic device, comprising: a processor, a memory, and a bus;
the processor is connected to the memory via the bus, and the memory stores computer-readable instructions which, when executed by the processor, implement the method of any one of claims 1-8.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the method of any one of claims 1-8.
CN202210586079.5A 2022-05-26 2022-05-26 Method for obtaining target training sample, electronic device and medium Pending CN114861625A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586079.5A CN114861625A (en) 2022-05-26 2022-05-26 Method for obtaining target training sample, electronic device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586079.5A CN114861625A (en) 2022-05-26 2022-05-26 Method for obtaining target training sample, electronic device and medium

Publications (1)

Publication Number Publication Date
CN114861625A true CN114861625A (en) 2022-08-05

Family

ID=82641913

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586079.5A Pending CN114861625A (en) 2022-05-26 2022-05-26 Method for obtaining target training sample, electronic device and medium

Country Status (1)

Country Link
CN (1) CN114861625A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167455A (en) * 2022-12-27 2023-05-26 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium
CN116167455B (en) * 2022-12-27 2023-12-22 北京百度网讯科技有限公司 Model training and data deduplication method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US10963685B2 (en) Generating variations of a known shred
CN110717034A (en) Ontology construction method and device
CN112163424B (en) Data labeling method, device, equipment and medium
EP3640847A1 (en) Systems and methods for identifying form fields
CN111767716B (en) Method and device for determining enterprise multi-level industry information and computer equipment
CN112632989B (en) Method, device and equipment for prompting risk information in contract text
US20170076152A1 (en) Determining a text string based on visual features of a shred
CN113312899B (en) Text classification method and device and electronic equipment
RU2768233C1 (en) Fuzzy search using word forms for working with big data
CN111125295A (en) Method and system for obtaining food safety question answers based on LSTM
CN113157867A (en) Question answering method and device, electronic equipment and storage medium
CN112395881A (en) Material label construction method and device, readable storage medium and electronic equipment
CN113111159A (en) Question and answer record generation method and device, electronic equipment and storage medium
CN111708870A (en) Deep neural network-based question answering method and device and storage medium
CN114861625A (en) Method for obtaining target training sample, electronic device and medium
CN114139537A (en) Word vector generation method and device
CN113505786A (en) Test question photographing and judging method and device and electronic equipment
Chiney et al. Handwritten data digitization using an anchor based multi-channel CNN (MCCNN) trained on a hybrid dataset (h-EH)
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN113722477B (en) Internet citizen emotion recognition method and system based on multitask learning and electronic equipment
CN112651590B (en) Instruction processing flow recommending method
CN114023380A (en) Toxic organism identification method and device and server
CN110533035B (en) Student homework page number identification method based on text matching
CN111382246B (en) Text matching method, matching device, terminal and computer readable storage medium
CN114357990B (en) Text data labeling method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination