CN117951514A - Data integration method, device and equipment - Google Patents

Data integration method, device and equipment

Info

Publication number
CN117951514A
Authority
CN
China
Prior art keywords
target, label, sample, training sample, tag
Prior art date
Legal status
Pending
Application number
CN202311015412.8A
Other languages
Chinese (zh)
Inventor
白安琪
蒋宁
陆全
夏粉
吴海英
肖冰
Current Assignee
Mashang Consumer Finance Co Ltd
Original Assignee
Mashang Consumer Finance Co Ltd
Priority date
Filing date
Publication date
Application filed by Mashang Consumer Finance Co Ltd
Priority to CN202311015412.8A
Publication of CN117951514A

Classifications

    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/22 Pattern recognition; matching criteria, e.g. proximity measures
    • G06F 18/24 Pattern recognition; classification techniques
    • G06F 18/25 Pattern recognition; fusion techniques
    • G06F 40/126 Handling natural language data; character encoding
    • G06F 40/194 Handling natural language data; calculation of difference between files
    • G06F 40/279 Handling natural language data; recognition of textual entities


Abstract

An embodiment of the present application provides a data integration method, apparatus, and device. The method includes: acquiring a plurality of pieces of annotation data, each comprising a training sample and a label of the training sample; determining, among the labels, a plurality of target label combinations that satisfy a preset condition; obtaining, from the training samples, a plurality of training sample subsets corresponding to each target label combination; determining, according to the similarity between the training sample subsets, the target labels in each target label combination that satisfy an integration condition; and integrating the target labels with their corresponding training samples to obtain a target annotation data set. Embodiments of the present application can improve the reuse rate of annotation data.

Description

Data integration method, device and equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data integration method, apparatus, and device.
Background
With the continuous development of artificial intelligence, a neural network can be trained with annotation data to obtain a target model, which can be widely applied in many scenarios, such as image processing and audio processing. Annotation data is usually obtained by manually labeling sample data based on historical experience. That is, model training mainly comprises two stages: one is labeling the sample data to obtain annotation data, and the other is performing training with the annotation data to obtain the target model.
Disclosure of Invention
The present application provides a data integration method, apparatus, and device for improving the reuse rate of annotation data.
In a first aspect, an embodiment of the present application provides a data integration method, including:
acquiring a plurality of pieces of annotation data, each comprising a training sample and a label of the training sample;
determining, among the labels, a plurality of target label combinations that satisfy a preset condition;
obtaining, from the training samples, a plurality of training sample subsets corresponding to each target label combination;
determining, according to the similarity between the training sample subsets, the target labels in each target label combination that satisfy an integration condition;
and integrating the target labels with the training samples of the target labels to obtain a target annotation data set.
It can be seen that, in the embodiments of the present application, a plurality of pieces of annotation data comprising training samples and labels are acquired; a plurality of target label combinations in the labels that satisfy a preset condition are determined; a plurality of training sample subsets corresponding to each target label combination are obtained from the training samples; the target labels in each target label combination that satisfy an integration condition are determined according to the similarity between the training sample subsets; and the target labels are integrated with their corresponding training samples to obtain a target annotation data set. Scattered annotation data is thereby integrated so that the resulting target annotation data set can be used by subsequent related tasks, which greatly improves the reuse rate of the annotation data. Moreover, because the similarity between the training sample subsets characterizes the internal cohesion of the corresponding labels, determining the target labels from this similarity ensures the accuracy of the target labels and hence the accuracy of the target annotation data set.
In a second aspect, an embodiment of the present application provides a data integration apparatus, including:
an acquisition module, configured to acquire a plurality of pieces of annotation data, each comprising a training sample and a label of the training sample;
a first determining module, configured to determine, among the labels, a plurality of target label combinations that satisfy a preset condition;
the acquisition module being further configured to acquire, from the training samples, training sample subsets corresponding to each target label combination;
a second determining module, configured to determine, according to the similarity between the training sample subsets, the target labels in each target label combination that satisfy an integration condition;
and an integration module, configured to integrate the target labels with the training samples corresponding to the target labels to obtain a target annotation data set.
In a third aspect, an embodiment of the present application provides an electronic device, including:
a processor; and a memory arranged to store computer-executable instructions configured to be executed by the processor, the executable instructions comprising steps for performing the data integration method provided in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a storage medium for storing computer executable instructions that cause a computer to perform the data integration method provided in the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of one or more embodiments of the present application or of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments of the present application; a person skilled in the art may derive other drawings from them without inventive effort.
Fig. 1 is a first flowchart of a data integration method according to an embodiment of the present application;
Fig. 2 is a second flowchart of a data integration method according to an embodiment of the present application;
Fig. 3 is a third flowchart of a data integration method according to an embodiment of the present application;
Fig. 4 is a fourth flowchart of a data integration method according to an embodiment of the present application;
Fig. 5 is a fifth flowchart of a data integration method according to an embodiment of the present application;
Fig. 6 is a sixth flowchart of a data integration method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of module composition of a data integration device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the technical solutions of one or more embodiments of the present application, these technical solutions will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from one or more embodiments of the application without inventive effort shall fall within the scope of protection of this document.
The embodiments of the present application provide a data integration method, apparatus, and device. Annotation data is common in the field of artificial intelligence, where it is used for supervised model training. At present, however, annotation data is often discarded once model training has produced a target model; even in the same or similar training scenarios, new annotation data is produced for each round of training, and the scattered historical annotation data is not retrieved for reuse. The reuse rate of annotation data is therefore low. In addition, because training samples are re-labeled each time to obtain annotation data for the same or similar training scenarios, some training samples are inevitably labeled repeatedly, which wastes labor and time. On this basis, an embodiment of the present application provides a data integration method in which a plurality of pieces of annotation data comprising training samples and labels are acquired; a plurality of target label combinations in the labels that satisfy a preset condition are determined; a plurality of training sample subsets corresponding to each target label combination are obtained from the training samples; the target labels in each target label combination that satisfy an integration condition are determined according to the similarity between the training sample subsets; and the target labels are integrated with their corresponding training samples to obtain a target annotation data set.
Scattered annotation data is thereby integrated so that the resulting target annotation data set can be used by subsequent related tasks, which greatly improves the reuse rate of the annotation data and avoids problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal cohesion of the corresponding labels, determining from this similarity the target labels that satisfy the integration condition ensures the accuracy of the target labels and hence the accuracy of the target annotation data set.
Specifically, fig. 1 is a flowchart of a data integration method according to one or more embodiments of the present application. The method in fig. 1 can be executed by a data integration apparatus, which may be deployed in a terminal device or in a server. The terminal device may be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, or the like; the server may be an independent server or a server cluster composed of a plurality of servers. As shown in fig. 1, the method comprises the following steps:
step S102, a plurality of labeling data are obtained; each annotation data includes a training sample and a label for the training sample;
In order to ensure the accuracy of the integration process, the plurality of pieces of acquired annotation data may include annotation data used in training the same or similar models applied to the same scenario, and annotation data used in training the same or similar models applied to similar scenarios. For example, the annotation data includes annotation data 1 to 200 used by user A to train model 1 for scene A, annotation data 300 to 600 used by user B to train model 2 for scene A, and annotation data 700 to 900 used by user C to train model 2 for scene B, wherein scene A (e.g., a debt-collection scene) and scene B (e.g., a pre-collection scene) are similar scenes, and model 1 (e.g., a text-based intention recognition model) and model 2 (e.g., a text-based keyword extraction model) are similar models.
Further, the plurality of pieces of acquired annotation data may include historical annotation data, current annotation data obtained by the current labeling process, target annotation data obtained by a previous integration process, and so on. As one example, the plurality of pieces of annotation data are historical annotation data used in previous model training; that is, the data integration method provided by the present application can be applied to the integration of historical annotation data. As another example, the plurality of pieces of annotation data include both historical annotation data used in previous model training and current annotation data obtained by the current labeling process; that is, the method can be applied to the joint integration of historical and current annotation data. As yet another example, the plurality of pieces of annotation data include target annotation data obtained by a previous integration process (i.e., historical target annotation data) and current annotation data obtained by the current labeling process; that is, the method can be applied to the integration of historical target annotation data with current annotation data. The possible combinations are not listed exhaustively here.
A training sample can be any type of data, such as an image, audio, or text. It should be noted that, within one round of annotation data integration, the training samples are of the same data type; for example, the training samples of the acquired annotation data are all images, or all texts, and so on. The label of a training sample may be any content; the present application imposes no particular limitation.
Considering that, in practical applications, the same annotation data may be used in different model training processes, or the same label may be applied to different training samples, in order to ensure the accuracy of the integration process, in one or more embodiments of the present application, step S102 may further include: if the acquired annotation data contain pieces of annotation data that have different attributes but identical labels, adding a distinguishing mark to the labels of those pieces. The attributes may include the model the data belongs to, the project it belongs to, the generation time period, and so on. The specific content of the distinguishing mark can be set as needed in practical applications. In one embodiment, the differing attribute may be the project the data belongs to; accordingly, the project name may be used as the distinguishing mark. For example, the label of annotation data 1 is "principal" and its project name is "collection", while the label of annotation data 2 is also "principal" but its project name is "pre-collection"; after the distinguishing marks are added, the label of annotation data 1 becomes "collection-principal" and the label of annotation data 2 becomes "pre-collection-principal".
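The distinguishing-mark step above can be sketched as follows. This is a minimal illustration rather than the patented implementation; the record shape (dicts with `label` and `project` fields) and the hyphen separator are assumptions:

```python
from collections import defaultdict

def add_distinguishing_marks(records):
    """Prefix a label with its project name when the same label appears
    under different projects (the attribute used here is the project;
    field names are illustrative assumptions)."""
    projects_by_label = defaultdict(set)
    for r in records:
        projects_by_label[r["label"]].add(r["project"])
    for r in records:
        # Only labels shared across records with differing attributes get a mark.
        if len(projects_by_label[r["label"]]) > 1:
            r["label"] = f'{r["project"]}-{r["label"]}'
    return records

data = [
    {"label": "principal", "project": "collection"},
    {"label": "principal", "project": "pre-collection"},
    {"label": "interest", "project": "collection"},
]
marked = [r["label"] for r in add_distinguishing_marks(data)]
```

After marking, the two "principal" labels become "collection-principal" and "pre-collection-principal", while the unambiguous "interest" label is left unchanged.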
Step S104, determining a plurality of target label combinations meeting preset conditions in the labels;
In order to improve the accuracy of label integration, in the embodiments of the present application, after the plurality of pieces of annotation data are acquired, a plurality of target label combinations satisfying a preset condition are determined among all labels included in the annotation data. For convenience of distinction, the two labels in a target label combination are referred to as the first label and the second label, where the second label is the label, among all labels included in the annotation data, with the highest similarity to the first label. It will be appreciated that, for a first label, the second label is the label most similar to it; but for that second label, the first label is not necessarily its most similar label. It will further be appreciated that, across different target label combinations, the first labels may differ while the second labels may be the same or different.
Step S106, obtaining a plurality of training sample subsets corresponding to each target label combination from the training samples;
Specifically, at least one training sample subset corresponding to the first label in each target label combination and at least one training sample subset corresponding to the second label are obtained from the training sample set, and the subsets corresponding to the first and second labels are determined as the plurality of training sample subsets corresponding to that target label combination.
Step S108, determining target labels meeting the integration condition in the corresponding target label combination according to the similarity between the training sample subsets;
For a given label, the similarity between its training sample subsets characterizes the internal cohesion of the label: the higher the similarity between the subsets, the higher the internal cohesion of the label and the lower its separability. Therefore, in the embodiments of the present application, the target labels satisfying the integration condition in each target label combination are determined according to the plurality of training sample subsets corresponding to that combination.
Step S110, integrating the target labels and training samples corresponding to the target labels to obtain a target labeling data set.
The integration process may include a splitting process and a fusing process, and particularly, reference will be made to the following related description.
In one or more embodiments of the present application, a plurality of pieces of annotation data comprising training samples and labels are acquired, and a plurality of target label combinations in the labels that satisfy a preset condition are determined; a plurality of training sample subsets corresponding to each target label combination are obtained from the training samples; the target labels in each target label combination that satisfy an integration condition are determined according to the similarity between the training sample subsets; and the target labels are integrated with their corresponding training samples to obtain a target annotation data set. Scattered annotation data is thereby integrated so that the resulting target annotation data set can be used by subsequent related tasks, which greatly improves the reuse rate of the annotation data and avoids problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal cohesion of the corresponding labels, determining the target labels from this similarity ensures their accuracy and hence the accuracy of the target annotation data set.
In order to improve accuracy of label integration, in the embodiment of the application, after a plurality of label data are acquired, a plurality of target label combinations meeting preset conditions in all labels included in the label data are determined. Specifically, as shown in fig. 2, step S104 may include the following steps S104-2 to S104-6:
Step S104-2, carrying out pairwise combination on each label to obtain a plurality of first label combinations;
step S104-4, determining the similarity between two labels in each first label combination;
Specifically, each label is encoded by a coding model to obtain a first feature vector of each label; for each first label combination, the vector distance between the two labels is determined according to their first feature vectors, and the edit distance between the two labels is determined; the similarity between the two labels in the first label combination is then determined according to the vector distance and the edit distance.
The coding model may be obtained by performing training processing in advance. In one embodiment, a plurality of unlabeled training samples (i.e., training samples without labeled labels) belonging to the same or similar scene as the plurality of labeled data in step S102 may be obtained, and the unlabeled training samples are used to train the pre-training model to obtain the coding model. Because the pre-training model is obtained by training based on a large amount of labeling data, the coding model is trained based on the pre-training model, so that the coding model has better generalization capability. For the specific training process of the coding model, reference is made to the training method in the related art, which is not described in detail in the present application.
Optionally, encoding each tag by using an encoding model to obtain a first feature vector of each tag, including: inputting each label into a coding model in sequence to carry out coding treatment to obtain a first feature vector of each label; or randomly dividing each label into a plurality of label subsets, and sequentially inputting each label subset into a coding model for coding processing to obtain a first feature vector of each label in the corresponding label subset.
It will be appreciated that a label is typically in text form and may be a single character or word, or a phrase or sentence. Thus, in one embodiment, the coding model may include a word segmentation module (tokenizer), a word coding module (word_embedding), a position coding module (positional_embedding), and a sentence coding module (segment_embedding). When the coding model determines that the input label is a character or a word, the word coding module performs a first coding process on the label to obtain its first feature vector, which is then output. When the coding model determines that the input label is a phrase or sentence, the word segmentation module first segments the label to obtain a word segmentation result; the word coding module performs a first coding process on each word in the segmentation result to obtain a first coding result; the position coding module performs a second coding process on each word according to its position in the label to obtain a second coding result; the sentence coding module performs a third coding process on each word according to the sentence it belongs to, obtaining a third coding result; the first, second, and third coding results of each word are spliced to obtain a spliced vector for that word; and the spliced vectors are concatenated in the order of the words' positions in the label to obtain the first feature vector of the label, which is then output.
It can be understood that, since the sentence to which each word belongs in the word segmentation result is the tag, the result of performing the third encoding processing on each word in the word segmentation result is the same, for example, 1.
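The splicing behavior described for the coding model can be illustrated with a toy sketch. The lookup tables below stand in for trained embedding modules, and the scalar segment code of 1 follows the example above; all names and dimensions are hypothetical:

```python
def encode_tag(tokens, word_emb, pos_emb, segment_id=1):
    """For each token, concatenate its word code, position code, and
    sentence (segment) code, then concatenate the per-token vectors in
    token order. A toy stand-in for the coding model's splicing step."""
    vector = []
    for pos, tok in enumerate(tokens):
        vector += word_emb[tok] + pos_emb[pos] + [segment_id]
    return vector

# Hypothetical 2-D word codes and 1-D position codes.
word_emb = {"pre": [0.1, 0.2], "collection": [0.3, 0.4]}
pos_emb = {0: [1.0], 1: [2.0]}
feature = encode_tag(["pre", "collection"], word_emb, pos_emb)
```

Because every word in a label belongs to the same sentence, the segment code (here 1) is identical for all tokens, as noted above.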
Further, determining the similarity between the two labels in the first label combination according to the vector distance and the edit distance may include: adding the vector distance and the edit distance and taking the sum as the similarity between the two labels; or determining weights for the vector distance and the edit distance, performing a weighted addition according to the determined weights, and taking the weighted sum as the similarity between the two labels. For the specific ways of computing the vector distance and the edit distance, reference may be made to the related art; they are not described in detail here.
Therefore, the similarity between the two labels in each first label combination is determined according to the vector distance and the editing distance, rather than being based on a single vector distance or editing distance, and the accuracy of the determined similarity is greatly improved.
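As an illustration of combining the two distances, the sketch below pairs the labels with `itertools.combinations` (step S104-2) and scores each pair by a weighted sum of a Euclidean vector distance and a Levenshtein edit distance. The equal weights and the choice of Euclidean distance are assumptions; the document leaves the specific distance definitions to the related art:

```python
import itertools
import math

def edit_distance(a, b):
    """Classic Levenshtein distance via a rolling dynamic-programming row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def vector_distance(u, v):
    """Euclidean distance between two first feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pair_score(tag_a, tag_b, vec_a, vec_b, w_vec=0.5, w_edit=0.5):
    # The document combines the two distances by (weighted) addition.
    return w_vec * vector_distance(vec_a, vec_b) + w_edit * edit_distance(tag_a, tag_b)

# Hypothetical labels with stand-in 2-D feature vectors.
tags = {"principal": [0.1, 0.9], "interest": [0.8, 0.2], "principle": [0.15, 0.85]}
pairs = itertools.combinations(tags, 2)  # step S104-2: pairwise combinations
scores = {(a, b): pair_score(a, b, tags[a], tags[b]) for a, b in pairs}
```

In practice the summed value is a distance-like score, so a real system would also fix a convention for whether smaller or larger values count as "more similar".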
Step S104-6, determining the label combination corresponding to the maximum similarity corresponding to the label as the target label combination meeting the preset condition aiming at each label.
Specifically, for each tag, acquiring each candidate tag combination containing the tag from each first tag combination; and sequencing the similarity of the candidate label combinations, and determining the candidate label combination corresponding to the maximum similarity as the target label combination meeting the preset condition. The sorting process may be descending sorting or ascending sorting.
For each label, the first label combination with the maximum corresponding similarity is determined as the target label combination, and subsequent processing is carried out on this basis, which ensures the effectiveness of the annotation data integration.
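Step S104-6 can be sketched as below: for each label, keep the candidate combination containing it whose similarity score is largest. The dictionary-of-scores input shape is an assumption made for illustration:

```python
def target_combinations(scores):
    """scores maps a pair of labels to their similarity. For each label,
    keep the candidate combination containing it with the highest score
    (step S104-6)."""
    targets = {}
    for (a, b), s in scores.items():
        for tag in (a, b):
            if tag not in targets or s > targets[tag][1]:
                targets[tag] = ((a, b), s)
    # One target combination per label; a combination may repeat across labels.
    return {tag: combo for tag, (combo, _) in targets.items()}

scores = {("t1", "t2"): 0.9, ("t1", "t3"): 0.4, ("t2", "t3"): 0.7}
combos = target_combinations(scores)
```

Note how this matches the remark above: t3's best partner is t2, even though t2's best partner is t1, so the relation is not symmetric.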
For a given label, the higher the similarity between the training samples corresponding to that label, the higher its internal cohesion tends to be, and the lower its separability. On this basis, in one or more embodiments of the present application, the training sample subsets corresponding to the first label and to the second label of each target label combination are obtained from the training samples included in the annotation data, yielding the training sample subsets corresponding to each target label combination. Specifically, as shown in fig. 3, step S106 includes the following steps S106-2 to S106-10:
Step S106-2, dividing each training sample to obtain a plurality of training sample sets; each training sample set corresponds to a label;
specifically, according to the labels of all the training samples, the training samples of the same label are divided into one training sample set, and a plurality of training sample sets are obtained.
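This grouping step can be sketched as follows, assuming the annotation data is available as (sample, label) pairs:

```python
from collections import defaultdict

def split_by_label(annotated):
    """Step S106-2 as a sketch: group training samples sharing a label
    into one training sample set. The (sample, label) pair format is an
    assumption made for illustration."""
    sets_by_label = defaultdict(list)
    for sample, label in annotated:
        sets_by_label[label].append(sample)
    return dict(sets_by_label)

sample_sets = split_by_label([("s1", "A"), ("s2", "B"), ("s3", "A")])
```

Each resulting set corresponds to exactly one label, as step S106-2 requires.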
Step S106-4, determining whether the number of training samples included in the training sample set is smaller than a preset number for each training sample set;
The preset number can be set according to needs in practical application, for example, the preset number is 10.
Step S106-6, if the number of the training samples included in the training sample set is smaller than the preset number, determining the training sample set as a training sample subset of the corresponding label;
When the number of training samples included in a training sample set is smaller than the preset number, this indicates that the set does not need to be subdivided further, so the training sample set itself is determined as a training sample subset of the corresponding label.
Step S106-8, if the number of the training samples included in the training sample set is not less than the preset number, clustering the training samples in the training sample set to obtain a plurality of clustering results; determining the clustering result as a training sample subset of the labels corresponding to the training sample set;
Specifically, if the number of training samples included in the training sample set is not less than the preset number, the training samples in the set are input into a conversion model for conversion processing to obtain a second feature vector of each training sample, and clustering is performed based on the second feature vectors to obtain a plurality of clustering results. It will be appreciated that the conversion model may vary with the type of the training samples. For example, if the training samples are texts, the conversion model may be a BERT model; if the training samples are images, the conversion model may be a ResNet model, and so on. The specific implementation of the clustering process can be set as needed in practical applications; in one implementation, the clustering may be performed based on the K-means algorithm. For the specific processing of the K-means algorithm, reference may be made to the related art; it is not described in detail here.
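The threshold-then-cluster logic of steps S106-4 to S106-8 can be sketched as below. For simplicity the sketch clusters 1-D feature values with a minimal hand-written k-means; a real implementation would first encode samples with a conversion model (e.g., BERT or ResNet) and use a library K-means. The preset number, k, and iteration count are illustrative:

```python
def subsets_for_label(values, preset_number=10, iters=10):
    """A set smaller than the preset number becomes a single subset
    (step S106-6); otherwise it is clustered into subsets (step S106-8).
    Here a toy 2-center k-means over scalar features stands in for the
    real clustering of second feature vectors."""
    if len(values) < preset_number:
        return [values]
    centers = [min(values), max(values)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

small = subsets_for_label([0.1, 0.2, 0.3], preset_number=10)
big = subsets_for_label([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3, 0.3, 5.4, 0.4],
                        preset_number=10)
```

The small set is returned unchanged as one subset, while the large set splits into two subsets around its two natural centers.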
Step S106-10, determining the training sample subset of the labels as the training sample subset corresponding to the target label combination where the labels are located.
For example, target label combination 1 includes label 1 and label 3, and target label combination 2 includes label 2 and label 3; the training sample subset of the label 1 is a training sample subset 1, the training sample subset of the label 2 is a training sample subset 2 and a training sample subset 3, and the training sample subset of the label 3 is a training sample subset 4, a training sample subset 5 and a training sample subset 6; determining the training sample subset 1, the training sample subset 4, the training sample subset 5 and the training sample subset 6 as the training sample subset corresponding to the target label combination 1; training sample subset 2, training sample subset 3, training sample subset 4, training sample subset 5, and training sample subset 6 are determined as the training sample subset corresponding to target tag combination 2.
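The mapping in the example above (step S106-10) amounts to taking the union of each member label's subsets; a minimal sketch, with the label and subset names taken from the example:

```python
def subsets_for_combination(labels, label_to_subsets):
    """Step S106-10: a target label combination's training sample subsets
    are the union of its member labels' subsets."""
    collected = []
    for label in labels:
        collected.extend(label_to_subsets[label])
    return collected

label_to_subsets = {
    "label1": ["subset1"],
    "label2": ["subset2", "subset3"],
    "label3": ["subset4", "subset5", "subset6"],
}
```

For target label combination 1 (label 1 and label 3) this yields subsets 1, 4, 5, and 6, matching the worked example.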
After obtaining the training sample subset corresponding to the target label combination, determining a first integration degree of each label in the target label combination based on the similarity between the training sample subsets, thereby determining the target labels meeting the integration condition in the target label combination according to the first integration degree. Specifically, as shown in FIG. 4, step S108 may include the following steps S108-2 and S108-4:
step S108-2, determining a first integration degree of the labels in the target label combination according to the similarity among a plurality of training sample subsets corresponding to the target label combination aiming at each target label combination;
specifically, as shown in FIG. 5, step S108-2 may include the following steps S108-22 to S108-26:
step S108-22, determining a sample distance threshold corresponding to the target label combination according to the training sample subset corresponding to the target label combination for each target label combination;
As previously described, the target tag combination includes a first tag and a second tag, where for ease of distinction, the training sample subset of the first tag is referred to as a first training sample subset and the training sample subset of the second tag is referred to as a second training sample subset. Accordingly, step S108-22 may include: inputting each training sample in the first training sample subset and the second training sample subset corresponding to the target label combination into a feature detection model for feature detection processing to obtain sample features of each training sample; determining a first distance between each first training sample subset and each second training sample subset according to the sample characteristics; the first distance characterizes a similarity between the first training sample subset and the second training sample subset; and determining an average distance of the first distances, and determining the average distance as a sample distance threshold corresponding to the target tag combination.
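The averaging in step S108-22 can be sketched as follows; `first_distance` is a hypothetical callable standing in for the feature-based first distance described below:

```python
from itertools import product
from statistics import mean

def sample_distance_threshold(first_subsets, second_subsets, first_distance):
    """Step S108-22: average the first distance over every
    (first subset, second subset) pair to obtain the threshold."""
    return mean(first_distance(a, b)
                for a, b in product(first_subsets, second_subsets))
```

With 2 first subsets and 2 second subsets, four first distances are averaged.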
It will be appreciated that the feature detection model varies with the type of the training samples. In one embodiment, the training samples are text, and the feature detection model may include a sentence pattern detection model, a sentence class detection model, and a keyword detection model. The sentence pattern detection model can be obtained by setting a sentence pattern classification task and training on the basis of a coding model; its training samples are texts, and the label of each training sample is a sentence pattern. The sentence class detection model can be obtained by setting a sentence class classification task and training on the basis of the coding model; its training samples are texts, and the label of each training sample is a sentence class. The keyword detection model can be obtained by setting a keyword classification task and training on the basis of the coding model; its training samples are texts, and the label of each training sample is a keyword. The specific training process of each detection model can refer to the model training process of the related art, and is not described in detail in the present application.
Correspondingly, the step of inputting each training sample in the first training sample subset and the second training sample subset corresponding to the target label combination into the feature detection model for feature detection processing to obtain sample features of each training sample may include: inputting each training sample in the first training sample subset and the second training sample subset corresponding to the target label combination into the sentence pattern detection model for sentence pattern detection processing to obtain sentence pattern features of each training sample; inputting each training sample into the sentence class detection model for sentence class detection processing to obtain sentence class features of each training sample; inputting each training sample into the keyword detection model for keyword detection processing to obtain keywords of each training sample; and determining the sentence pattern features, the sentence class features, and the keywords as the sample features of each training sample. The sentence pattern features include subject-predicate sentences and non-subject-predicate sentences (such as single-word sentences, serial-verb sentences, and the like), and the sentence class features include declarative sentences, imperative sentences, exclamatory sentences, interrogative sentences, and the like.
Further, when the training samples are text, and corresponding to the above sample features, the first distance = sentence pattern distance + sentence class distance + keyword distance. That is, determining the first distance between each first training sample subset and each second training sample subset according to the sample features may include: determining a sentence pattern distance between each first training sample subset and each second training sample subset according to the sentence pattern features; determining a sentence class distance between each first training sample subset and each second training sample subset according to the sentence class features; determining a keyword distance between each first training sample subset and each second training sample subset according to the keywords; and adding the sentence pattern distance, the sentence class distance, and the keyword distance to obtain the first distance between each first training sample subset and each second training sample subset.
Determining the sentence pattern distance between a first training sample subset i and a second training sample subset j according to the sentence pattern features may include: for each first training sample in the first training sample subset i, determining a per-sample sentence pattern distance between the first training sample and each second training sample in the second training sample subset j according to their sentence pattern features; and adding the per-sample sentence pattern distances corresponding to the first training sample subset i and the second training sample subset j to obtain the sentence pattern distance between the two subsets. The first training sample subset i is any one of the at least one first training sample subset of the first label in the target label combination, and the second training sample subset j is any one of the at least one second training sample subset of the second label in the target label combination. As an example, if the first label in the target label combination has 2 first training sample subsets and the second label has 3 second training sample subsets, 6 sentence pattern distances may be determined.
The sentence class distance and the keyword distance are determined in the same manner as the sentence pattern distance; reference may be made to the above description of determining the sentence pattern distance, and the repetition is not repeated here.
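The distance computation described above can be sketched as follows; `sample_distance` is a hypothetical per-sample distance over sentence pattern features, standing in for whatever per-feature metric is used in practice:

```python
def subset_pattern_distance(first_subset, second_subset, sample_distance):
    """Sum the per-sample sentence pattern distances over all
    (first sample, second sample) pairs, for subsets i and j."""
    return sum(sample_distance(a, b)
               for a in first_subset for b in second_subset)

def first_distance(pattern_d, class_d, keyword_d):
    # first distance = sentence pattern distance + sentence class distance
    #                + keyword distance
    return pattern_d + class_d + keyword_d
```

The sentence class and keyword components would be accumulated the same way before summing the three into the first distance.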
It should be noted that, when the training samples are of another type, such as images or audio, the specific content of the feature detection model and the sample features can be set as needed in practical applications, which is not specifically limited in the present application.
Step S108-24, obtaining a target sample corresponding to each training sample corresponding to the target label combination based on the sample generation model and the sample distance threshold;
specifically, based on a sample generation model and a sample distance threshold, a sample enhancement mode is adopted to obtain a target sample corresponding to each training sample in a training sample subset corresponding to the target label combination.
More specifically, each first training sample in the first training sample subset is input into a sample generation model for sample generation processing to obtain at least one first generation sample corresponding to each first training sample, where the sample distance between each first generation sample and the first training sample is smaller than the sample distance threshold; each second training sample in the second training sample subset is input into the sample generation model for sample generation processing to obtain at least one second generation sample corresponding to each second training sample, where the sample distance between each second generation sample and the second training sample is smaller than the sample distance threshold; for each first training sample, the first generation sample having the most target features among the first generation samples corresponding to the first training sample is determined as the target first generation sample corresponding to the first training sample; and for each second training sample, the second generation sample having the most target features among the second generation samples corresponding to the second training sample is determined as the target second generation sample corresponding to the second training sample. The target features include the sample features corresponding to the first tag and the sample features corresponding to the second tag; that is, the target features may be understood as the set of sample features corresponding to the first tag and the second tag.
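The distance-constrained generation step can be sketched as follows. Both `generate` and `sample_distance` are hypothetical callables: the former stands in for the sample generation model, the latter for the first-distance computation described earlier.

```python
def admissible_generated_samples(training_sample, generate, sample_distance,
                                 threshold, n_candidates=5):
    """Step S108-24 sketch: generate candidates and keep only those whose
    sample distance to the original stays below the sample distance threshold."""
    candidates = [generate(training_sample, i) for i in range(n_candidates)]
    return [c for c in candidates
            if sample_distance(training_sample, c) < threshold]
```

In the patent the model itself is trained to respect the distance constraint; filtering after generation is just one way to illustrate the constraint.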
The sample generation model is trained in advance. Taking labeling data whose training samples are text as an example, a training sample of the sample generation model may be a first text, and its label may be a second text together with a first distance between the first text and the second text. For the training process of the sample generation model, reference may be made to a model training process in the related art, which is not described in detail in the present application.
Determining the first generation sample having the most target features among the first generation samples corresponding to a first training sample as the target first generation sample corresponding to the first training sample may include: for each first generation sample corresponding to the first training sample, matching the features included in the first generation sample with the target features, and counting the first number of target features successfully matched; and comparing the first numbers to obtain the maximum first number, and determining the first generation sample corresponding to the maximum first number as the target first generation sample corresponding to the first training sample.
Similarly, determining the second generation sample having the most target features among the second generation samples corresponding to a second training sample as the target second generation sample corresponding to the second training sample may include: for each second generation sample corresponding to the second training sample, matching the features included in the second generation sample with the target features, and counting the second number of target features successfully matched; and comparing the second numbers to obtain the maximum second number, and determining the second generation sample corresponding to the maximum second number as the target second generation sample corresponding to the second training sample.
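The feature-matching selection applied to both the first and second generation samples can be sketched as one function; the dict shape of the generated samples is an assumption made for illustration:

```python
def pick_target_generated_sample(generated, target_features):
    """Pick the generated sample whose features match the most target
    features (the max-count rule described above)."""
    def matched_count(features):
        return sum(1 for f in features if f in target_features)
    return max(generated, key=lambda g: matched_count(g["features"]))
```

With the target features being the union of the sample features of the first and second tags, the selected sample is the one carrying the most of them.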
In this way, the sample distance threshold is used as a sample enhancement condition, and each training sample is subjected to sample enhancement through the sample generation model, so that the target sample containing a plurality of target features corresponding to each training sample is obtained, and the first integration degree of the corresponding label can be accurately determined based on the target sample later.
Step S108-26, determining the first integration degree of the labels in the target label combination according to the target sample.
Specifically, the target first generation sample corresponding to each first training sample is input into a multi-classification model for first label prediction processing to obtain a first multi-classification result of each target first generation sample; the target first generation sample corresponding to each first training sample is input into a two-classification model for second label prediction processing to obtain a first two-classification result of each target first generation sample; a first integration degree of the corresponding first label is determined according to each first multi-classification result and each first two-classification result; the target second generation sample corresponding to each second training sample is input into the multi-classification model for first label prediction processing to obtain a second multi-classification result of each target second generation sample; the target second generation sample corresponding to each second training sample is input into the two-classification model for second label prediction processing to obtain a second two-classification result of each target second generation sample; and a first integration degree of the corresponding second label is determined according to each second multi-classification result and each second two-classification result.
Determining the first integration degree of the corresponding first label according to each first multi-classification result and each first two-classification result may include: obtaining, in each first multi-classification result, a first probability that the label of the corresponding target first generation sample is the first label; obtaining, in each first two-classification result, a second probability that the label of the corresponding target first generation sample is the first label; and determining a first average value of the first probabilities and a second average value of the second probabilities as the first integration degree of the corresponding first label.
Determining the first integration degree of the corresponding second label according to each second multi-classification result and each second two-classification result may include: obtaining, in each second multi-classification result, a third probability that the label of the corresponding target second generation sample is the second label; obtaining, in each second two-classification result, a fourth probability that the label of the corresponding target second generation sample is the second label; and determining a third average value of the third probabilities and a fourth average value of the fourth probabilities as the first integration degree of the corresponding second label.
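The averaging that produces a label's first integration degree can be sketched as follows; the two probability lists stand in for the per-sample outputs of the multi-classification and two-classification models:

```python
from statistics import mean

def first_integration_degree(multi_probs, binary_probs):
    """Average the label probabilities from the multi-classification model
    and the two-classification model; the pair of averages is the label's
    first integration degree."""
    return mean(multi_probs), mean(binary_probs)
```

For the first label these are the first and second average values; for the second label, the third and fourth.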
It should be noted that the specific training processes of the multi-classification model and the two-classification model may refer to the training processes in the related art, and will not be described in detail in this disclosure. The first multi-classification result includes the first probability that the label of the target first generation sample belongs to each preset label among a plurality of preset labels, the plurality of preset labels including the first label; similarly, the second multi-classification result includes the third probability that the label of the target second generation sample belongs to each preset label among the plurality of preset labels, the plurality of preset labels including the second label. The first two-classification result includes the second probability that the label of the target first generation sample is the first label and a fifth probability that it is not the first label; similarly, the second two-classification result includes the fourth probability that the label of the target second generation sample is the second label and a sixth probability that it is not the second label.
In this way, the first integration degree is determined from different angles based on the multi-classification model and the two-classification model, which guarantees the accuracy of the first integration degree.
Step S108-4, determining the target label meeting the integration condition in the target label combination according to the first integration degree.
The first integration degree may characterize the internal aggregation degree and the internal splitting degree of a label. It will be appreciated that the internal aggregation degree is inversely related to the internal splitting degree: the higher the internal aggregation degree, the lower the internal splitting degree. In one embodiment, the target label may be determined based on the internal aggregation degree. Specifically, for either of the first label and the second label in each target label combination, if the first integration degree of the label indicates that the internal aggregation degree of the label is smaller than a preset threshold, the label is determined to be a first target label meeting the integration condition; and if the first integration degree of the first label in a target label combination indicates that the internal aggregation degree of the first label is not smaller than the preset threshold and the first integration degree of the second label indicates that the internal aggregation degree of the second label is not smaller than the preset threshold, the first label and the second label in the target label combination are determined to be second target labels meeting the integration condition.
More specifically, for each target label combination, it may first be determined whether the first probability and the second probability included in the first integration degree of the first label are both smaller than the preset threshold; if yes, the internal aggregation degree of the first label is low (i.e., its internal splitting degree is high), and the first label is determined to be a first target label meeting the integration condition. It may then be determined whether the third probability and the fourth probability included in the first integration degree of the second label are both smaller than the preset threshold; if yes, the internal aggregation degree of the second label is low (i.e., its internal splitting degree is high), and the second label is determined to be a first target label meeting the integration condition. Next, it may be determined whether the first probability and the second probability included in the first integration degree of the first label are both not smaller than the preset threshold; if yes, the internal aggregation degree of the first label is high (its internal splitting degree is low), and the first label is not splittable. Likewise, it may be determined whether the third probability and the fourth probability included in the first integration degree of the second label are both not smaller than the preset threshold; if yes, the internal aggregation degree of the second label is high (its internal splitting degree is low), and the second label is not splittable. If the first integration degree of the first label indicates that the internal aggregation degree of the first label is not smaller than the preset threshold and the first integration degree of the second label indicates that the internal aggregation degree of the second label is not smaller than the preset threshold, the first label and the second label can be fused, and the first label and the second label in the target label combination are determined to be second target labels meeting the integration condition. The preset threshold can be set as needed in practical applications, for example, 0.65 or 0.7.
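The threshold rules above can be condensed into one decision function; each degree is the pair of averaged probabilities for a label, and the 0.7 default is one of the example threshold values:

```python
def integration_decision(first_degree, second_degree, threshold=0.7):
    """Both probabilities of a label below the threshold -> the label is a
    first target label (split); both labels entirely at or above the
    threshold -> second target labels (fuse)."""
    decisions = []
    if all(p < threshold for p in first_degree):
        decisions.append(("split", "first label"))
    if all(p < threshold for p in second_degree):
        decisions.append(("split", "second label"))
    if (all(p >= threshold for p in first_degree)
            and all(p >= threshold for p in second_degree)):
        decisions.append(("fuse", "both labels"))
    return decisions
```

A mixed result (one probability above, one below) yields no decision for that combination.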
It is noted that the target label may also be determined based on the internal splitting degree. That is, for either of the first label and the second label in each target label combination, if the first integration degree of the label indicates that the internal splitting degree of the label is not smaller than a preset threshold, the label is determined to be a first target label meeting the integration condition; and if the first integration degree of the first label in a target label combination indicates that the internal splitting degree of the first label is smaller than the preset threshold and the first integration degree of the second label indicates that the internal splitting degree of the second label is smaller than the preset threshold, the first label and the second label in the target label combination are determined to be second target labels meeting the integration condition.
Because the first integration degree of a label characterizes its internal aggregation degree (equivalently, its internal splitting degree), and the internal aggregation degree reflects both the splittability of a single label and the fusibility of two non-splittable labels in a target label combination, determining the target labels meeting the integration condition based on the first integration degree guarantees the accuracy of the target labels.
Corresponding to the first target tag and the second target tag described above, as shown in FIG. 6, step S110 may include the following steps S110-2 to S110-6:
Step S110-2, if the target label comprises a first target label, splitting the first target label to obtain a plurality of sub-labels; obtaining a sub-sample corresponding to each sub-label from a training sample corresponding to the first target label; determining each sub-label as a third target label of the corresponding sub-sample to obtain target labeling data;
Specifically, when the training sample subset corresponding to the first target label is obtained through the clustering process, splitting the first target label to obtain a plurality of sub-labels may include: matching the first target label with a designated keyword set, and determining the plurality of successfully matched keywords in the first target label as the plurality of sub-labels obtained by splitting the first target label, where the keyword set includes the centroids corresponding to the clustering results and synonyms of the centroids; or inputting the first target label into a label splitting model for label splitting processing to obtain the plurality of sub-labels. When the training sample subset corresponding to the first target label is not obtained through the clustering process, splitting the first target label to obtain a plurality of sub-labels may include: inputting the first target label into the label splitting model for label splitting processing to obtain the plurality of sub-labels. The specific training process of the label splitting model can refer to a model training process in the related art, which is not specifically limited in the present application.
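The keyword-matching option for splitting can be sketched as follows; the label and keyword strings are hypothetical, and a real keyword set would hold the cluster centroids and their synonyms:

```python
def split_label_by_keywords(label, keyword_list):
    """Match the first target label against the designated keyword set
    and return the successfully matched keywords as sub-labels."""
    return [kw for kw in keyword_list if kw in label]
```

Unmatched keywords in the set simply contribute no sub-label.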
In one embodiment, in step S110-2, obtaining the sub-sample corresponding to each sub-label from the training samples corresponding to the first target label may include: inputting the first target label and each sub-label into a sample splitting model for sample splitting processing to obtain the sub-sample corresponding to each sub-label. The specific training process of the sample splitting model can refer to a model training process in the related art, which is not specifically limited in the present application.
As an example, the first target tag is split into a sub-tag 1 and a sub-tag 2, where the sub-tag 1 corresponds to the sub-sample 1, and the sub-tag 2 corresponds to the sub-sample 2, and the sub-tag 1 is determined as a third target tag of the sub-sample 1, so as to obtain target labeling data 1; and determining the sub-label 2 as a third target label of the sub-sample 2 to obtain target labeling data 2.
It should be noted that, the specific modes of the label splitting process and the sample splitting process are not limited to the above modes, and can be set according to the needs in practical applications.
Step S110-4, if the target label comprises a second target label, performing fusion processing on the first label and the second label in the second target label to obtain a fusion label; determining the fusion label as a third target label of each training sample corresponding to the first label and the second label to obtain target labeling data;
The fusing processing is performed on the first tag and the second tag in the second target tag to obtain a fused tag, which may include: splicing the first label and the second label to obtain a fusion label; or generating a new label according to the first label and the second label, and determining the new label as a fusion label.
As an example, if the first tag in the second target tag is tag a and the second tag is tag B, then AB is determined to be a fusion tag. As another example, if the first tag in the second target tag is tag C and the second tag is tag D, the generated tag F is determined to be a fusion tag.
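The splicing option in the example (tag A and tag B fused into AB) is direct concatenation:

```python
def fuse_labels(first_label, second_label):
    """The splicing option above: concatenate the two labels of a second
    target label into one fusion label."""
    return first_label + second_label
```

The alternative option, generating a new label F from C and D, would instead call a label generation model, which is outside this sketch.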
Step S110-6, determining each target annotation data as a target annotation data set.
In this way, the integration of scattered annotation data is realized, so that the integrated target annotation data set can be used in subsequent related tasks, i.e., the multiplexing rate of the annotation data is improved. For example, in a related model training task, the target annotation data set can be added to the training set as new training data, thereby expanding the training data, enabling the model to learn more related knowledge, and improving the generalization capability of the model.
In order to ensure the accuracy of the target annotation data set, in one or more embodiments of the present application, after step S110, the method may further include: verifying the rationality of the target annotation data set; and if the verification is not passed, performing the integration processing on the target annotation data set again. Specifically, the labels in the target annotation data set can be combined in pairs to obtain a plurality of second label combinations; a training sample subset corresponding to each second label combination is acquired from the target annotation data set; for each second label combination, a second integration degree of the labels in the second label combination is determined according to the similarity among the plurality of training sample subsets corresponding to the second label combination; and if no fourth target label meeting the integration condition exists according to the second integration degree, the verification is determined to be passed. The process of obtaining the training sample subsets can refer to the foregoing related description; the process of determining the second integration degree is the same as the process of determining the first integration degree, and reference may be made to the related description above, which is not repeated here.
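The verification loop can be sketched as follows; `second_integration_degree` is a hypothetical callable standing in for the full pipeline (subset extraction, sample generation, and classification) that yields the pair of integration-degree tuples for two labels:

```python
from itertools import combinations

def verify_target_set(labels, second_integration_degree, threshold=0.7):
    """Rationality check sketch: pair every two labels, recompute the
    integration degree, and pass only when no label still meets the
    integration condition (no 'fourth target label')."""
    for a, b in combinations(labels, 2):
        for degree in second_integration_degree(a, b):
            if all(p < threshold for p in degree):
                return False   # a fourth target label exists: re-integrate
    return True
```

A `False` result triggers another round of integration processing on the target annotation data set.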
In one embodiment, the training samples are training texts; correspondingly, the labeling data may also be called labeling corpora, the sample generation model may be a text generation model, and the sample distance threshold may be a text distance threshold. The data integration method may include the following steps A2 to A34:
Step A2, acquiring a plurality of labeling corpora, wherein each labeling corpus comprises a training text and a label of the training text;
Step A4, if a plurality of labeling corpora with different attributes and the same label exist in the obtained labeling corpora, adding distinguishing marks to the labels of the labeling corpora;
Step A6, carrying out pairwise combination on each label to obtain a plurality of first label combinations;
Step A8, determining the similarity between two labels in each first label combination;
Step A10, determining a label combination corresponding to the maximum similarity corresponding to each label as a target label combination meeting a preset condition;
Step A12, dividing each training text to obtain a plurality of training text sets; each training text set corresponds to a label;
Step A14, determining whether the number of training texts included in the training text set is smaller than a preset number for each training text set;
Step A16, if the number of the training texts included in the training text set is smaller than the preset number, determining the training text set as a training text subset of the corresponding label;
Step A18, if the number of the training texts included in the training text set is not less than the preset number, clustering the training texts in the training text set to obtain a plurality of clustering results; determining the clustering results as training text subsets of the label corresponding to the training text set;
Step A20, determining the training text subsets of each label as the training text subsets corresponding to the target label combination where the label is located;
Step A22, determining a text distance threshold corresponding to the target label combination according to the training text subset corresponding to the target label combination aiming at each target label combination;
Step A24, acquiring a target text corresponding to each training text corresponding to the target tag combination based on the text generation model and the text distance threshold;
Step A26, determining a first integration degree of the labels in the target label combination according to the target text;
Step A28, determining target labels meeting the integration conditions in the target label combination according to the first integration degree;
Step A30, if the target label comprises a first target label, splitting the first target label to obtain a plurality of sub-labels; acquiring a sub-text corresponding to each sub-label from the training text corresponding to the first target label; determining each sub-label as a third target label of the corresponding sub-text to obtain a target labeling corpus;
Step A32, if the target label comprises a second target label, performing fusion processing on the first label and the second label in the second target label to obtain a fusion label; determining the fusion tag as a third target tag of each training text corresponding to the first tag and the second tag, and obtaining a target labeling corpus;
Step A34, determining each target labeling corpus as a target labeling corpus set.
For the specific implementation of the above steps A2 to A34, reference may be made to the foregoing related description; repeated details are not described here again.
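For illustration only, the pairwise combination and maximum-similarity selection of steps A6 to A10 can be sketched in Python as follows; `char_overlap` is a hypothetical stand-in for the model-based label similarity described above, not the similarity actually used in the embodiments.

```python
from itertools import combinations

def target_label_combinations(labels, similarity):
    """Steps A6-A10 (sketch): combine labels pairwise, score every pair,
    and keep, for each label, the pair with the maximum similarity as a
    target label combination."""
    pairs = list(combinations(labels, 2))                 # Step A6
    scored = {pair: similarity(*pair) for pair in pairs}  # Step A8
    targets = set()
    for label in labels:                                  # Step A10
        candidates = [p for p in pairs if label in p]
        if candidates:
            targets.add(max(candidates, key=lambda p: scored[p]))
    return targets

def char_overlap(a, b):
    """Toy similarity: Jaccard overlap of character sets (a stand-in
    for the encoder-based similarity of step A8)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```

For example, with labels "refund", "refunds", and "login", the pair ("refund", "refunds") is selected as the best pair for the first two labels, while "login" contributes its own maximum-similarity pair.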
In one or more embodiments of the present application, a plurality of labeling data, each comprising a training sample and a label, are obtained, and a plurality of target label combinations satisfying a preset condition are determined among the labels; a plurality of training sample subsets corresponding to each target label combination are acquired from the training samples; target labels satisfying an integration condition in the corresponding target label combinations are determined according to the similarity between the training sample subsets; and the target labels and the training samples corresponding to the target labels are integrated to obtain a target labeling data set. In this way, scattered labeling data are integrated, so that the resulting target labeling data set can be used by subsequent related tasks, greatly improving the reuse rate of labeling data and avoiding problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal aggregation degree of the corresponding labels, determining the target labels that satisfy the integration condition on this basis ensures the accuracy of the target labels and, in turn, the accuracy of the target labeling data set.
Corresponding to the data integration method described above, one or more embodiments of the present application further provide a data integration device based on the same technical concept. Fig. 7 is a schematic block diagram of a data integration device according to one or more embodiments of the present application, and as shown in fig. 7, the device includes:
A first obtaining module 201, configured to obtain a plurality of labeling data; each annotation data comprises a training sample and a label of the training sample;
a first determining module 202, configured to determine a plurality of target tag combinations that satisfy a preset condition in the tags;
a second obtaining module 203, configured to obtain, from the training samples, a training sample subset corresponding to each of the target tag combinations;
a second determining module 204, configured to determine, according to the similarity between the training sample subsets, a target tag that satisfies an integration condition in the corresponding target tag combination;
An integration module 205, configured to integrate the target tag and the training sample corresponding to the target tag to obtain a target labeling data set.
The data integration device provided by the embodiment of the application obtains a plurality of labeling data, each comprising a training sample and a label, and determines a plurality of target label combinations satisfying a preset condition among the labels; acquires a plurality of training sample subsets corresponding to each target label combination from the training samples; determines, according to the similarity between the training sample subsets, target labels satisfying an integration condition in the corresponding target label combinations; and integrates the target labels and the training samples corresponding to the target labels to obtain a target labeling data set. In this way, scattered labeling data are integrated, so that the resulting target labeling data set can be used by subsequent related tasks, greatly improving the reuse rate of labeling data and avoiding problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal aggregation degree of the corresponding labels, determining the target labels that satisfy the integration condition on this basis ensures the accuracy of the target labels and, in turn, the accuracy of the target labeling data set.
It should be noted that, the embodiment of the data integration device in the present application and the embodiment of the data integration method in the present application are based on the same inventive concept, so the implementation of this embodiment may refer to the implementation of the corresponding data integration method, and the repetition is not repeated.
Each module in the data integration apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the terminal device or the server in hardware form, or may be stored in a memory in the terminal device or the server in software form, so that the processor can invoke and execute the operations corresponding to the above modules.
Further, according to the above-described data integration method, based on the same technical concept, one or more embodiments of the present application further provide an electronic device, where the electronic device is configured to perform the above-described data integration method, and fig. 8 is a schematic structural diagram of an electronic device provided by one or more embodiments of the present application.
As shown in fig. 8, the electronic device may vary considerably in configuration or performance, and may include one or more processors 301 and a memory 302, where the memory 302 may store one or more application programs or data. The memory 302 may be transient storage or persistent storage. The application programs stored in the memory 302 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the electronic device. Still further, the processor 301 may be configured to communicate with the memory 302 and execute the series of computer-executable instructions in the memory 302 on the electronic device. The electronic device may also include one or more power supplies 303, one or more wired or wireless network interfaces 304, one or more input/output interfaces 305, one or more keyboards 306, and the like.
In one particular embodiment, an electronic device includes a memory and one or more programs, where the one or more programs are stored in the memory, the one or more programs may include one or more modules, each module may include a series of computer-executable instructions for the electronic device, and the one or more programs, when executed by the one or more processors, include computer-executable instructions for:
acquiring a plurality of labeling data; each annotation data comprises a training sample and a label of the training sample;
Determining a plurality of target tag combinations which meet preset conditions in the tags;
acquiring a plurality of training sample subsets corresponding to each target label combination from the training samples;
Determining target labels meeting the integration condition in the corresponding target label combination according to the similarity between the training sample subsets;
And integrating the target label and the training sample corresponding to the target label to obtain a target labeling data set.
The electronic device provided by one or more embodiments of the present application obtains a plurality of labeling data, each comprising a training sample and a label, and determines a plurality of target label combinations satisfying a preset condition among the labels; acquires a plurality of training sample subsets corresponding to each target label combination from the training samples; determines, according to the similarity between the training sample subsets, target labels satisfying an integration condition in the corresponding target label combinations; and integrates the target labels and the training samples corresponding to the target labels to obtain a target labeling data set. In this way, scattered labeling data are integrated, so that the resulting target labeling data set can be used by subsequent related tasks, greatly improving the reuse rate of labeling data and avoiding problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal aggregation degree of the corresponding labels, determining the target labels that satisfy the integration condition on this basis ensures the accuracy of the target labels and, in turn, the accuracy of the target labeling data set.
It should be noted that, the embodiment of the present application related to the electronic device and the embodiment of the present application related to the data integration method are based on the same inventive concept, so the specific implementation of this embodiment may refer to the implementation of the corresponding data integration method, and the repetition is not repeated.
Further, in accordance with the above-described data integration method and based on the same technical concept, one or more embodiments of the present application further provide a storage medium for storing computer-executable instructions. In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like, and the computer-executable instructions stored in the storage medium, when executed by a processor, can implement the following flow:
acquiring a plurality of labeling data; each annotation data comprises a training sample and a label of the training sample;
Determining a plurality of target tag combinations which meet preset conditions in the tags;
acquiring a plurality of training sample subsets corresponding to each target label combination from the training samples;
Determining target labels meeting the integration condition in the corresponding target label combination according to the similarity between the training sample subsets;
And integrating the target label and the training sample corresponding to the target label to obtain a target labeling data set.
When the computer-executable instructions stored in the storage medium provided by one or more embodiments of the present application are executed by a processor, a plurality of labeling data, each comprising a training sample and a label, are obtained, and a plurality of target label combinations satisfying a preset condition are determined among the labels; a plurality of training sample subsets corresponding to each target label combination are acquired from the training samples; target labels satisfying an integration condition in the corresponding target label combinations are determined according to the similarity between the training sample subsets; and the target labels and the training samples corresponding to the target labels are integrated to obtain a target labeling data set. In this way, scattered labeling data are integrated, so that the resulting target labeling data set can be used by subsequent related tasks, greatly improving the reuse rate of labeling data and avoiding problems such as repeated labeling. Moreover, because the similarity between the training sample subsets characterizes the internal aggregation degree of the corresponding labels, determining the target labels that satisfy the integration condition on this basis ensures the accuracy of the target labels and, in turn, the accuracy of the target labeling data set.
It should be noted that, the embodiments related to the storage medium and the embodiments related to the data integration method in the present application are based on the same inventive concept, so the specific implementation of this embodiment may refer to the implementation of the corresponding data integration method, and the repetition is not repeated.
The foregoing describes certain embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, or embedded microcontrollers. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely in computer-readable program code, it is entirely possible to logically program the method steps so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing the various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present application.
It will be appreciated by those skilled in the art that one or more embodiments of the application may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
One or more embodiments of the application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present application are described in a progressive manner, and the same and similar parts of the embodiments are all referred to each other, and each embodiment is mainly described in the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.

Claims (15)

1. A method of data integration, comprising:
acquiring a plurality of labeling data; each annotation data comprises a training sample and a label of the training sample;
Determining a plurality of target tag combinations which meet preset conditions in the tags;
acquiring a plurality of training sample subsets corresponding to each target label combination from the training samples;
Determining target labels meeting the integration condition in the corresponding target label combination according to the similarity between the training sample subsets;
And integrating the target label and the training sample corresponding to the target label to obtain a target labeling data set.
2. The method of claim 1, wherein the determining a plurality of target tag combinations of the tags that satisfy a preset condition comprises:
the labels are combined in pairs to obtain a plurality of first label combinations;
Determining the similarity between two tags in each first tag combination;
And determining a first label combination corresponding to the maximum similarity corresponding to each label as a target label combination meeting a preset condition.
3. The method of claim 2, wherein said determining a similarity between two tags in each of said first tag combinations comprises:
Coding each tag through a coding model to obtain a first feature vector of each tag;
for each first label combination, determining a vector distance between two labels in the first label combination according to the first feature vector, and determining an editing distance between the two labels in the first label combination;
and determining the similarity between the two labels in the first label combination according to the vector distance and the editing distance.
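A minimal sketch of how the vector distance and the editing distance of claim 3 might be combined into one similarity score; the 1/(1+d) normalization and the mixing weight `alpha` are illustrative assumptions, not requirements of the claim.

```python
import math

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein editing distance by dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds dp[j-1] from the previous row
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def label_similarity(vec_a, vec_b, a: str, b: str, alpha: float = 0.5) -> float:
    """Map both distances into (0, 1] and mix them with weight alpha
    (the mapping and weighting are assumptions for this sketch)."""
    vec_sim = 1.0 / (1.0 + math.dist(vec_a, vec_b))  # Euclidean vector distance
    edit_sim = 1.0 / (1.0 + edit_distance(a, b))
    return alpha * vec_sim + (1 - alpha) * edit_sim
```

Identical labels with identical feature vectors yield similarity 1.0; either a larger vector distance or a larger editing distance pushes the score toward 0.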
4. The method of claim 1, wherein the obtaining a training sample subset corresponding to each of the target tag combinations from the training samples comprises:
Dividing each training sample to obtain a plurality of training sample sets; each training sample set corresponds to one label;
For each training sample set, determining whether the number of the training samples included in the training sample set is smaller than a preset number;
if the number of the training samples is smaller than the preset number, determining the training sample set as a training sample subset of the corresponding label;
If the number of the training samples is not smaller than the preset number, clustering the training samples to obtain a plurality of clustering results; determining the clustering result as a training sample subset of the labels corresponding to the training sample set;
and determining the training sample subset of the label as the training sample subset corresponding to the target label combination where the label is located.
5. The method of claim 4, wherein clustering the training samples to obtain a plurality of clustered results comprises:
Inputting the training samples into a conversion model for conversion processing to obtain a second feature vector of each training sample;
And clustering based on the second feature vector to obtain a plurality of clustering results.
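Claims 4 and 5 partition each label's samples and cluster only the large sets. A schematic Python sketch follows, in which `cluster_fn` is a hypothetical stand-in for the feature-vector clustering of claim 5 and `toy_cluster` is only a toy example of such a callable:

```python
def build_subsets(samples_by_label, preset_number, cluster_fn):
    """Claim 4 (sketch): a label whose sample set is smaller than the
    preset number keeps its whole set as one subset; a larger set is
    split into several subsets by the supplied clustering callable."""
    subsets = {}
    for label, samples in samples_by_label.items():
        if len(samples) < preset_number:
            subsets[label] = [samples]            # one subset: the whole set
        else:
            subsets[label] = cluster_fn(samples)  # several clustered subsets
    return subsets

def toy_cluster(samples):
    """Toy stand-in clustering: bucket samples by length parity."""
    even = [s for s in samples if len(s) % 2 == 0]
    odd = [s for s in samples if len(s) % 2 == 1]
    return [c for c in (even, odd) if c]
```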
6. The method according to claim 1, wherein the determining, according to the similarity between the training sample subsets, a target tag that satisfies an integration condition in the corresponding target tag combination includes:
For each target label combination, determining a first integration degree of labels in the target label combination according to the similarity among the training sample subsets corresponding to the target label combination;
and determining the target label meeting the integration condition in the target label combination according to the first integration degree.
7. The method of claim 6, wherein determining a first degree of integration of the tags in the target tag combination based on the degree of similarity between the plurality of training sample subsets to which the target tag combination corresponds comprises:
Determining a sample distance threshold corresponding to the target label combination according to the training sample subset corresponding to the target label combination;
Acquiring a target sample corresponding to each training sample corresponding to the target label combination based on a sample generation model and the sample distance threshold;
and determining a first integration degree of the labels in the target label combination according to the target sample.
8. The method of claim 7, wherein the target tag combination comprises a first tag and a second tag, the training sample subset of the first tag being a first training sample subset and the training sample subset of the second tag being a second training sample subset;
The determining a sample distance threshold corresponding to the target label combination according to the training sample subset corresponding to the target label combination includes:
inputting each training sample in the first training sample subset and the second training sample subset corresponding to the target label combination into a feature detection model for feature detection processing to obtain sample features of each training sample;
Determining a first distance between each first training sample subset and each second training sample subset according to the sample characteristics; the first distance characterizes a similarity between the first subset of training samples and the second subset of training samples;
And determining the average distance of each first distance, and determining the average distance as a sample distance threshold corresponding to the target label combination.
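The threshold computation of claim 8 can be illustrated as follows; taking the mean pairwise Euclidean distance between sample features as the "first distance" is an assumption made for this sketch.

```python
import math
from itertools import product

def subset_distance(features_a, features_b):
    """One plausible 'first distance': mean pairwise Euclidean distance
    between the sample features of two subsets (smaller = more similar)."""
    pairs = list(product(features_a, features_b))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def sample_distance_threshold(first_subsets, second_subsets):
    """Claim 8 (sketch): average the first distances over every
    (first subset, second subset) pair to obtain the target label
    combination's sample distance threshold."""
    distances = [subset_distance(fa, fb)
                 for fa in first_subsets for fb in second_subsets]
    return sum(distances) / len(distances)
```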
9. The method of claim 8, wherein the obtaining, based on the sample generation model and the sample distance threshold, a target sample for each of the training samples for the target tag combination comprises:
Inputting each first training sample in the first training sample subset and the sample distance threshold into the sample generation model for sample generation processing to obtain at least one first generated sample corresponding to each first training sample; a sample distance between the first generated sample and the first training sample is less than the sample distance threshold;
Inputting each second training sample in the second training sample subset into the sample generation model for sample generation processing to obtain at least one second generated sample corresponding to each second training sample; a sample distance between the second generated sample and the second training sample is less than the sample distance threshold;
For each first training sample, determining the first generated sample with the largest target feature among the first generated samples corresponding to the first training sample as a target first generated sample corresponding to the first training sample; the target features comprise sample features corresponding to the first tag and sample features corresponding to the second tag;
And determining, for each second training sample, a second generated sample with the largest target feature among the second generated samples corresponding to the second training samples as a target second generated sample corresponding to the second training sample.
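The per-sample selection step of claim 9 can be sketched as below; the scalar "distance" and the `generate_fn`/`feature_fn` callables are hypothetical stand-ins for the sample generation model and the target-feature measurement.

```python
def pick_target_generated(training_sample, generate_fn, feature_fn, threshold):
    """Claim 9 (sketch): generate candidate samples for a training sample,
    keep only those within the sample distance threshold, and return the
    candidate whose target feature is largest (None if none qualify)."""
    candidates = [g for g in generate_fn(training_sample)
                  if abs(g - training_sample) < threshold]  # toy scalar distance
    return max(candidates, key=feature_fn) if candidates else None
```

With numeric stand-ins, a candidate outside the threshold is discarded before the maximum-feature selection is applied.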
10. The method of claim 9, wherein determining a first degree of integration of the tags in the target tag combination from the target sample comprises:
Inputting the target first generated samples into a multi-classification model to perform first label prediction processing to obtain a first multi-classification result of each target first generated sample; inputting the target first generated samples into a classification model to perform second label prediction processing to obtain a first two-class classification result of each target first generated sample;
determining a first integration degree of the first tag according to the first multi-classification result and the first two-classification result;
Inputting the second target generation samples into a multi-classification model to perform first label prediction processing to obtain a second multi-classification result of each second target generation sample; inputting the second target generation samples into a classification model to perform second label prediction processing to obtain a second classification result of each second target generation sample;
and determining the first integration degree of the second label according to the second multi-classification result and the second classification result.
11. The method of claim 10, wherein determining the first integration degree of the first label according to the first multi-classification results and the first binary classification results comprises:
acquiring, from each first multi-classification result, a first probability that the label of the corresponding target first generated sample is the first label, acquiring, from each first binary classification result, a second probability that the label of the corresponding target first generated sample is the first label, and determining a first average value of the first probabilities and a second average value of the second probabilities as the first integration degree of the first label;
wherein determining the first integration degree of the second label according to the second multi-classification results and the second binary classification results comprises:
acquiring, from each second multi-classification result, a third probability that the label of the corresponding target second generated sample is the second label, acquiring, from each second binary classification result, a fourth probability that the label of the corresponding target second generated sample is the second label, and determining a third average value of the third probabilities and a fourth average value of the fourth probabilities as the first integration degree of the second label.
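The averaging in claim 11 is straightforward; a minimal sketch, assuming the per-sample probabilities have already been read out of the multi-classification and binary classification results:

```python
def first_integration_degree(multi_probs, binary_probs):
    """For one label: multi_probs[i] is the probability, from the
    multi-classification model, that the i-th target generated sample
    carries this label; binary_probs[i] is the same probability from the
    binary classification model. The pair of averages is taken as the
    label's first integration degree."""
    first_avg = sum(multi_probs) / len(multi_probs)
    second_avg = sum(binary_probs) / len(binary_probs)
    return first_avg, second_avg
```

For example, `first_integration_degree([0.5, 0.75], [0.25, 0.75])` yields `(0.625, 0.5)`.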
12. The method of claim 6, wherein the target label combination comprises a first label and a second label, and wherein determining, according to the first integration degree, a target label satisfying an integration condition in the target label combination comprises:
for either of the first label and the second label, if the first integration degree of the label indicates that an internal aggregation degree of the label is less than a preset threshold, determining the label as a first target label satisfying the integration condition; and
if the first integration degree of the first label indicates that the internal aggregation degree of the first label is not less than the preset threshold and the first integration degree of the second label indicates that the internal aggregation degree of the second label is not less than the preset threshold, determining the first label and the second label as second target labels satisfying the integration condition.
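The thresholding rule of claim 12 reduces to a small decision function. This sketch assumes each label's first integration degree has been collapsed to a single internal-aggregation score, which the claim leaves unspecified:

```python
def select_target_labels(aggregation, threshold):
    """aggregation maps each label in the combination to its internal
    aggregation degree. Labels below the threshold satisfy the integration
    condition individually (first target labels); if neither is below it,
    both labels jointly satisfy it (second target labels)."""
    below = [label for label, degree in aggregation.items() if degree < threshold]
    if below:
        return below, "first"
    return list(aggregation), "second"
```

A label with a low internal aggregation degree is thus integrated on its own, while two tightly aggregated labels are integrated as a pair.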
13. A data integration apparatus, comprising:
a first acquisition module, configured to acquire a plurality of pieces of annotation data, wherein each piece of annotation data comprises a training sample and a label of the training sample;
a first determining module, configured to determine, among the labels, a plurality of target label combinations satisfying a preset condition;
a second acquisition module, configured to acquire, from the training samples, a training sample subset corresponding to each target label combination;
a second determining module, configured to determine, according to a similarity between the training sample subsets, a target label satisfying an integration condition in the corresponding target label combination; and
an integration module, configured to integrate the target label and the training samples corresponding to the target label, to obtain a target annotation data set.
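End to end, the modules of claim 13 can be mimicked by a toy pipeline. Here `similarity` and `preset_condition` stand in for the unspecified subset-scoring and combination rules, and the merged-label naming is purely illustrative:

```python
from collections import defaultdict
from itertools import combinations

def integrate(annotation_data, similarity, preset_condition, threshold=0.5):
    """Toy sketch of the claim-13 pipeline. annotation_data is a list of
    (training_sample, label) pairs. Label pairs that satisfy the preset
    condition form target label combinations; when their training sample
    subsets are similar enough, the two labels are merged into one
    integrated target label."""
    # First acquisition / grouping: collect the training samples per label.
    by_label = defaultdict(list)
    for sample, label in annotation_data:
        by_label[label].append(sample)
    result = dict(by_label)
    # Candidate combinations, subset similarity, and integration.
    for a, b in combinations(sorted(by_label), 2):
        if not preset_condition(a, b):
            continue  # not a target label combination
        if similarity(by_label[a], by_label[b]) >= threshold:
            merged = a + "+" + b  # illustrative name for the integrated label
            result[merged] = by_label[a] + by_label[b]
            result.pop(a, None)
            result.pop(b, None)
    return result
```

With a similarity function that compares subset means, labels whose samples cluster together are merged while distant labels are kept apart.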
14. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions that, when executed by the processor, cause the processor to perform the data integration method of any one of claims 1-12.
15. A computer-readable storage medium storing computer-executable instructions that, when executed, cause a computer to perform the data integration method of any one of claims 1-12.
CN202311015412.8A 2023-08-11 2023-08-11 Data integration method, device and equipment Pending CN117951514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311015412.8A CN117951514A (en) 2023-08-11 2023-08-11 Data integration method, device and equipment


Publications (1)

Publication Number Publication Date
CN117951514A true CN117951514A (en) 2024-04-30

Family

ID=90795004



Similar Documents

Publication Title
CN112084337B (en) Training method of text classification model, text classification method and equipment
CN112685565B (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN107066464B (en) Semantic natural language vector space
US10762439B2 (en) Event clustering and classification with document embedding
CN111950269A (en) Text statement processing method and device, computer equipment and storage medium
CN116227474B (en) Method and device for generating countermeasure text, storage medium and electronic equipment
CN110597961B (en) Text category labeling method and device, electronic equipment and storage medium
US20190236135A1 (en) Cross-lingual text classification
CN117235226A (en) Question response method and device based on large language model
CN112131883B (en) Language model training method, device, computer equipment and storage medium
CN113221555B (en) Keyword recognition method, device and equipment based on multitasking model
CN111401062B (en) Text risk identification method, device and equipment
CN110674297B (en) Public opinion text classification model construction method, public opinion text classification device and public opinion text classification equipment
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN112417093B (en) Model training method and device
CN114722834A (en) Semantic recognition model training method, equipment and medium based on contrast learning
CN114519120A (en) Image searching method and device based on multi-modal algorithm
CN114691864A (en) Text classification model training method and device and text classification method and device
CN116543264A (en) Training method of image classification model, image classification method and device
CN110728147A (en) Model training method and named entity recognition method
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN114722832A (en) Abstract extraction method, device, equipment and storage medium
CN113051910A (en) Method and device for predicting emotion of character role
CN113342935A (en) Semantic recognition method and device, electronic equipment and readable storage medium
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination