CN112085080B - Sample equalization method, device, equipment and storage medium - Google Patents

Sample equalization method, device, equipment and storage medium

Info

Publication number
CN112085080B
Authority
CN
China
Prior art keywords
samples
target
sample set
balanced
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010899784.1A
Other languages
Chinese (zh)
Other versions
CN112085080A (en)
Inventor
杨晨
杨天行
彭彬
张一麟
宋勋超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010899784.1A
Publication of CN112085080A
Application granted
Publication of CN112085080B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)

Abstract

The application discloses a sample equalization method, apparatus, device and storage medium, and relates to the technical field of data processing, in particular to deep learning, artificial intelligence and intelligent search technologies. The specific implementation scheme is as follows: determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to each label in the sample set to be balanced, and taking the samples corresponding to the target label in the sample set to be balanced as target samples; adding target samples so that the number of samples corresponding to the target label in the sample set to be balanced reaches a target sample number, obtaining a new sample set; and if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set. This technology makes the number of samples corresponding to each label in the sample set more balanced.

Description

Sample equalization method, device, equipment and storage medium
Technical Field
The application relates to the technical field of data processing, in particular to deep learning, artificial intelligence and intelligent searching technologies. The application provides a sample equalization method, a sample equalization device, sample equalization equipment and a storage medium.
Background
In the process of training a multi-class classification model, due to the specificity of multi-label classification tasks, the data samples labeled by clients for such tasks generally fail to meet the sample balance requirements of model training. A model trained on unbalanced samples tends to assign data to the label categories that hold a high proportion of the samples, which causes classification errors.
Disclosure of Invention
The present disclosure provides a sample equalization method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a sample equalization method, including:
determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, taking the samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label;
adding the target samples to enable the number of samples corresponding to the target labels in the sample set to be balanced to reach the number of target samples, and obtaining a new sample set;
if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, the samples corresponding to the other tags except the target tag in the new sample set are added, so that the number of samples corresponding to the other tags in the sample set to be balanced reaches the number of target samples.
According to another aspect of the present disclosure, there is provided a sample equalization apparatus, comprising:
the label determining module is used for determining a target label to be balanced from at least two labels related to a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, and taking the samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label;
the first adding module is used for adding the target samples to enable the number of samples corresponding to the target labels in the sample set to be balanced to reach the number of target samples, and a new sample set is obtained;
and the second increasing module is used for increasing the samples corresponding to the other labels except the target label in the new sample set if the number of the samples corresponding to the other labels except the target label in the new sample set is smaller than the number of the target samples, so that the number of the samples corresponding to the other labels in the sample set to be balanced reaches the number of the target samples.
According to still another aspect of the present disclosure, there is provided an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present application.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the embodiments of the present application.
According to the technology of the present application, the number of samples corresponding to each label in the sample set becomes more balanced.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
Fig. 1 is a flowchart of a sample equalization method provided in an embodiment of the present application;
Fig. 2 is a flowchart of another sample equalization method provided in an embodiment of the present application;
Fig. 3 is a flowchart of yet another sample equalization method provided in an embodiment of the present application;
Fig. 4 is a flowchart of yet another sample equalization method provided in an embodiment of the present application;
Fig. 5 is a flowchart of yet another sample equalization method provided in an embodiment of the present application;
Fig. 6 is a schematic diagram of a sample equalization apparatus according to an embodiment of the present application;
Fig. 7 is a block diagram of an electronic device for implementing the sample equalization method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flowchart of a sample equalization method provided in an embodiment of the present application. The embodiment can be applied to the situation of balancing training samples of the multi-label classification model. The method may be performed by a sample equalization apparatus. The apparatus may be implemented in software and/or hardware. Referring to fig. 1, a sample equalization method provided in an embodiment of the present application includes:
S110, determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, taking samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label.
The sample set to be balanced refers to a sample set that requires balancing, namely a sample set in which the samples are unevenly distributed across the labels.
The at least two labels associated with the sample set to be balanced refer to the labels carried by the samples in that sample set.
Illustratively, taking legal documents as the text to be classified, the labels can be traffic incident, non-traffic incident, property dispute, illegal financing and the like.
The target label refers to the label whose samples are to be balanced.
In one embodiment, determining the target label may include:
taking the label with the smallest number of samples as the target label, according to the number of samples corresponding to each label in the sample set to be balanced.
Optionally, in another embodiment, determining the target label may include:
taking each label whose number of samples is smaller than a set number threshold as a target label, according to the number of samples corresponding to each label in the sample set to be balanced.
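The two selection rules above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `(text, labels)` sample representation, the function names, and the alphabetical tie-break between equally rare labels are all assumptions made for the example.

```python
from collections import Counter


def count_labels(samples):
    """Count how many samples carry each label (hypothetical helper)."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def target_label_by_minimum(samples):
    """Rule 1: the label with the smallest sample count.

    Ties are broken alphabetically; the source does not specify a tie-break.
    """
    counts = count_labels(samples)
    return min(sorted(counts), key=counts.get)


def target_labels_below_threshold(samples, threshold):
    """Rule 2: every label whose sample count is below the set threshold."""
    counts = count_labels(samples)
    return sorted(label for label, n in counts.items() if n < threshold)


samples = [
    ("sentence1", {"a", "b", "c"}),
    ("sentence2", {"a", "c"}),
    ("sentence3", {"a", "d"}),
    ("sentence4", {"a", "e"}),
]
print(target_label_by_minimum(samples))
print(target_labels_below_threshold(samples, 2))
```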
The target sample refers to a sample corresponding to the target label in the sample set to be balanced, namely, a sample labeled with the target label in the sample set to be balanced.
S120, adding target samples so that the number of samples corresponding to the target label in the sample set to be balanced reaches the target sample number, and obtaining a new sample set.
The target sample number refers to the sample number that the target tag should have after equalization.
Optionally, determining the target sample number includes:
determining the target sample number according to the maximum value or the average value of the numbers of samples corresponding to the labels in the sample set to be balanced.
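As a rough sketch of this step, the target sample number could be derived from per-label counts as shown below. The function name, the `strategy` parameter, and rounding the average up to a whole number are assumptions for illustration; the source only states that the maximum or the average is used.

```python
import math


def target_sample_number(label_counts, strategy="max"):
    """Derive the target sample number from per-label sample counts.

    strategy="max" uses the largest count; strategy="mean" uses the
    average, rounded up (the rounding rule is an assumption).
    """
    values = list(label_counts.values())
    if not values:
        raise ValueError("label_counts must not be empty")
    if strategy == "max":
        return max(values)
    if strategy == "mean":
        return math.ceil(sum(values) / len(values))
    raise ValueError(f"unknown strategy: {strategy}")


label_counts = {"a": 4, "b": 1, "c": 2, "d": 1, "e": 1}
print(target_sample_number(label_counts, "max"))
print(target_sample_number(label_counts, "mean"))
```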
In one embodiment, adding target samples may include:
inputting the target samples into a pre-trained model, and adding the samples output by the model that are similar to the target samples to the sample set to be balanced; or,
copying the target samples, and adding the copies to the sample set to be balanced.
The new sample set refers to the sample set to be balanced after target samples have been added, in which the number of samples of the target label has reached the target sample number.
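The copy-based variant can be sketched as below, assuming samples are `(text, labels)` pairs. The function name and the choice to cycle through the existing target samples in order are illustrative assumptions; the variant that generates similar samples with a pre-trained model is not shown.

```python
from itertools import cycle


def add_target_samples(samples, target_label, target_number):
    """Copy samples carrying target_label until that label has target_number samples."""
    target_samples = [s for s in samples if target_label in s[1]]
    if not target_samples:
        raise ValueError(f"no sample carries label {target_label!r}")
    deficit = max(target_number - len(target_samples), 0)
    new_set = list(samples)
    source = cycle(target_samples)  # copy each target sample in turn
    for _ in range(deficit):
        new_set.append(next(source))
    return new_set


samples = [
    ("sentence1", {"a", "b", "c"}),
    ("sentence2", {"a", "c"}),
    ("sentence3", {"a", "d"}),
    ("sentence4", {"a", "e"}),
]
new_set = add_target_samples(samples, "e", 4)
print(len(new_set))
```

Because the copied samples may carry other labels as well, the counts of those labels change too, which is why the later steps re-count the new sample set.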
S130, if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set, so that the number of samples corresponding to that label reaches the target sample number.
Wherein, other labels refer to labels in the new sample set except for the target label.
The samples corresponding to other labels except the target label in the new sample set refer to samples marked with other labels in the new sample set.
According to the technical scheme, if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, the samples corresponding to the other tags except the target tag in the new sample set are added, so that balance of the other tags except the target tag is achieved, sample balance of the multi-tag sample set is achieved, and the balance degree of the multi-tag sample set is improved.
Fig. 2 is a flowchart of another sample equalization method provided in an embodiment of the present application. This scheme extends the scheme described above. Referring to Fig. 2, the sample equalization method provided in the embodiment of the present application includes:
S210, determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, taking samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label.
S220, adding the target samples, enabling the number of samples corresponding to the target labels in the sample set to be balanced to reach the target number of samples, and obtaining a new sample set.
S230, counting the sample number of other labels except the target label in the new sample set based on the added target sample.
When the target sample is a multi-label sample, the target sample has other labels in addition to the target label.
For example, the target sample is a first sentence and the target tag is a property dispute. Other tags that the first sentence has may be illegal financing or the like.
S240, if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set, so that the number of samples corresponding to that label reaches the target sample number.
In one embodiment, if the number of samples corresponding to the other tags in the new sample set except the target tag is smaller than the target number of samples, adding samples corresponding to the other tags in the new sample set except the target tag to make the number of samples corresponding to the other tags in the sample set to be equalized reach the target number of samples, including:
if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, determining a tag to be balanced from the other tags according to the number of samples corresponding to the other tags and the number of target samples, and determining a sample to be balanced corresponding to the tag to be balanced in the new sample set;
and adding the samples to be balanced, so that the number of the samples of the labels to be balanced in the new sample set reaches the target number of samples.
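The label to be balanced refers to a label, other than the target label, whose number of samples is smaller than the target sample number and which therefore still needs balancing.
The samples to be balanced refer to the samples in the new sample set that carry the label to be balanced.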
In one embodiment, determining the label to be balanced from the other labels according to the number of samples corresponding to the other labels and the target sample number includes:
comparing the number of samples of each label other than the target label in the new sample set with the target sample number;
and if the number of samples of such a label is smaller than the target sample number, taking that label as a label to be balanced.
In this scheme, the number of samples corresponding to each label other than the target label in the new sample set is counted based on the added target samples, and balancing of the other labels is triggered according to the counting result, thereby balancing the samples of the labels other than the target label in the sample set.
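The count-then-balance cycle described above can be sketched as a loop. This is a simplified illustration under stated assumptions: samples are `(text, labels)` pairs, new samples are produced by copying, the rarest pending label is handled first with alphabetical tie-breaking, and the names `count_labels` and `balance` are hypothetical.

```python
from collections import Counter
from itertools import cycle


def count_labels(samples):
    """Count how many samples carry each label (hypothetical helper)."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def balance(samples, target_number):
    """Repeat: re-count labels, pick one label below target, copy its samples."""
    new_set = list(samples)
    while True:
        counts = count_labels(new_set)  # re-count after every addition
        pending = sorted(l for l, n in counts.items() if n < target_number)
        if not pending:
            return new_set
        label = min(pending, key=counts.get)  # rarest label first (assumption)
        originals = [s for s in samples if label in s[1]]
        source = cycle(originals)
        for _ in range(target_number - counts[label]):
            new_set.append(next(source))


samples = [
    ("sentence1", {"a", "b", "c"}),
    ("sentence2", {"a", "c"}),
    ("sentence3", {"a", "d"}),
    ("sentence4", {"a", "e"}),
]
balanced = balance(samples, 4)
print(dict(count_labels(balanced)))
```

The loop terminates because every pass raises one label to the target sample number and copying never lowers any count.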
Fig. 3 is a flowchart of yet another sample equalization method provided in an embodiment of the present application. On the basis of the above scheme, this scheme specifically optimizes the step of "adding target samples so that the number of samples corresponding to the target label in the sample set to be balanced reaches the target sample number, and obtaining a new sample set". Referring to Fig. 3, the sample equalization method provided in the embodiment of the present application includes:
S310, determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, taking samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label.
S320, determining a difference value between the sample number of the target tag in the sample set to be balanced and the target sample number.
S330, adding the target samples according to the determined difference value, so that the number of samples of the target labels reaches the number of the target samples, and obtaining a new sample set.
S340, if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set, so that the number of samples corresponding to that label reaches the target sample number.
In this scheme, target samples are added according to the difference between the number of samples of the target label in the sample set to be balanced and the target sample number, so that the number of samples of the target label reaches the target sample number.
Fig. 4 is a flowchart of yet another sample equalization method provided in an embodiment of the present application. On the basis of the above scheme, this scheme specifically optimizes the step of "adding target samples so that the number of samples corresponding to the target label in the sample set to be balanced reaches the target sample number, and obtaining a new sample set". Referring to Fig. 4, the sample equalization method provided in the embodiment of the present application includes:
S410, determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, taking samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label.
S420, if there are at least two kinds of target samples and the target sample number is at least two, adding at least two kinds of target samples according to the target sample number, so that the number of samples of the target label in the sample set to be balanced reaches the target sample number, and obtaining a new sample set.
Wherein the at least two target samples comprise two, three or all kinds of target samples.
If the samples of the target label in the sample set to be balanced are the first sample and the second sample, the first sample and the second sample are copied respectively, and the copied first sample and second sample are added into the sample set to be balanced, so that the number of the first sample and the second sample in the sample set to be balanced is equivalent, and the richness of the sample set to be balanced is improved.
In one embodiment, the adding at least two target samples according to the target sample number to make the sample number of the target tag in the sample set to be equalized reach the target sample number, to obtain a new sample set includes:
determining the increase number of various target samples in the at least two target samples according to the target sample number, wherein the difference value between the increase numbers of the various target samples is smaller than a set difference value threshold;
and adding the at least two target samples according to the determined added quantity of the various target samples to obtain a new sample set.
The set difference threshold may be determined according to actual needs, which is not limited in this embodiment.
For example, if the target sample number is 8, there are 2 kinds of target samples, and the sample set to be balanced contains 2 target samples before balancing, then each kind of target sample is increased by 3, calculated as (8 - 2) / 2 = 3.
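The calculation in this example can be sketched as follows. The function name is hypothetical, and spreading any remainder one copy at a time (so the per-kind increases differ by at most 1) is an assumption standing in for the "set difference threshold", which the source leaves open.

```python
def increase_per_kind(target_number, current_number, kinds):
    """Split the missing sample count as evenly as possible across the kinds.

    Returns one increase per kind; increases differ by at most 1.
    """
    if kinds <= 0:
        raise ValueError("kinds must be positive")
    deficit = max(target_number - current_number, 0)
    base, remainder = divmod(deficit, kinds)
    return [base + 1 if i < remainder else base for i in range(kinds)]


print(increase_per_kind(8, 2, 2))  # the example from the text: (8 - 2) / 2
print(increase_per_kind(8, 3, 2))  # an uneven case
```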
S430, if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set, so that the number of samples corresponding to that label reaches the target sample number.
In this scheme, if there are at least two kinds of target samples and the target sample number is at least two, at least two kinds of target samples are added according to the target sample number, which improves the richness of the samples in the new sample set.
Fig. 5 is a flowchart of yet another sample equalization method provided in an embodiment of the present application. The present embodiment is an extension of the above-described scheme on the basis of the above-described embodiment. Referring to fig. 5, the sample equalization method provided in the embodiment of the present application includes:
S510, determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of samples corresponding to the labels in the sample set to be balanced, and taking the samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, and each sample is provided with at least one label.
S520, adding the target samples, and enabling the number of samples corresponding to the target labels in the sample set to be balanced to reach the target number of samples, so as to obtain a new sample set.
S530, if the number of samples corresponding to a label other than the target label in the new sample set is smaller than the target sample number, adding samples corresponding to that label in the new sample set, so that the number of samples corresponding to that label reaches the target sample number.
S540, determining the high-frequency label of the new sample set.
The high-frequency tag refers to a tag with a sample appearance frequency higher than that of other tags in the sample set.
In one embodiment, determining the high-frequency label of the new sample set includes:
determining the difference between the number of samples of each label in the new sample set and the target sample number;
and determining the high-frequency label from the at least two labels associated with the sample set according to the determined differences.
For example, the label whose number of samples differs most from the target sample number is taken as the high-frequency label.
Optionally, determining the high-frequency label of the new sample set includes:
determining the differences in the number of samples between different labels in the new sample set;
and determining the high-frequency label from the at least two labels associated with the sample set according to the determined differences.
S550, determining high-frequency samples from the new sample set, namely samples whose number of labels is one and whose only label is the high-frequency label.
A high-frequency sample is thus a sample labeled with only the high-frequency label.
S560, reducing the number of high-frequency samples, so as to reduce the difference between the number of samples of the high-frequency label and the number of samples of the other labels in the new sample set.
In one embodiment, a set number of high-frequency samples may be deleted.
To delete more precisely, high-frequency samples can instead be deleted according to the difference between the number of samples of the high-frequency label and the target sample number.
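Steps S540 to S560 can be sketched as below, assuming `(text, labels)` samples. The function names are hypothetical, and the high-frequency label is taken to be the label with the largest sample count, which is one reading of "largest difference from the target sample number". Only samples whose single label is the high-frequency label are deleted, so the counts of the other labels are unchanged.

```python
from collections import Counter


def count_labels(samples):
    """Count how many samples carry each label (hypothetical helper)."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(set(labels))
    return counts


def reduce_high_frequency(samples, target_number):
    """Delete single-label samples of the most frequent label, up to its excess."""
    counts = count_labels(samples)
    high = max(sorted(counts), key=counts.get)  # S540: high-frequency label
    excess = counts[high] - target_number       # difference to the target number
    kept = []
    for text, labels in samples:
        if excess > 0 and set(labels) == {high}:  # S550: high-frequency sample
            excess -= 1                           # S560: delete it
            continue
        kept.append((text, labels))
    return high, kept


samples = (
    [("sentence1", {"a", "b"})] * 2
    + [("sentence2", {"a"})] * 3
    + [("sentence3", {"b"})]
)
high, reduced = reduce_high_frequency(samples, 3)
print(high, dict(count_labels(reduced)))
```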
Taking a sample set to be balanced whose samples are multi-label samples as an example, the samples are as follows:
sentence1 a,b,c
sentence2 a,c
sentence3 a,d
sentence4 a,e
Here sentence1, sentence2, sentence3 and sentence4 are different samples, and a, b, c, d and e are different labels of the samples. The target sample number is set to 4, and the number of samples corresponding to each label before balancing is: a:4, b:1, c:2, d:1, e:1. Sample balancing is performed using the sample equalization method provided by the embodiment of the application, and the specific process is as follows:
taking a label with the smallest number of samples as the first label to be balanced, for example e, copying sentence4, and adding the 3 copies of sentence4 to the sample set to obtain a first sample set;
re-counting the first sample set to obtain a first statistical result: a:7, b:1, c:2, d:1, e:4;
determining a second label to be balanced according to the first statistical result, for example b, copying sentence1, and adding the 3 copies of sentence1 to the first sample set to obtain a second sample set;
re-counting the second sample set to obtain a second statistical result: a:10, b:4, c:5, d:1, e:4;
determining the third label to be balanced to be d according to the second statistical result, copying sentence3, and adding the 3 copies of sentence3 to the second sample set to obtain a third sample set;
re-counting the third sample set to obtain a third statistical result: a:13, b:4, c:5, d:4, e:4.
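The counts in this walkthrough can be reproduced with a few lines of Python. The tuple representation of samples and the helper name `count_labels` are assumptions made for the illustration; the copy order (sentence4, then sentence1, then sentence3) follows the steps above.

```python
from collections import Counter


def count_labels(sample_set):
    """Return a dict mapping each label to the number of samples carrying it."""
    counts = Counter()
    for _name, labels in sample_set:
        counts.update(labels)
    return dict(counts)


sentence1 = ("sentence1", ("a", "b", "c"))
sentence2 = ("sentence2", ("a", "c"))
sentence3 = ("sentence3", ("a", "d"))
sentence4 = ("sentence4", ("a", "e"))

original = [sentence1, sentence2, sentence3, sentence4]
first_set = original + [sentence4] * 3     # balance e
second_set = first_set + [sentence1] * 3   # balance b (c rises as a side effect)
third_set = second_set + [sentence3] * 3   # balance d

for name, sample_set in [("original", original), ("first", first_set),
                         ("second", second_set), ("third", third_set)]:
    print(name, count_labels(sample_set))
```

Label c never needs its own step here, because copying sentence1 for label b already lifts c from 2 to 5.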
After balancing, the number of samples of the high-frequency label (a in this example) is too high. When a large number of samples are analyzed, if samples of the high-frequency label can be removed without changing the number of samples of the other labels, the number of samples corresponding to the high-frequency label can be reduced appropriately. For example, if a sample such as "sentence5 a" exists, it can be deleted as appropriate to keep the samples corresponding to each label balanced.
To ensure the richness of the samples, note that the two samples corresponding to c are sentence1 and sentence2. When the number of samples of c needs to be increased, both samples are copied, instead of copying only one of them.
In this scheme, after the proportion of low-frequency samples has been raised, the number of high-frequency samples, which grew as a side effect of adding low-frequency samples, is reduced, so that the balance among the samples of each label in the sample set is further improved.
Fig. 6 is a schematic diagram of a sample equalization apparatus according to an embodiment of the present application. Referring to Fig. 6, a sample equalization apparatus 600 provided in an embodiment of the present application includes: a tag determining module 601, a first adding module 602, and a second adding module 603.
The tag determining module 601 is configured to determine a target tag to be balanced from at least two tags associated with a sample set to be balanced according to the number of samples corresponding to the tags in the sample set to be balanced, and take a sample corresponding to the target tag in the sample set to be balanced as a target sample, where the sample set to be balanced includes at least two samples, and each sample has at least one tag;
a first adding module 602, configured to add the target samples, so that the number of samples corresponding to the target tag in the sample set to be equalized reaches the target number of samples, and obtain a new sample set;
and a second adding module 603, configured to add samples corresponding to other tags in the new sample set except the target tag if the number of samples corresponding to other tags in the new sample set except the target tag is smaller than the target sample number, so that the number of samples corresponding to the other tags in the sample set to be balanced reaches the target sample number.
According to the technical scheme, if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, the samples corresponding to the other tags except the target tag in the new sample set are added, so that balance of the other tags except the target tag is achieved, sample balance of the multi-tag sample set is achieved, and the balance degree of the multi-tag sample set is improved.
Further, the apparatus further comprises:
and the quantity counting module is used for counting, based on the added target samples, the number of samples of each label other than the target label in the new sample set, before samples corresponding to the other labels are added to make the number of samples corresponding to the other labels reach the target sample number.
Further, the second adding module includes:
the label determining unit is configured to determine a label to be balanced from the other labels according to the number of samples corresponding to the other labels and the number of target samples if the number of samples corresponding to the other labels in the new sample set except the target label is smaller than the number of target samples, and determine a sample to be balanced corresponding to the label to be balanced in the new sample set;
And the sample adding unit is used for adding the samples to be balanced so that the number of the samples of the tags to be balanced in the new sample set reaches the target number of samples.
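The tag determining unit can be sketched as a small selection function. The source only says the tag to be balanced is chosen according to the per-tag sample numbers and the target sample number; picking the tag with the largest shortfall, the alphabetical tie-break, and the name `pick_tag_to_balance` are assumptions of this sketch.

```python
from collections import Counter


def pick_tag_to_balance(counts, target_tag, target_count):
    """Return the non-target tag with the largest shortfall, or None if none is short."""
    shortfalls = {
        tag: target_count - n
        for tag, n in counts.items()
        if tag != target_tag and n < target_count
    }
    if not shortfalls:
        return None
    # Order by (-shortfall, tag) so that ties resolve deterministically.
    return min(shortfalls, key=lambda tag: (-shortfalls[tag], tag))
```

A caller would add samples for the returned tag, recount, and repeat until the function returns None.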
Further, the first adding module includes:
a sample difference unit, configured to determine a difference between the number of samples of the target tag and the number of target samples in the sample set to be equalized;
and the sample adding unit is used for adding the target samples according to the determined difference value, so that the number of the samples of the target labels reaches the number of the target samples, and a new sample set is obtained.
Further, the first adding module includes:
and the sample adding unit is used for adding at least two target samples according to the number of the target samples if the number of the target samples is at least two, so that the number of the samples of the target labels in the sample set to be balanced reaches the number of the target samples, and a new sample set is obtained.
Further, the sample adding unit is specifically configured to:
determining the increase number of various target samples in the at least two target samples according to the target sample number, wherein the difference value between the increase numbers of the various target samples is smaller than a set difference value threshold;
And adding the at least two target samples according to the determined added quantity of the various target samples to obtain a new sample set.
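One way to keep the increase numbers of the different kinds of target sample close to each other is an even split. The sketch below is an assumption-laden illustration: `split_increase` is a hypothetical helper, each kind is identified by an arbitrary key, and the implied difference threshold is any value greater than 1.

```python
def split_increase(total_to_add, kinds):
    """Split total_to_add copies across the given sample kinds as evenly as possible."""
    if not kinds:
        raise ValueError("at least one kind of target sample is required")
    base, extra = divmod(total_to_add, len(kinds))
    # The first `extra` kinds receive one additional copy, so counts differ by at most 1.
    return {kind: base + (1 if i < extra else 0) for i, kind in enumerate(kinds)}
```

Because any two increase numbers differ by at most one, no single kind of target sample dominates the added copies.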
Further, the apparatus further comprises:
a high-frequency tag determining module, configured to determine the high-frequency tag of the new sample set after the samples corresponding to the tags other than the target tag have been added so that the number of samples corresponding to those other tags reaches the target sample number;
a high-frequency sample determining module, configured to determine, from the new sample set, high-frequency samples that have exactly one tag and are marked with the high-frequency tag;
a sample number reducing module, configured to reduce the number of high frequency samples, so as to reduce a difference between the number of samples of the high frequency tag and the number of samples of other tags in the new sample set except the high frequency tag.
Further, the high frequency tag determination module includes:
the difference value determining unit is used for determining the difference value between the number of samples corresponding to each label in the new sample set and the number of target samples;
And the high-frequency tag determining unit is used for determining the high-frequency tag from at least two tags corresponding to the new sample set according to the determined difference value.
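The high-frequency reduction step can be sketched as follows. The sketch assumes that samples are frozensets of tags, that the high-frequency tag is the one exceeding the target sample number by the largest margin, and that reduction stops once the tag is back at the target; the function names are hypothetical. Only single-tag samples are removed, so the counts of all other tags stay unchanged.

```python
from collections import Counter


def find_high_frequency_tag(counts, target_count):
    """Return the tag that exceeds target_count by the most, or None if none does."""
    excess = {tag: n - target_count for tag, n in counts.items() if n > target_count}
    if not excess:
        return None
    return min(excess, key=lambda tag: (-excess[tag], tag))


def reduce_high_frequency(samples, target_count):
    """Drop single-tag samples of the high-frequency tag, at most down to target_count."""
    counts = Counter(tag for sample in samples for tag in sample)
    tag = find_high_frequency_tag(counts, target_count)
    if tag is None:
        return list(samples)
    to_drop = counts[tag] - target_count
    kept = []
    for sample in samples:
        if to_drop > 0 and sample == frozenset({tag}):
            to_drop -= 1  # a single-tag sample can go without touching other tags
            continue
        kept.append(sample)
    return kept
```

If the set contains fewer single-tag samples than the excess, the sketch removes all of them and leaves the remaining surplus in place rather than deleting multi-tag samples.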
According to embodiments of the present application, an electronic device and a readable storage medium are also provided.
Fig. 7 is a block diagram of an electronic device for implementing the sample equalization method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in Fig. 7, the electronic device includes one or more processors 701, a memory 702, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories and multiple types of memory. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 701 is taken as an example in Fig. 7.
Memory 702 is the non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the sample equalization method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the sample equalization method provided herein.
As a non-transitory computer-readable storage medium, the memory 702 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the sample equalization method in the embodiments of the present application (e.g., the tag determining module 601, the first adding module 602, and the second adding module 603 shown in Fig. 6). By running the non-transitory software programs, instructions, and modules stored in the memory 702, the processor 701 executes the various functional applications and data processing of the server, that is, implements the sample equalization method of the above method embodiments.
Memory 702 may include a program storage area and a data storage area. The program storage area may store an operating system and at least one application program required for a function; the data storage area may store data created through use of the electronic device for sample equalization, and the like. In addition, the memory 702 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 702 optionally includes memory located remotely from the processor 701, and such remote memory may be connected to the electronic device for sample equalization via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device for the sample equalization method may further include an input device 703 and an output device 704. The processor 701, the memory 702, the input device 703, and the output device 704 may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 7.
The input device 703 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for sample equalization; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointer stick, one or more mouse buttons, trackball, and joystick. The output device 704 may include a display device, auxiliary lighting devices (e.g., LEDs), haptic feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.
According to the technical scheme, the equalization degree of the number of the corresponding samples of each tag in the sample set is improved.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A method of sample equalization, comprising:
determining a target label to be balanced from at least two labels associated with a sample set to be balanced according to the number of the samples corresponding to the labels in the sample set to be balanced, taking the samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, each sample is provided with at least one label, and the samples comprise multi-label samples;
Adding the target samples to enable the number of samples corresponding to the target labels in the sample set to be balanced to reach the number of target samples, and obtaining a new sample set;
if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, adding samples corresponding to other tags except the target tag in the new sample set, and enabling the number of samples corresponding to the other tags in the sample set to be balanced to reach the number of target samples;
determining a high frequency tag of the new sample set;
determining, from the new sample set, high-frequency samples whose number of labels is one and which are marked with the high-frequency tag;
reducing the number of high frequency samples to reduce the difference between the number of samples of the high frequency tag and the number of samples of other tags in the new sample set than the high frequency tag;
the adding the target samples to enable the number of samples corresponding to the target labels in the sample set to be balanced to reach the target number of samples, and obtaining a new sample set comprises the following steps:
if the types of the target samples are at least two, and the number of the target samples is at least two, adding at least two target samples according to the number of the target samples, so that the number of the target labels in the sample set to be balanced reaches the number of the target samples, and obtaining a new sample set; the multi-tag content of each target sample is the same.
2. The method of claim 1, wherein if the number of samples corresponding to the other tags in the new sample set than the target tag is smaller than the target number of samples, adding the samples corresponding to the other tags in the new sample set than the target tag, so that the number of samples corresponding to the other tags in the sample set to be equalized reaches the target number of samples, the method further comprises:
based on the added target samples, counting the sample number of other tags in the new sample set except the target tag.
3. The method according to claim 1 or 2, wherein if the number of samples corresponding to other tags in the new sample set than the target tag is smaller than the target number of samples, increasing the samples corresponding to other tags in the new sample set than the target tag to the number of samples corresponding to the other tags in the sample set to be equalized includes:
if the number of samples corresponding to other tags except the target tag in the new sample set is smaller than the number of target samples, determining a tag to be balanced from the other tags according to the number of samples corresponding to the other tags and the number of target samples, and determining a sample to be balanced corresponding to the tag to be balanced in the new sample set;
And adding the samples to be balanced, so that the number of the samples of the labels to be balanced in the new sample set reaches the target number of samples.
4. The method according to claim 1 or 2, wherein the adding the target samples to the number of samples corresponding to the target tag in the sample set to be equalized reaches a target number of samples, and obtaining a new sample set further includes:
determining a difference value between the sample number of the target tag in the sample set to be balanced and the target sample number;
and adding the target samples according to the determined difference value, so that the number of the samples of the target labels reaches the number of the target samples, and obtaining a new sample set.
5. The method of claim 1, wherein the adding at least two target samples according to the target sample number to make the sample number of the target tag in the sample set to be equalized reach the target sample number, to obtain a new sample set, includes:
determining the increase number of various target samples in the at least two target samples according to the target sample number, wherein the difference value between the increase numbers of the various target samples is smaller than a set difference value threshold;
And adding the at least two target samples according to the determined added quantity of the various target samples to obtain a new sample set.
6. The method of claim 1, wherein the determining the high-frequency tag of the new sample set comprises:
determining the difference value between the number of samples corresponding to each label in the new sample set and the number of target samples;
and determining the high-frequency tag from at least two tags corresponding to the new sample set according to the determined difference value.
7. An apparatus for sample equalization, comprising:
the label determining module is used for determining a target label to be balanced from at least two labels related to a sample set to be balanced according to the number of the samples corresponding to the labels in the sample set to be balanced, and taking the samples corresponding to the target label in the sample set to be balanced as target samples, wherein the sample set to be balanced comprises at least two samples, each sample is provided with at least one label, and the samples comprise multi-label samples;
the first adding module is used for adding the target samples to enable the number of samples corresponding to the target labels in the sample set to be balanced to reach the number of target samples, and a new sample set is obtained;
A second adding module, configured to add samples corresponding to other tags in the new sample set, except the target tag, if the number of samples corresponding to other tags in the new sample set, except the target tag, is smaller than the number of target samples, so that the number of samples corresponding to the other tags in the sample set to be equalized reaches the number of target samples;
the high-frequency tag determining module, configured to determine the high-frequency tag of the new sample set after the samples corresponding to the tags other than the target tag have been added so that the number of samples corresponding to those other tags reaches the number of target samples;
the high-frequency sample determining module, configured to determine, from the new sample set, high-frequency samples whose number of labels is one and which are marked with the high-frequency tag;
a sample number reducing module, configured to reduce the number of high-frequency samples, so as to reduce a difference between the number of samples of the high-frequency tag and the number of samples of other tags in the new sample set except the high-frequency tag;
Wherein the first adding module includes:
and the sample adding unit is used for adding at least two target samples according to the number of the target samples if the number of the target samples is at least two, so that the number of the target samples in the sample set to be balanced reaches the number of the target samples, a new sample set is obtained, and the multi-label content of each target sample is the same.
8. The apparatus of claim 7, the apparatus further comprising:
a quantity counting module, configured to count, based on the added target samples, the number of samples of the tags other than the target tag in the new sample set, before the samples corresponding to those other tags are added to reach the number of target samples.
9. The apparatus of claim 7 or 8, wherein the second adding module comprises:
the label determining unit is configured to determine a label to be balanced from the other labels according to the number of samples corresponding to the other labels and the number of target samples if the number of samples corresponding to the other labels in the new sample set except the target label is smaller than the number of target samples, and determine a sample to be balanced corresponding to the label to be balanced in the new sample set;
And the sample adding unit is used for adding the samples to be balanced so that the number of the samples of the tags to be balanced in the new sample set reaches the target number of samples.
10. The apparatus of claim 7 or 8, wherein the first augmentation module further comprises:
a sample difference unit, configured to determine a difference between the number of samples of the target tag and the number of target samples in the sample set to be equalized;
and the sample adding unit is used for adding the target samples according to the determined difference value, so that the number of the samples of the target labels reaches the number of the target samples, and a new sample set is obtained.
11. The apparatus of claim 7, wherein the sample addition unit is specifically configured to:
determining the increase number of various target samples in the at least two target samples according to the target sample number, wherein the difference value between the increase numbers of the various target samples is smaller than a set difference value threshold;
and adding the at least two target samples according to the determined added quantity of the various target samples to obtain a new sample set.
12. The apparatus of claim 7, wherein the high frequency tag determination module comprises:
The difference value determining unit is used for determining the difference value between the number of samples corresponding to each label in the new sample set and the number of target samples;
and the high-frequency tag determining unit is used for determining the high-frequency tag from at least two tags corresponding to the new sample set according to the determined difference value.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010899784.1A 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium Active CN112085080B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010899784.1A CN112085080B (en) 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112085080A (en) 2020-12-15
CN112085080B (en) 2024-03-08

Family

ID=73731635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010899784.1A Active CN112085080B (en) 2020-08-31 2020-08-31 Sample equalization method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112085080B (en)

Also Published As

Publication number Publication date
CN112085080A (en) 2020-12-15

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant