CN110796200A - Data classification method, terminal, device and storage medium - Google Patents


Info

Publication number
CN110796200A
Authority
CN
China
Prior art keywords
data
sub
training set
verification
classification
Prior art date
Legal status
Granted
Application number
CN201911044522.0A
Other languages
Chinese (zh)
Other versions
CN110796200B (en)
Inventor
陈瑞钦
黄启军
李诗琦
唐兴兴
林冰垠
Current Assignee
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN201911044522.0A priority Critical patent/CN110796200B/en
Publication of CN110796200A publication Critical patent/CN110796200A/en
Application granted granted Critical
Publication of CN110796200B publication Critical patent/CN110796200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning


Abstract

The invention discloses a data classification method comprising the following steps: when a data classification instruction is received, a target feature identifier is obtained; the data in a data set are partitioned into blocks based on the target feature identifier; a classification operation is performed on each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set; and the sub-training set, the sub-verification set and the sub-test set are sent to the training set, the verification set and the test set respectively. The invention also discloses a device, a terminal and a storage medium. Data whose feature data corresponding to the target feature identifier take the same value are divided into one data block on the terminal, and each data block is then classified into the sub-training set, the sub-verification set and the sub-test set according to a preset proportion and sent out, so that the corresponding data are sent directly from the terminal to the training set, the verification set and the test set.

Description

Data classification method, terminal, device and storage medium
Technical Field
The present invention relates to the field of terminal technologies, and in particular, to a data classification method, a terminal, an apparatus, and a storage medium.
Background
We live in a big data era, and big data analysis is generally performed on distributed systems, for example a distributed machine learning system used to build machine learning models over big data. When training a machine learning model, the original data are usually split in a certain proportion into a training set, a verification set and a test set, so the original data stored on each computing node must be split as required. For example, if a user selects a feature x to split on, and that feature takes three values a, b and c, the data for each of the three values must be split separately, and the per-value split results are then merged to obtain the final training set, verification set and test set.
The current pain point is that big data are generally stored across multiple computing nodes. In the traditional map-reduce computing model, or in layered splitting methods based on it such as group-by operators, the data must first be re-partitioned by feature: the data stored on each computing node are redistributed according to the value of feature x, so that each computing node holds a single feature value (or the selected feature data). The feature data on each computing node are then split randomly in proportion, and the split results are gathered to obtain the final layered split. This splitting method requires a large amount of data to be transmitted between different computing nodes, resulting in poor computing performance.
Disclosure of Invention
The invention mainly aims to provide a data classification method, a terminal, a device and a storage medium, to solve the technical problem that, when classifying the sample data set used to train a machine learning model, existing approaches must move data between distributed terminals, which imposes a heavy processing burden on the system and makes data classification slow.
In order to achieve the above object, the present invention provides a data classification method applied to a terminal, the data classification method comprising the steps of:
when a data classification instruction is received, acquiring a target characteristic identifier;
partitioning the data in the data set based on the target feature identification to obtain a plurality of data blocks;
classifying each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set;
and respectively sending the sub-training set to a training set, the sub-verification set to a verification set and the sub-test set to a test set.
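The four steps above can be sketched for a single terminal as a short illustration (Python; the dict-based record layout, the function name and the 6:2:2 default ratio are assumptions for illustration, not part of the disclosure):

```python
import random
from collections import defaultdict

def classify_data(data_set, target_feature, ratios=(0.6, 0.2, 0.2), seed=0):
    """Block records by the target feature's value, then split each block."""
    # Step S20: one data block per distinct value of the target feature.
    blocks = defaultdict(list)
    for record in data_set:
        blocks[record[target_feature]].append(record)

    # Step S30: split every block into sub-training/verification/test parts
    # according to the preset proportion data.
    sub_train, sub_val, sub_test = [], [], []
    rng = random.Random(seed)
    for block in blocks.values():
        rng.shuffle(block)
        n_train = int(len(block) * ratios[0])
        n_val = int(len(block) * ratios[1])
        sub_train += block[:n_train]
        sub_val += block[n_train:n_train + n_val]
        sub_test += block[n_train + n_val:]

    # Step S40: here the three sub-sets would be sent on to the global
    # training, verification and test sets; this sketch simply returns them.
    return sub_train, sub_val, sub_test
```

Because every record carrying a given target-feature value is handled inside one block on one terminal, no inter-terminal transfer is needed before splitting.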
Further, in an embodiment, the data includes a target feature identifier, the feature data corresponding to the target feature identifier has m values, m is a positive integer, and the step of blocking the data in the data set based on the target feature identifier to obtain a plurality of data blocks includes:
and dividing the data with the same value of the characteristic data corresponding to the target characteristic identifier in the data set into one data block to obtain data blocks corresponding to the m values.
Further, in an embodiment, the step of performing a classification operation on each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set, and a sub-test set includes:
acquiring proportional data of the sub-training set, the sub-verification set and the sub-test set;
and traversing each data block, and correspondingly distributing each data block to the sub-training set, the sub-verification set or the sub-test set based on the proportion data.
Further, in an embodiment, the data set includes first data, and after the step of obtaining the target feature identifier when receiving the data classification instruction, the method further includes:
and when the feature data corresponding to the target feature identifier in the first data meets the admission condition of a training set, determining the first data as the training set data, and sending the training set data to the training set.
Further, in an embodiment, when the feature data corresponding to the target feature identifier of the first data satisfies a training set admission condition, the step of determining that the first data is training set data includes:
and when the quantity of the data required by the training set is greater than or equal to a threshold value and the feature data of the first data exists in the feature data of the required data, determining the first data as the training set data, wherein the required data comprises the feature data corresponding to the target feature identifier.
Further, in an embodiment, the data set includes second data, and after the step of obtaining the target feature identifier when receiving the data classification instruction, the method further includes:
and when the feature data corresponding to the target feature identifier in the second data meets the admission condition of a verification set, determining the second data as the verification set data, and sending the verification set data to the verification set.
Further, in an embodiment, the data set includes third data, and after the step of obtaining the target feature identifier when receiving the data classification instruction, the method further includes:
and when the feature data corresponding to the target feature identifier in the third data meets the test set admission condition, determining the third data as test set data, and sending the test set data to the test set.
In addition, to achieve the above object, the present invention further provides a data classification apparatus, the data classification apparatus including:
the acquisition module is used for acquiring the target characteristic identification when a data classification instruction is received;
the blocking module is used for blocking the data in the data set based on the target feature identification to obtain a plurality of data blocks;
the classification module is used for performing classification operation on each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set;
and the sending module is used for respectively sending the sub-training set to the training set, the sub-verification set to the verification set and the sub-test set to the test set.
In order to achieve the above object, the present invention further provides a terminal, including: a memory, a processor and a data classification program stored on the memory and executable on the processor, the data classification program when executed by the processor implementing the steps of the data classification method as described above.
In addition, to achieve the above object, the present invention further provides a storage medium having a data classification program stored thereon, the data classification program implementing the steps of any one of the data classification methods described above when executed by a processor.
The method obtains a target feature identifier when a data classification instruction is received, blocks the data in the data set based on the target feature identifier to obtain a plurality of data blocks, classifies each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set, and finally sends the sub-training set to the training set, the sub-verification set to the verification set and the sub-test set to the test set. Data whose feature data corresponding to the target feature identifier take the same value are divided into one data block, each data block is classified directly into training, verification and test portions according to the preset proportion, and the sub-training set, sub-verification set and sub-test set are then sent to their respective sets, so that the corresponding data are sent directly from the terminal to the training set, the verification set and the test set.
Drawings
Fig. 1 is a schematic structural diagram of a terminal in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a data classification method according to a first embodiment of the present invention;
FIG. 3 is a flow chart illustrating a prior art method of data classification according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a whole process of classification processing according to an embodiment of the data classification method of the present invention;
FIG. 5 is a flowchart illustrating a data classification method according to a second embodiment of the present invention;
FIG. 6 is a functional block diagram of an embodiment of a data classification apparatus according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of a terminal in a hardware operating environment according to an embodiment of the present invention.
As shown in fig. 1, the terminal may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005 and a communication bus 1002, where the communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g. a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g. a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the terminal may further include a camera, a radio frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like, with sensors such as light sensors and motion sensors. Specifically, the light sensors may include an ambient light sensor, which can adjust the brightness of the display screen according to the ambient light, and a proximity sensor, which can turn off the display screen and/or the backlight when the mobile terminal is moved to the ear. As one kind of motion sensor, an attitude sensor can detect the magnitude of acceleration in each direction (generally three axes) and the magnitude and direction of gravity when stationary, and can be used in applications that recognize the attitude of the mobile terminal (such as landscape/portrait switching, related games and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer and tapping). Of course, the terminal may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor, which are not described here again.
Those skilled in the art will appreciate that the terminal structure shown in fig. 1 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module and a data classification program.
In the terminal shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and communicating with it; the user interface 1003 is mainly used for connecting a user terminal and communicating with it; and the processor 1001 may be used to invoke the data classification program stored in the memory 1005.
In this embodiment, the terminal includes: the system comprises a memory 1005, a processor 1001 and a data classification program which is stored in the memory 1005 and can be run on the processor 1001, wherein when the processor 1001 calls the data classification program stored in the memory 1005, the steps of the data classification method provided by each embodiment of the application are executed.
The invention also provides a data classification method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the data classification method of the invention.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than presented herein.
The data classification method of the first embodiment of the present invention is applied to a terminal, and a plurality of terminal devices are in communication connection with a server, and in this embodiment, the data classification method includes:
step S10, when a data classification instruction is received, a target feature identifier is obtained;
in this embodiment, in deep machine model learning, an available data set is often divided into a training set, a verification set, and a test set, where the training set refers to a sample set for training and is mainly used to train parameters in a neural network; the verification set is understood in a literal sense and is a sample set for verifying the performance of the model, and after training of different machine models on the training set is finished, the performance of each model is compared and judged through the verification set; the test set is used for objectively evaluating the performance of the machine model for the trained machine model.
In particular, the data set is stored on the terminal, and the data sets of the plurality of terminals can together be regarded as one original data set, which needs to be divided into a training set, a verification set and a test set for machine learning. Each data record consists of feature data and the target data corresponding to that feature data. The feature data comprise a plurality of features, and each feature can take several values. For example, in data characterizing a person, the features may include gender, height, weight, age range, etc., and the possible values of the feature "age range" are under 10 years, 10-18 years, 18-30 years, 30-50 years and over 50 years. Note that all records in the data set contain the same kinds of features, but the specific feature values differ. To divide the data set into a training set, a verification set and a test set, a certain feature must be chosen as the target feature, and each record is then finally assigned to the training set, verification set or test set according to the specific value that record takes for the target feature. Therefore, when a data classification instruction is received, the target feature identifier is obtained at the same time. The specific target feature identifier is determined by the actual situation; the target feature is generally a single feature, and the target feature identifier can be determined from historical data and expert experience.
Step S20, partitioning the data in the data set based on the target feature identification to obtain a plurality of data blocks;
in this embodiment, after the target feature identifier is determined, that is, the classification basis of the data is determined, the data set is partitioned according to the specific value of the target feature identifier corresponding to each data, so as to obtain a plurality of data blocks.
Further, as shown in fig. 3, suppose there are n terminals and the target feature x has m feature values. In the prior art, when classifying the data set, data are moved between terminals: the data in all terminals participating in the classification are redistributed according to the target feature so that records with the same target-feature value end up on the same terminal, the data blocks on each terminal are then split, and the split results are output to the training set, the verification set and the test set respectively. The data classification method provided by the invention instead performs blocking and classification within a single terminal, without any data movement between terminals. Compared with the prior art, it therefore avoids inter-terminal data movement during classification, which reduces system resource consumption, saves classification time and improves classification efficiency.
Specifically, step S20 includes: and dividing the data with the same value of the characteristic data corresponding to the target characteristic identifier in the data set into one data block to obtain data blocks corresponding to the m values.
In this embodiment, the target feature identifier designates a feature in the data, and a feature can take several values. The data are blocked according to the value of that feature, i.e. records in the data set whose feature data corresponding to the target feature identifier are identical are aggregated, yielding data blocks corresponding to the different feature values. Assuming the target feature x takes m different values, the data set can be divided into m data blocks according to the value of the target feature in each record, and the target-feature value of every record within one data block is the same.
For example, the feature data describing a person may include gender, height, weight, age range, and so on. Assume the feature "age range" can take 5 values: under 10 years, 10-18 years, 18-30 years, 30-50 years and over 50 years, and that a data set containing 1000 records uses "age range" as the target feature identifier. The records whose "age range" is under 10 years are divided into one data block; similarly, the records with "age range" 10-18 years form one data block, those with 18-30 years one data block, those with 30-50 years one data block, and those over 50 years one data block, so 5 data blocks are finally obtained. In general, when the data set contains a large number of records, the number of data blocks equals the number of values in the target feature's value range; that is, if the target feature has 5 possible values, the data set is divided into 5 data blocks.
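The "age range" example above can be reproduced in a few lines (an illustrative sketch; the record layout and group labels are hypothetical stand-ins for the patent's example):

```python
import random
from collections import defaultdict

# 1000 hypothetical records, each carrying an "age_group" target feature
# with 5 possible values, as in the example above.
rng = random.Random(42)
groups = ["<10", "10-18", "18-30", "30-50", ">50"]
data_set = [{"age_group": rng.choice(groups)} for _ in range(1000)]

# Blocking: records sharing the same target-feature value go into one block.
blocks = defaultdict(list)
for record in data_set:
    blocks[record["age_group"]].append(record)
```

With enough records, every possible value is represented, so the number of blocks equals the number of values in the feature's value range: 5.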
Step S30, classifying each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set;
in this embodiment, after the data set is partitioned according to the values of the target features of the data, the data blocks corresponding to different feature data are obtained, that is, the values of the target features corresponding to each data in one data block are the same. And then, continuously classifying each data block according to a preset classification rule to further obtain a sub-training set, a sub-verification set and a sub-test set, wherein the preset classification rule is as follows: after obtaining the proportional data of the sub-training set, the sub-verification set and the sub-test set, dividing the data in each data block into sub-training set data, sub-verification set data and sub-test set data according to the proportional data, and then dividing the sub-training set data into the sub-training set, the sub-verification set data into the sub-verification set and the sub-test set data into the sub-test set
Specifically, step S30 includes:
step S31, obtaining the proportion data of the sub training set, the sub verification set and the sub test set;
in this embodiment, the scale data commonly used for small scale data sets is a sub-training set: and (4) sub-verification set: subtest set 6: 2: 2, for example, 10000 data are in total, the sub training set is divided into 6000 data, the sub verification set is 2000 data, and the sub test set is 2000 data; for a large sample set, the proportion of the sub-verification set to the sub-test set is reduced a lot, because a certain sample size is sufficient to verify (compare) the model performance and the test model performance, for example, 10000000 samples are total, the training set is divided into 9980000 samples, the verification set is divided into 10000 samples, and the test set is divided into 10000 samples.
Specifically, the proportion data of the sub-training set, the sub-verification set and the sub-test set are obtained according to the quantity of data in the data set. Optionally, a preset proportion data list stores data quantities and the proportion data corresponding to each quantity, and the proportion data for the data set can be looked up in this list according to the quantity of data on the terminal.
Step S32, traversing each data block, and allocating each data block to the sub-training set, the sub-verification set or the sub-test set based on the proportion data.
In this embodiment, after the proportion data of the sub-training set, the sub-verification set and the sub-test set are obtained, all data blocks are traversed and each data block is allocated to the sub-training set, the sub-verification set or the sub-test set according to the proportion data. It should be noted that the data in each data block must be divided into sub-training set data, sub-verification set data and sub-test set data according to the proportion data, with the sub-training set data going to the sub-training set, the sub-verification set data to the sub-verification set and the sub-test set data to the sub-test set. Optionally, all data blocks in the data set are traversed in sequence according to the proportion data: the data block allocation of the sub-training set is completed first, then that of the sub-verification set, and finally that of the sub-test set. Alternatively, the currently traversed data are allocated to the sub-training set, the sub-verification set and the sub-test set simultaneously according to the proportion data, and once the number of data blocks in the sub-verification set or the sub-test set meets the requirement of the proportion data, no further data blocks are allocated to that set. Finally, after all data blocks have been traversed and allocated according to the proportion data, the sub-training set, the sub-verification set and the sub-test set are obtained.
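The second allocation option described above (blocks traversed once, with the sub-verification and sub-test quotas capped) can be sketched as follows; the function name and the list-of-lists block representation are illustrative assumptions:

```python
def allocate_blocks(blocks, ratios=(0.6, 0.2, 0.2)):
    """Traverse blocks once, filling the three sub-sets in proportion and
    capping the sub-verification/sub-test sets at their global quotas."""
    n_total = sum(len(b) for b in blocks)
    val_quota = int(n_total * ratios[1])
    test_quota = int(n_total * ratios[2])
    sub_train, sub_val, sub_test = [], [], []
    for block in blocks:
        n = len(block)
        # Proportional share of this block, but never exceed the quota.
        n_val = min(int(n * ratios[1]), val_quota - len(sub_val))
        n_test = min(int(n * ratios[2]), test_quota - len(sub_test))
        sub_val += block[:n_val]
        sub_test += block[n_val:n_val + n_test]
        sub_train += block[n_val + n_test:]  # remainder trains
    return sub_train, sub_val, sub_test
```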
Step S40, sending the sub-training set to the training set, the sub-verification set to the verification set and the sub-test set to the test set respectively.
In this embodiment, the data sets of the plurality of terminals are regarded as one original data set, which is finally divided into a training set, a verification set and a test set. Within one terminal, the data in the data set are blocked according to the values of the target feature, i.e. records whose feature data corresponding to the target feature identifier are identical are aggregated into data blocks corresponding to the different feature values. Then the proportion data of the sub-training set, sub-verification set and sub-test set for that terminal are obtained, all data blocks are traversed and allocated according to the proportion data, and the sub-training set, sub-verification set and sub-test set are obtained. Finally, the sub-training set, sub-verification set and sub-test set on the terminal are sent to the training set, verification set and test set respectively.
For example, referring to fig. 4, the data classification method of the present invention processes as follows:
the first step is as follows: storing an original data set D in n terminals, and partitioning data on each terminal according to the value of the target characteristic x;
the second step is that: each terminal divides the data blocks corresponding to different characteristic data into a sub-training set, a sub-verification set and a sub-test set according to a proportion;
the third step: each terminal outputs the sub-training set, the sub-verification set and the sub-test set;
the fourth step: and combining the sub-training set, the sub-verification set and the sub-test set of each terminal to obtain a training set, a verification set and a test set.
In the data classification method provided in this embodiment, when a data classification instruction is received, a target feature identifier is obtained; the data in the data set are blocked based on the target feature identifier to obtain a plurality of data blocks; each data block is then classified based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set; and finally the sub-training set, the sub-verification set and the sub-test set are sent to the training set, the verification set and the test set respectively. Data whose feature data corresponding to the target feature identifier take the same value are divided into one data block, each data block is classified directly into training, verification and test portions according to the preset proportion, and the sub-training set, sub-verification set and sub-test set are then sent to their respective sets, so that the corresponding data are sent directly from the terminal to the training set, the verification set and the test set.
A second embodiment of the data classification method of the present invention is proposed based on the first embodiment, with reference to fig. 5, and in this embodiment, after step S10, the method includes:
step S50, when the feature data corresponding to the target feature identifier in the first data meets the admission condition of the training set, determining that the first data is the training set data, and sending the training set data to the training set.
In this embodiment, when a data classification instruction is received, a target feature identifier is obtained, then it is determined whether feature data corresponding to the target feature identifier of first data meets an admission condition for a training set, when the feature data corresponding to the target feature identifier of the first data meets the admission condition for the training set, the first data is determined to be training set data, and the first data is sent to the training set.
Specifically, step S50 includes: when the quantity of data required by the training set is greater than or equal to a threshold and the feature data of the first data exists among the feature data of the required data, determining the first data to be training set data, where the required data comprises the feature data corresponding to the target feature identifier.
In this embodiment, after a data classification instruction is received, a target feature identifier is obtained, and the proportion data of the training set, the verification set and the test set corresponding to each distinct feature value is then obtained. Here, the feature values are all possible values corresponding to the target feature identifier; for example, if a target feature x takes m different values, the proportion data corresponding to each of the m values must be obtained. Optionally, all the proportion data of the training set, the verification set and the test set are stored in a preset proportion-data list and can be obtained by table lookup.
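The per-value quota calculation from the proportion data might look like this. This is illustrative only; `value_counts` and `proportions` are assumed input structures for the sketch, not the format of the patent's proportion-data list.

```python
def quotas_from_proportions(value_counts, proportions):
    """value_counts: {feature_value: total records with that value}.
    proportions: {feature_value: (train_ratio, verify_ratio, test_ratio)},
    e.g. looked up from a preset proportion-data list.
    Returns {feature_value: [n_train, n_verify, n_test]}."""
    quotas = {}
    for value, total in value_counts.items():
        t, v, _ = proportions[value]
        n_train = round(total * t)
        n_verify = round(total * v)
        # The test quota takes the remainder so the three counts sum to total.
        quotas[value] = [n_train, n_verify, total - n_train - n_verify]
    return quotas
```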
Further, after the proportion data corresponding to the different feature values are obtained, the total amount of data to be classified into the training set, the total amount to be classified into the verification set and the total amount to be classified into the test set are calculated for each feature value according to the proportion data. All data in the data set are then traversed in sequence; each datum is checked against the training set admission condition, and data meeting the condition are taken as training set data and sent to the training set. The training set admission condition is that the quantity of data currently required by the training set is greater than or equal to a threshold, which may be set to 1. Suppose the data set includes first data and the first data is the datum currently traversed: the feature value corresponding to the target feature of the first data is obtained, and when the feature data of the first data exists among the feature data of the required data, the first data meets the training set admission condition. The first data is then taken as training set data and sent to the training set, and the quantity of data required by the training set is updated, i.e., after the first data is processed, the quantity of data required by the training set is reduced by 1.
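The training-set admission condition and its quota update might be sketched like this. This is a hypothetical illustration: `required_train` stands in for the per-feature-value quantity of data still required by the training set, and the record format is assumed.

```python
def assign_training_set(dataset, feature, required_train):
    """required_train: {feature_value: remaining training-set quota}.
    A record is admitted when the quota for its feature value is >= 1
    (the threshold); each admission decrements the quota."""
    THRESHOLD = 1
    training_set, remainder = [], []
    for record in dataset:
        value = record[feature]
        if required_train.get(value, 0) >= THRESHOLD:
            training_set.append(record)
            required_train[value] -= 1  # update the required-data count
        else:
            remainder.append(record)    # left for the verification/test passes
    return training_set, remainder
```

The same loop shape applies to the verification and test sets in steps S60 and S70, with their own quota dictionaries.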
Step S60, when the feature data corresponding to the target feature identifier in the second data meets the verification set admission condition, determining that the second data is verification set data, and sending the verification set data to the verification set.
In this embodiment, the target feature identifier and the proportion data corresponding to each feature value are obtained in the same way as described under step S50 above.
Further, the verification set admission condition is checked in the same way: all data in the data set are traversed in sequence, and data meeting the verification set admission condition are taken as verification set data and sent to the verification set. The verification set admission condition is that the quantity of data currently required by the verification set is greater than or equal to a threshold, which may be set to 1. Suppose the data set includes second data and the second data is the datum currently traversed: the feature value corresponding to the target feature of the second data is obtained, and when the feature data of the second data exists among the feature data of the required data, the second data meets the verification set admission condition. The second data is then taken as verification set data and sent to the verification set, and the quantity of data required by the verification set is updated, i.e., reduced by 1 after the second data is processed.
Step S70, when the feature data corresponding to the target feature identifier in the third data meets the test set admission condition, determining that the third data is test set data, and sending the test set data to the test set.
In this embodiment, the target feature identifier and the proportion data corresponding to each feature value are likewise obtained as described under step S50 above.
Further, the test set admission condition is checked in the same way: all data in the data set are traversed in sequence, and data meeting the test set admission condition are taken as test set data and sent to the test set. The test set admission condition is that the quantity of data currently required by the test set is greater than or equal to a threshold, which may be set to 1. Suppose the data set includes third data and the third data is the datum currently traversed: the feature value corresponding to the target feature of the third data is obtained, and when the feature data of the third data exists among the feature data of the required data, the third data meets the test set admission condition. The third data is then taken as test set data and sent to the test set, and the quantity of data required by the test set is updated, i.e., reduced by 1 after the third data is processed.
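Taken together, steps S50 to S70 amount to a single traversal that tries the training, verification and test quotas in turn for each record. A minimal sketch follows; the quota layout (`{feature_value: [n_train, n_verify, n_test]}`) is an assumption for illustration, not the patent's data structure.

```python
def classify_dataset(dataset, feature, quotas):
    """quotas: {feature_value: [n_train, n_verify, n_test]} computed from
    the proportion data. Each record is sent to the first set whose quota
    for its feature value is still >= 1; the quota is then decremented."""
    sets = {"train": [], "verify": [], "test": []}
    names = ("train", "verify", "test")
    for record in dataset:
        counts = quotas[record[feature]]
        for i, name in enumerate(names):
            if counts[i] >= 1:          # admission condition: quota >= threshold
                sets[name].append(record)
                counts[i] -= 1          # update the required-data count
                break
    return sets
```

Because the quotas are kept per feature value, each of the three sets ends up with the preset proportion of every feature value without any record leaving the terminal.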
In the data classification method provided by this embodiment, the classification type of each datum is determined by traversing the data at the terminal and applying the training set, verification set or test set admission condition, and the result is sent to the server, so that the corresponding data are sent to the training set, the verification set and the test set directly at the terminal. Compared with the prior art, in which all terminals participating in data classification move data according to the target feature so that data with the same target feature value end up on the same terminal, no data needs to move between terminals during classification. The impact of inter-terminal data movement on system performance and processing speed is thus avoided, which reduces the resource consumption of the system, saves data classification time and improves data classification efficiency.
The present invention further provides a data classification device. Referring to fig. 6, fig. 6 is a functional module schematic diagram of an embodiment of the data classification device of the present invention.
The acquiring module 10 is configured to acquire a target feature identifier when a data classification instruction is received;
a blocking module 20, configured to block data in the data set based on the target feature identifier to obtain a plurality of data blocks;
the classification module 30 is configured to perform classification operations on the data blocks based on preset classification rules to obtain a sub-training set, a sub-verification set, and a sub-test set;
and a sending module 40, configured to send the sub-training set to the training set, the sub-verification set to the verification set, and the sub-test set to the test set, respectively.
Further, the blocking module 20 is further configured to:
dividing the data in the data set whose feature data corresponding to the target feature identifier take the same value into one data block, to obtain data blocks corresponding to the m values.
Further, the classification module 30 is further configured to:
acquiring proportional data of the sub-training set, the sub-verification set and the sub-test set;
and traversing each data block, and correspondingly distributing each data block to the sub-training set, the sub-verification set or the sub-test set based on the proportion data.
Further, the data classification device further includes:
and the first processing module is used for determining the first data as training set data when the characteristic data corresponding to the target characteristic identification in the first data meets the admission condition of a training set, and sending the training set data to the training set.
Further, the first processing module is further configured to:
and when the quantity of the data required by the training set is greater than or equal to a threshold value and the feature data of the first data exists in the feature data of the required data, determining the first data as the training set data, wherein the required data comprises the feature data corresponding to the target feature identifier.
Further, the data classification device further includes:
and the second processing module is used for determining the second data as verification set data when the feature data corresponding to the target feature identifier in the second data meets the verification set admission condition, and sending the verification set data to the verification set.
Further, the data classification device further includes:
and the third processing module is used for determining the third data as test set data when the feature data corresponding to the target feature identifier in the third data meets the test set admission condition, and sending the test set data to the test set.
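The modules described above can be sketched together as a single class. This is an illustrative, in-memory sketch assuming a dict-based record format and the proportional rule of the first embodiment; the class and method names are hypothetical, not the patent's API.

```python
class DataClassificationDevice:
    """Sketch of the acquiring (10), blocking (20), classification (30)
    and sending (40) modules, with 'sending' modeled as list appends."""

    def __init__(self, ratios=(0.6, 0.2, 0.2)):
        self.ratios = ratios
        self.train, self.verify, self.test = [], [], []

    def acquire(self, instruction):
        # Acquiring module 10: read the target feature identifier
        # from the data classification instruction.
        return instruction["target_feature"]

    def block(self, dataset, feature):
        # Blocking module 20: one block per distinct feature value.
        blocks = {}
        for record in dataset:
            blocks.setdefault(record[feature], []).append(record)
        return blocks

    def classify_and_send(self, blocks):
        # Classification module 30 + sending module 40: split each block
        # by the preset proportion and send the parts to the three sets.
        for block in blocks.values():
            n = len(block)
            a = int(n * self.ratios[0])
            b = a + int(n * self.ratios[1])
            self.train.extend(block[:a])
            self.verify.extend(block[a:b])
            self.test.extend(block[b:])
```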
In addition, an embodiment of the present invention further provides a storage medium, where the storage medium stores a data classification program, and the data classification program, when executed by a processor, implements the steps of the data classification method in the foregoing embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a system device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A data classification method, applied to a terminal, the data classification method comprising the following steps:
when a data classification instruction is received, acquiring a target characteristic identifier;
partitioning the data in the data set based on the target feature identification to obtain a plurality of data blocks;
classifying each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set;
and respectively sending the sub-training set to a training set, the sub-verification set to a verification set and the sub-test set to a test set.
2. The data classification method according to claim 1, wherein the data includes a target feature identifier, the feature data corresponding to the target feature identifier has m values, m is a positive integer, and the step of blocking the data in the data set based on the target feature identifier to obtain a plurality of data blocks includes:
and dividing the data with the same value of the characteristic data corresponding to the target characteristic identifier in the data set into one data block to obtain data blocks corresponding to the m values.
3. The data classification method according to claim 1, wherein the step of performing classification operation on each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set comprises:
acquiring proportional data of the sub-training set, the sub-verification set and the sub-test set;
and traversing each data block, and correspondingly distributing each data block to the sub-training set, the sub-verification set or the sub-test set based on the proportion data.
4. The data classification method of claim 1, wherein the data set includes first data, and wherein the step of obtaining the target feature identifier upon receiving the data classification command further comprises:
and when the feature data corresponding to the target feature identifier in the first data meets the admission condition of a training set, determining the first data as the training set data, and sending the training set data to the training set.
5. The data classification method according to claim 4, wherein the step of determining the first data as training set data when the feature data corresponding to the target feature identifier of the first data satisfies a training set admission condition comprises:
and when the quantity of the data required by the training set is greater than or equal to a threshold value and the feature data of the first data exists in the feature data of the required data, determining the first data as the training set data, wherein the required data comprises the feature data corresponding to the target feature identifier.
6. The data classification method of claim 1, wherein the data set includes second data, and wherein the step of obtaining the target feature identifier upon receiving the data classification command further comprises:
and when the feature data corresponding to the target feature identifier in the second data meets the admission condition of a verification set, determining the second data as the verification set data, and sending the verification set data to the verification set.
7. The data classification method according to claim 1, wherein the data set includes third data, and further comprising, after the step of obtaining the target feature identifier upon receiving the data classification command:
and when the feature data corresponding to the target feature identifier in the third data meets the test set admission condition, determining the third data as test set data, and sending the test set data to the test set.
8. A data classification device, characterized in that the data classification device comprises:
the acquisition module is used for acquiring the target characteristic identification when a data classification instruction is received;
the blocking module is used for blocking the data in the data set based on the target feature identification to obtain a plurality of data blocks;
the classification module is used for performing classification operation on each data block based on a preset classification rule to obtain a sub-training set, a sub-verification set and a sub-test set;
and the sending module is used for respectively sending the sub-training set to the training set, the sub-verification set to the verification set and the sub-test set to the test set.
9. A terminal, characterized in that the terminal comprises: memory, processor and a data classification program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the data classification method according to any one of claims 1 to 7.
10. A storage medium having a data classification program stored thereon, the data classification program, when executed by a processor, implementing the steps of the data classification method according to any one of claims 1 to 7.
CN201911044522.0A 2019-10-30 2019-10-30 Data classification method, terminal, device and storage medium Active CN110796200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911044522.0A CN110796200B (en) 2019-10-30 2019-10-30 Data classification method, terminal, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911044522.0A CN110796200B (en) 2019-10-30 2019-10-30 Data classification method, terminal, device and storage medium

Publications (2)

Publication Number Publication Date
CN110796200A true CN110796200A (en) 2020-02-14
CN110796200B CN110796200B (en) 2022-11-25

Family

ID=69442256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911044522.0A Active CN110796200B (en) 2019-10-30 2019-10-30 Data classification method, terminal, device and storage medium

Country Status (1)

Country Link
CN (1) CN110796200B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416884A (en) * 2023-06-12 2023-07-11 深圳市彤兴电子有限公司 Testing device and testing method for display module

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678512A (en) * 2013-12-26 2014-03-26 大连民族学院 Data stream merge sorting method under dynamic data environment
CN104317658A (en) * 2014-10-17 2015-01-28 华中科技大学 MapReduce based load self-adaptive task scheduling method
WO2016118402A1 (en) * 2015-01-22 2016-07-28 Microsoft Technology Licensing, Llc Optimizing multi-class multimedia data classification using negative data
CN106228120A (en) * 2016-07-14 2016-12-14 南京航空航天大学 The extensive human face data mask method of query driven
CN106599798A (en) * 2016-11-25 2017-04-26 南京蓝泰交通设施有限责任公司 Face recognition method facing face recognition training method of big data processing
CN109255480A (en) * 2018-08-30 2019-01-22 中国平安人寿保险股份有限公司 Between servant lead prediction technique, device, computer equipment and storage medium
CN109858886A (en) * 2019-02-18 2019-06-07 国网吉林省电力有限公司电力科学研究院 It is a kind of that control success rate promotion analysis method is taken based on integrated study
CN110084291A (en) * 2019-04-12 2019-08-02 湖北工业大学 A kind of students ' behavior analysis method and device based on the study of the big data limit

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103678512A (en) * 2013-12-26 2014-03-26 大连民族学院 Data stream merge sorting method under dynamic data environment
CN104317658A (en) * 2014-10-17 2015-01-28 华中科技大学 MapReduce based load self-adaptive task scheduling method
WO2016118402A1 (en) * 2015-01-22 2016-07-28 Microsoft Technology Licensing, Llc Optimizing multi-class multimedia data classification using negative data
CN106228120A (en) * 2016-07-14 2016-12-14 南京航空航天大学 The extensive human face data mask method of query driven
CN106599798A (en) * 2016-11-25 2017-04-26 南京蓝泰交通设施有限责任公司 Face recognition method facing face recognition training method of big data processing
CN109255480A (en) * 2018-08-30 2019-01-22 中国平安人寿保险股份有限公司 Between servant lead prediction technique, device, computer equipment and storage medium
CN109858886A (en) * 2019-02-18 2019-06-07 国网吉林省电力有限公司电力科学研究院 It is a kind of that control success rate promotion analysis method is taken based on integrated study
CN110084291A (en) * 2019-04-12 2019-08-02 湖北工业大学 A kind of students ' behavior analysis method and device based on the study of the big data limit

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROOZBEH RF ET AL: "Adaptive Incremental Ensemble of Extreme Learning Machines for Fault Diagnosis in Induction Motors", IEEE *
WU ZELUN (吴泽伦): "Research and Implementation of Parallelization of Data Mining Algorithms Based on Hadoop", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116416884A (en) * 2023-06-12 2023-07-11 深圳市彤兴电子有限公司 Testing device and testing method for display module
CN116416884B (en) * 2023-06-12 2023-08-18 深圳市彤兴电子有限公司 Testing device and testing method for display module

Also Published As

Publication number Publication date
CN110796200B (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN111222647A (en) Federal learning system optimization method, device, equipment and storage medium
CN110084317B (en) Method and device for recognizing images
CN110365503A (en) A kind of Index and its relevant device
CN109688183B (en) Group control equipment identification method, device, equipment and computer readable storage medium
CN111144584A (en) Parameter tuning method, device and computer storage medium
CN107807841B (en) Server simulation method, device, equipment and readable storage medium
US20170184410A1 (en) Method and electronic device for personalized navigation
CN106980571A (en) The construction method and equipment of a kind of test use cases
CN115145801B (en) A/B test flow distribution method, device, equipment and storage medium
CN112084959B (en) Crowd image processing method and device
CN108833515B (en) Block chain node optimization method and device and computer readable storage medium
CN114880310A (en) User behavior analysis method and device, computer equipment and storage medium
CN111385598A (en) Cloud device, terminal device and image classification method
CN110796200B (en) Data classification method, terminal, device and storage medium
CN110069997B (en) Scene classification method and device and electronic equipment
CN110580171B (en) APP classification method, related device and product
CN111814117A (en) Model interpretation method, device and readable storage medium
CN111368045B (en) User intention recognition method, device, equipment and computer readable storage medium
CN111368998A (en) Spark cluster-based model training method, device, equipment and storage medium
KR101966423B1 (en) Method for image matching and apparatus for executing the method
CN115797267A (en) Image quality evaluation method, system, electronic device, and storage medium
CN111339196B (en) Data processing method and system based on block chain and computer readable storage medium
CN108009393B (en) Data processing method, device and computer readable storage medium
CN114237182A (en) Robot scheduling method and system
CN112905792A (en) Text clustering method, device and equipment based on non-text scene and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant