WO2019200782A1 - Sample data classification method, model training method, electronic device and storage medium - Google Patents

Info

Publication number
WO2019200782A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample, distance, value, density, samples
Prior art date
Application number
PCT/CN2018/100157
Other languages
French (fr)
Chinese (zh)
Inventor
王晨羽
王健宗
肖京
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2019200782A1 publication Critical patent/WO2019200782A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/906: Clustering; Classification

Definitions

  • the present application relates to the field of data processing, and in particular, to a sample data classification method, a model training method, an electronic device, and a storage medium.
  • a sample data classification method comprising:
  • the sample data is clustered into a plurality of subsets based on the at least one cluster center and features of each sample.
  • a model training method comprising:
  • the sample data of each category is classified by using the sample data classification method described in any embodiment to obtain a plurality of subsets of each category;
  • the subsets with the same sorting position are read from the plurality of sorted subsets of each category in turn as training samples of the model, and the model is trained.
  • An electronic device comprising a memory for storing at least one instruction and a processor for executing the at least one instruction to implement the sample data classification method of any embodiment and/or the model training method of any embodiment.
  • a non-volatile readable storage medium storing at least one instruction that, when executed by a processor, implements the sample data classification method of any embodiment and/or the model training method of any embodiment.
  • the present application calculates the features of each sample in the sample data; calculates a distance set for each sample according to its features, where the distance set of a sample includes the distance between that sample and each of the remaining samples; calculates the density value and the density distance value of each sample according to its distance set; determines at least one cluster center according to the density value and the density distance value of each sample; and clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
  • the present application trains from easy to difficult to avoid the difficult training samples being eliminated, thereby improving the adaptability of the model parameters.
  • FIG. 1 is a flow chart of a first preferred embodiment of a sample data classification method of the present application.
  • FIG. 2 is a flow chart of a first preferred embodiment of the method of training a model of the present application.
  • FIG. 3 is a block diagram showing the program of a first preferred embodiment of the sample data classification apparatus of the present application.
  • FIG. 4 is a block diagram of a program of a first preferred embodiment of the model training device of the present application.
  • FIG. 5 is a schematic structural diagram of a preferred embodiment of an electronic device in at least one example of the present application.
  • Referring to FIG. 1, a flowchart of the first preferred embodiment of the sample data classification method of the present application is shown.
  • the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
  • the electronic device calculates characteristics of each sample of the sample data.
  • the sample data includes, but is not limited to, pre-acquired data and data crawled from the network. In the process of large-scale sample data collection, some collected data has low correlation with the category the sample data is supposed to indicate, or is simply erroneous. In order to improve the accuracy of model training, it is necessary to classify the sample data, automatically detecting the simple samples that are easy to learn in the model training process and the difficult samples that are not easy to learn in the model training process, thus achieving classification of the sample data.
  • the features of each sample are extracted using a feature extraction model.
  • the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
  • features of the sample data are extracted by a deep convolutional neural network.
  • For example, in any deep convolutional neural network (VGG-16, ResNet-50, etc.), the part in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
  • the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
  • the model architecture of the deep convolutional neural network model is shown in FIG. 3, wherein Conv a-b (for example, Conv 3-64) indicates that the dimension of the convolution kernels of the layer is a×a and the number of convolution kernels of the layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
  • the training sample is used for training learning to obtain a trained deep convolutional neural network model.
  • Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
  • the larger the size of the training samples, the more accurate the features extracted by the trained deep convolutional neural network model.
  • the deep convolutional neural network model can also be in other forms of expression, and the present application does not impose any limitation.
  • the electronic device calculates a distance set of each sample according to characteristics of each sample.
  • the distance set of each sample includes the distance between that sample and each of the remaining samples, wherein the remaining samples corresponding to a sample are all samples in the sample data other than that sample.
  • For example, if the sample data includes 3 samples, the distance set of each sample contains 2 distances, so the distance matrix is a 3*2 or 2*3 matrix.
  • the distance includes, without limitation, a Euclidean distance, a cosine distance, and the like.
  • Each distance value in the distance matrix is greater than zero. For example, when the calculated cosine distance is less than 0, the absolute value of the calculated cosine distance is taken.
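For illustration only, the distance-matrix step described above can be sketched in Python as follows (hypothetical helper functions, not part of the patent; the sketch uses the cosine distance and takes the absolute value of a negative result so that every entry of the matrix is non-negative, as described):

```python
import math

def cosine_distance(a, b):
    # Cosine distance between two feature vectors; per the text, a negative
    # value is replaced by its absolute value so that every entry of the
    # distance matrix is non-negative.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return abs(dot / (norm_a * norm_b))

def distance_matrix(features):
    # D[i][j] is the distance between sample i and sample j; the diagonal
    # (a sample's distance to itself) is set to 0 and never used.
    n = len(features)
    return [[cosine_distance(features[i], features[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]
```

A Euclidean distance could be substituted without changing the rest of the pipeline.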
  • the electronic device calculates a density value of each sample according to a distance set of each sample and calculates a density distance value of each sample.
  • each distance in the distance set of each sample is compared with the distance threshold, the number of distances greater than the distance threshold is determined, and that number is taken as the density value of the sample.
  • the density value of any one of the samples is calculated as follows:
  • ρ i = Σ j≠i χ(D ij − d c ), where χ(x) = 1 when x > 0 and χ(x) = 0 otherwise
  • ρ i represents the density value of the ith sample in the sample data
  • D ij represents the distance between the ith sample and the jth sample in the sample data
  • d c represents the distance threshold
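Under the counting rule just given (a distance contributes when it exceeds the threshold d c), the density values can be sketched as follows (hypothetical Python, assuming the pairwise distance matrix D from the previous step):

```python
def density_values(D, d_c):
    # rho_i: the number of distances in sample i's distance set that are
    # greater than the distance threshold d_c, as described above.
    n = len(D)
    return [sum(1 for j in range(n) if j != i and D[i][j] > d_c)
            for i in range(n)]
```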
  • the calculating of the density distance value of each sample comprises:
  • the maximum distance in the distance set of the sample having the largest density value is selected as the density distance value of that sample.
  • for each sample in a second sample set, the minimum distance between the sample and any sample having a larger density value is selected as the density distance value of that sample, wherein the second sample set includes the samples of the sample data other than the sample having the largest density value.
  • the density distance value of each sample in the second sample set is calculated as follows:
  • δ i = min {j: ρ j > ρ i} D ij
  • δ i represents the density distance value of the i-th sample
  • ρ i represents the density value of the i-th sample
  • ρ j represents the density value of the j-th sample
  • D ij represents the distance between the i-th sample and the j-th sample
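The two-case rule above (maximum distance for the densest sample, minimum distance to any denser sample for the rest) can be sketched as follows (hypothetical Python, assuming D and the density values rho from the previous steps):

```python
def density_distance_values(D, rho):
    # delta_i: for the sample with the largest density value, the maximum
    # distance in its distance set; for every other sample, the minimum
    # distance to any sample with a larger density value.
    n = len(D)
    i_max = max(range(n), key=lambda i: rho[i])
    delta = []
    for i in range(n):
        others = [D[i][j] for j in range(n) if j != i]
        higher = [D[i][j] for j in range(n) if rho[j] > rho[i]]
        if i == i_max or not higher:
            # A tie with the maximum density also falls back to the maximum
            # distance (an assumption; the text does not cover ties).
            delta.append(max(others))
        else:
            delta.append(min(higher))
    return delta
```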
  • the electronic device determines at least one cluster center according to a density value of each sample and a density distance value of each sample.
  • the determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises:
  • calculating a cluster metric value for each sample, wherein the cluster metric value of each sample is equal to the product of the density value and the density distance value of the sample, i.e., γ i = ρ i × δ i.
  • the determining the at least one cluster center according to the cluster metric value of each sample includes:
  • selecting the samples whose cluster metric values are greater than a threshold as cluster centers.
  • the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values of all samples is calculated and taken as the threshold.
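Putting the two quantities together, the cluster metric and the mean-based threshold described above can be sketched as follows (hypothetical Python):

```python
def select_cluster_centers(rho, delta):
    # gamma_i = rho_i * delta_i; the threshold is configured as the mean of
    # all cluster metric values, and samples above it become cluster centers.
    gamma = [r * d for r, d in zip(rho, delta)]
    threshold = sum(gamma) / len(gamma)
    return [i for i, g in enumerate(gamma) if g > threshold]
```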
  • the electronic device clusters the sample data into a plurality of subsets based on the at least one cluster center and characteristics of each sample.
  • the sample data is clustered into a plurality of subsets according to a distance set of samples corresponding to each cluster center in the at least one cluster center by using a clustering algorithm.
  • the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
  • a sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold is determined as an error sample. In this way, erroneous samples can be effectively eliminated.
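The text names k-means and hierarchical clustering as possible algorithms; as a minimal stand-in, the sketch below assigns each sample to its nearest cluster center and flags samples farther than a distance threshold from every center as error samples (hypothetical Python, assuming Euclidean-style distances in D):

```python
def cluster_and_flag(D, centers, err_threshold):
    # Assign each non-center sample to the nearest cluster center; a sample
    # whose distance to every center exceeds err_threshold is flagged as an
    # error sample instead of being placed in a subset.
    subsets = {c: [c] for c in centers}
    errors = []
    for i in range(len(D)):
        if i in subsets:
            continue
        nearest = min(centers, key=lambda c: D[i][c])
        if D[i][nearest] > err_threshold:
            errors.append(i)
        else:
            subsets[nearest].append(i)
    return subsets, errors
```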
  • the larger the density value of a sample, the more samples are similar to it.
  • the larger the density distance value of a sample, the farther the subset in which the sample is located is from other subsets. Therefore, after clustering according to the above embodiment, the distance between samples in the same subset becomes shorter, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in the subset is to the category represented by the sample data; such samples are simple samples, and the model can easily learn their characteristics.
  • clustering the sample data through the cluster center selected in the above embodiment can also effectively eliminate the erroneous samples, thereby improving the accuracy of the parameters of the subsequent training model.
  • FIG. 2 is a flow chart of a second preferred embodiment of the model training method of the present application.
  • the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
  • the electronic device acquires sample data of each category.
  • the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model, which is used to identify which part of a vehicle the picture to be detected shows. In this case, sample data of each part of the vehicle needs to be obtained, and the sample data of one part belongs to one category.
  • the electronic device classifies sample data of each category to obtain multiple subsets of each category.
  • the sample data of each category is classified by using the sample data classification method in the first preferred embodiment.
  • the processing of step S21 is the same as the sample data classification method in the first preferred embodiment, and will not be described in detail herein.
  • the electronic device calculates a correlation between each subset of the plurality of subsets of each category and a category of each subset.
  • the distance between samples in the same subset can be shortened, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data in the subset is to the category in which the subset is located, and the higher the similarity; such samples are simple samples.
  • the sparser the samples of a subset, the more diverse the pictures they represent, and the more difficult the samples.
  • the number of samples included in each subset is taken as the relevance of each subset to the category in which the subset is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number of samples in the second subset is 100, and the number of samples in the third subset is 10, the value 40 is used to indicate the relevance of the first subset to the category of the first subset.
  • the electronic device sorts the plurality of subsets of each category according to the relevance of each subset to its category, and obtains a plurality of sorted subsets of each category.
  • For example, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number of samples in the second subset is 100, and the number of samples in the third subset is 10, then the plurality of sorted subsets of that category is: the second subset, the first subset, and the third subset.
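Using the number of samples as the relevance measure, the sorting in the example above amounts to (hypothetical Python; each subset is a list of samples):

```python
def sort_subsets_by_relevance(subsets):
    # Relevance of a subset to its category is measured by its sample count;
    # sorting from most to least relevant orders the subsets from simple
    # samples to difficult samples.
    return sorted(subsets, key=len, reverse=True)
```

For the example in the text (subsets of 40, 100, and 10 samples), this yields the order: second subset, first subset, third subset.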
  • the electronic device sequentially reads, from a plurality of sorted subsets of each category, a subset with the same sorting position as a training sample of the model, and trains the model.
  • the first subset of each category is read as training samples of the model, and the model is trained until a first termination condition is reached; then the second subset of each category is read and added to the training samples of the model, and the model continues to be trained, until all subsets of each category have been used as training samples.
  • the subsets of each category are sorted from high to low according to their relevance to the category, so that simple samples are ranked first; when training the model, the simple samples are easier to train, while the difficult samples, which are ranked later, are harder to train. This divides the training of the model into multiple subtasks, which are trained in order from easy to difficult according to their difficulty, so as to avoid the difficult training samples being rejected, so that the model can learn the characteristics of each category from easy to difficult, thereby improving the adaptability of the model parameters.
  • the higher the sorting position of a subset, the larger the weight corresponding to the subset.
  • For example, there are two categories: category A and category B.
  • the sorted subsets in category A are: subset A1, subset A2, the weight corresponding to subset A1 is 1, and the weight corresponding to subset A2 is 0.5.
  • the sorted subset in category B is subset B1 and subset B2, the weight corresponding to subset B1 is 1, and the weight corresponding to subset B2 is 0.5.
  • First, subset A1 and subset B1 are read to train the model. After the first termination condition is reached, subset A2 and subset B2 are read and added to the training samples of the model.
  • When subset A1, subset B1, subset A2, and subset B2 are all used as training samples, the model is trained until the end of training.
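The staged reading of same-ranked subsets, with rank-dependent weights as in the example above, can be sketched as follows (hypothetical Python; the training step itself is omitted):

```python
def curriculum_stages(sorted_subsets_by_category, weights):
    # Stage k adds the k-th ranked subset of every category to the training
    # set; each sample carries the weight of its subset's rank, so earlier
    # (simpler) subsets weigh more. Each stage's list is cumulative.
    n_stages = max(len(s) for s in sorted_subsets_by_category.values())
    training, stages = [], []
    for stage in range(n_stages):
        w = weights[stage] if stage < len(weights) else weights[-1]
        for category, subsets in sorted_subsets_by_category.items():
            if stage < len(subsets):
                training.extend((category, sample, w) for sample in subsets[stage])
        stages.append(list(training))
    return stages
```

In practice, the model would be trained on each stage's list until a termination condition is reached before moving to the next stage.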
  • a sample picture of each part of the vehicle is obtained, where the sample pictures of one part form one category; the samples of each part are processed by the sample data classification method in the first preferred embodiment to obtain a plurality of subsets for each part; the plurality of subsets of each part are sorted by the method in the second preferred embodiment; and the vehicle part identification model is trained based on the sorted subsets of each part.
  • the training of the vehicle part recognition model is divided into a plurality of subtasks, and according to the difficulty degree of the task, the training vehicle part recognition model is sequentially performed from easy to difficult, so as to avoid the difficult training samples being eliminated, thereby making the model easy to It is difficult to learn the characteristics of the sample pictures of various parts, thereby improving the adaptability of the model parameters.
  • the present application classifies the training sample data of each category into a plurality of subsets according to the difficulty level, so that the distance between the samples in the same subset becomes shorter, and the distance between the samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in the subset is to the category represented by the sample data; such samples are simple samples, and the model can easily learn their characteristics.
  • the sparser the samples of a subset and the more diverse the pictures they represent, the more complicated the data of the subset is considered to be; such samples are difficult samples.
  • the plurality of subsets of the training sample data are sorted from easy to difficult, so that the training of the model is divided into multiple subtasks, and the training is performed from easy to difficult according to the difficulty of the tasks, so as to avoid the difficult training samples being rejected, allowing the model to learn the characteristics of each category from easy to difficult, thereby increasing the adaptability of the model parameters.
  • the sample data classification device 3 includes, but is not limited to, one or more of the following modules: a calculation module 30, a determination module 31, and a clustering module 32.
  • a unit referred to in this application refers to a series of computer readable instruction segments that can be executed by a processor of the sample data classification device 3 and that are capable of performing a fixed function, which are stored in a memory. The function of each unit will be detailed in the subsequent embodiments.
  • the calculation module 30 calculates features of each sample of sample data.
  • the sample data includes, but is not limited to, pre-acquired data and data crawled from the network. In the process of large-scale sample data collection, some collected data has low correlation with the category the sample data is supposed to indicate, or is simply erroneous. In order to improve the accuracy of model training, it is necessary to classify the sample data, automatically detecting the simple samples that are easy to learn in the model training process and the difficult samples that are not easy to learn in the model training process, thus achieving classification of the sample data.
  • the calculation module 30 extracts features of each sample using a feature extraction model.
  • the feature extraction model includes, but is not limited to, a deep convolutional neural network model.
  • features of the sample data are extracted by a deep convolutional neural network.
  • For example, in any deep convolutional neural network (VGG-16, ResNet-50, etc.), the part in front of the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
  • the deep convolutional neural network model is composed of one input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer.
  • the model architecture of the deep convolutional neural network model is shown in FIG. 3, wherein Conv a-b (for example, Conv 3-64) indicates that the dimension of the convolution kernels of the layer is a×a and the number of convolution kernels of the layer is b; Maxpool2 indicates that the pooling kernel of the pooling layer has a dimension of 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., the fully connected layer) has c output nodes; Soft-max indicates that the classification layer classifies the input image using the Soft-max classifier.
  • the training sample is used for training learning to obtain a trained deep convolutional neural network model.
  • Importing the sample data into the trained deep convolutional neural network model can accurately and automatically extract features of each sample in the sample data.
  • the larger the size of the training samples, the more accurate the features extracted by the trained deep convolutional neural network model.
  • the deep convolutional neural network model can also be in other forms of expression, and the present application does not impose any limitation.
  • the calculation module 30 calculates a distance set for each sample based on the characteristics of each sample.
  • the distance set of each sample includes the distance between that sample and each of the remaining samples, wherein the remaining samples corresponding to a sample are all samples in the sample data other than that sample.
  • For example, if the sample data includes 3 samples, the distance set of each sample contains 2 distances, so the distance matrix is a 3*2 or 2*3 matrix.
  • the distance includes, without limitation, a Euclidean distance, a cosine distance, and the like.
  • Each distance value in the distance matrix is greater than zero. For example, when the calculated cosine distance is less than zero, the absolute value of the calculated cosine distance is taken.
  • the calculation module 30 calculates a density value for each sample and calculates a density distance value for each sample based on the distance set of each sample.
  • the calculation module 30 compares each distance in the distance set of each sample with a distance threshold, determines the number of distances greater than the distance threshold, and uses that number as the density value of the sample. The larger the density value of a sample, the more samples are similar to it.
  • the density value of any one of the samples is calculated as follows:
  • ρ i = Σ j≠i χ(D ij − d c ), where χ(x) = 1 when x > 0 and χ(x) = 0 otherwise
  • ρ i represents the density value of the ith sample in the sample data
  • D ij represents the distance between the ith sample and the jth sample in the sample data
  • d c represents the distance threshold
  • the calculating, by the calculation module 30, of the density distance value of each sample comprises:
  • the maximum distance in the distance set of the sample having the largest density value is selected as the density distance value of that sample.
  • for each sample in a second sample set, the minimum distance between the sample and any sample having a larger density value is selected as the density distance value of that sample, wherein the second sample set includes the samples of the sample data other than the sample having the largest density value.
  • the density distance value of each sample in the second sample set is calculated as follows:
  • δ i = min {j: ρ j > ρ i} D ij
  • δ i represents the density distance value of the i-th sample
  • ρ i represents the density value of the i-th sample
  • ρ j represents the density value of the j-th sample
  • D ij represents the distance between the i-th sample and the j-th sample
  • the determining module 31 determines at least one cluster center according to the density value of each sample and the density distance value of each sample.
  • the determining module 31 determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises:
  • calculating a cluster metric value for each sample, wherein the cluster metric value of each sample is equal to the product of the density value and the density distance value of the sample, i.e., γ i = ρ i × δ i.
  • the determining module 31 determining the at least one cluster center according to the cluster metric value of each sample includes:
  • selecting the samples whose cluster metric values are greater than a threshold as cluster centers.
  • the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values of all samples is calculated and taken as the threshold.
  • the clustering module 32 clusters the sample data into a plurality of subsets based on the at least one cluster center and features of each sample.
  • the clustering module 32 clusters the sample data into a plurality of subsets according to a distance set of samples corresponding to each of the cluster centers in the at least one cluster center.
  • the clustering algorithm includes, but is not limited to, a k-means clustering algorithm, a hierarchical clustering algorithm, and the like.
  • the determining module 31 determines a sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold as an error sample. In this way, erroneous samples can be effectively eliminated.
  • the larger the density value of a sample, the more samples are similar to it.
  • the larger the density distance value of a sample, the farther the subset in which the sample is located is from other subsets. Therefore, after clustering according to the above embodiment, the distance between samples in the same subset becomes shorter, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data in the subset is to the category represented by the sample data; such samples are simple samples, and the model can easily learn their characteristics.
  • clustering the sample data through the cluster center selected in the above embodiment can also effectively eliminate the erroneous samples, thereby improving the accuracy of the parameters of the subsequent training model.
  • the model training device 4 includes, but is not limited to, one or more of the following modules: a data acquisition module 40, a data clustering module 41, a correlation calculation module 42, a ranking module 43, and a training module 44.
  • a unit referred to in this application refers to a series of computer readable instruction segments that can be executed by a processor of the model training device 4 and that are capable of performing a fixed function, which is stored in a memory. The function of each unit will be detailed in the subsequent embodiments.
  • the data acquisition module 40 acquires sample data for each category.
  • the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model, which is used to identify which part of a vehicle the picture to be detected shows. In this case, sample data of each part of the vehicle needs to be obtained, and the sample data of one part belongs to one category.
  • the data clustering module 41 classifies the sample data for each category to obtain a plurality of subsets of each category.
  • the sample data of each category is classified by using the sample data classification method in the first preferred embodiment.
  • the data clustering module 41 is used to implement the sample data classification method in the first preferred embodiment, which is not described in detail herein.
  • the relevance calculation module 42 calculates the relevance of each subset of the plurality of subsets of each category to the category in which each subset is located.
  • the distance between samples in the same subset can be shortened, and the distance between samples in different subsets becomes larger.
  • the denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data in the subset is to the category in which the subset is located, and the higher the similarity; such samples are simple samples.
  • the sparser the samples of a subset, the more diverse the pictures they represent, and the more difficult the samples.
  • the number of samples included in each subset is taken as the relevance of each subset to the category in which the subset is located. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number of samples in the second subset is 100, and the number of samples in the third subset is 10, the value 40 is used to indicate the relevance of the first subset to the category of the first subset.
  • the sorting module 43 sorts the plurality of subsets of each category according to the relevance of each subset to its category, and obtains a plurality of sorted subsets of each category.
  • For example, three subsets are obtained after clustering: a first subset, a second subset, and a third subset. If the number of samples in the first subset is 40, the number of samples in the second subset is 100, and the number of samples in the third subset is 10, then the plurality of sorted subsets of that category is: the second subset, the first subset, and the third subset.
  • the training module 44 sequentially reads the subsets with the same sorting position from the plurality of sorted subsets of each category as training samples of the model, and trains the model.
  • the training module 44 reads the first subset of each category from the plurality of sorted subsets of each category as training samples of the model, and trains the model until the first termination condition is reached. Thereafter, the second subset of each category is read and added to the training samples of the model, and the model continues to be trained, until all subsets of each category have been used as training samples.
  • the training of the model is divided into multiple subtasks, and the training is performed from easy to difficult according to the difficulty of the tasks, so as to avoid the difficult training samples being eliminated, so that the model can learn the characteristics of each category from easy to difficult, thereby improving the adaptability of the model parameters.
  • the higher the sorting position of a subset, the larger the weight corresponding to the subset.
  • For example, there are two categories: category A and category B.
  • the sorted subsets in category A are: subset A1, subset A2, the weight corresponding to subset A1 is 1, and the weight corresponding to subset A2 is 0.5.
  • the sorted subset in category B is subset B1 and subset B2, the weight corresponding to subset B1 is 1, and the weight corresponding to subset B2 is 0.5.
• First, subset A1 and subset B1 are read to train the model. After the first termination condition is reached, subset A2 and subset B2 are read and added to the model's training samples, so that subsets A1, B1, A2, and B2 are all used as training samples; the model is then trained until training ends.
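The staged reading of subsets A1/B1 and then A2/B2 can be sketched as follows. The subset contents, weights, and per-stage termination test are illustrative assumptions, not details of the embodiment:

```python
def curriculum_schedule(categories):
    """Yield the cumulative weighted training set stage by stage.

    `categories` maps a category name to its sorted list of
    (subset, weight) pairs, easiest (highest relevance) first.
    Training on each yielded set would continue until that stage's
    termination condition is reached (termination test omitted here).
    """
    n_stages = max(len(subsets) for subsets in categories.values())
    train_set = []
    for stage in range(n_stages):
        for name, subsets in categories.items():
            if stage < len(subsets):
                subset, weight = subsets[stage]
                train_set.extend((sample, name, weight) for sample in subset)
        yield list(train_set)  # snapshot of the training set for this stage

# Toy data: two categories, two sorted subsets each, weights 1 and 0.5.
categories = {
    "A": [(["a1", "a2"], 1.0), (["a3"], 0.5)],
    "B": [(["b1"], 1.0), (["b2"], 0.5)],
}
stages = list(curriculum_schedule(categories))
```

Stage one contains only the highest-ranked (weight-1) subsets A1 and B1; stage two adds A2 and B2 so that all subsets are used.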
• For example, sample pictures of each vehicle part are obtained, with the pictures of one part constituting one category. The samples of each part are processed with the sample data classification method of the first preferred embodiment to obtain multiple subsets for each part; the multiple subsets of each part are sorted using the method of the second preferred embodiment; and the vehicle part recognition model is trained based on the sorted subsets of each part.
• The training of the vehicle part recognition model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard-to-train samples are not eliminated; the model learns the features of the sample pictures of each part from easy to difficult, thereby improving the adaptability of the model parameters.
• The present application classifies the training sample data of each category into a plurality of subsets according to difficulty level, so that samples within the same subset are closer together and samples in different subsets are farther apart.
• The denser the samples of a subset, the more similar the picture features they represent and the more closely the subset's data matches the category the sample data represents; these are simple samples, and the model can easily learn their characteristics.
• Conversely, if a subset's samples are sparse, the pictures they represent are more diverse; the subset's data is considered more complicated and constitutes difficult samples.
• The plurality of subsets of the training sample data are sorted from easy to difficult, so that training of the model is divided into multiple subtasks performed in order of difficulty, from easy to hard; hard-to-train samples are therefore not rejected, the model learns the characteristics of each category from easy to difficult, and the adaptability of the model parameters improves.
• The above-described integrated unit implemented in the form of a software program module can be stored in a non-volatile readable storage medium.
• The software program module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the method of each embodiment of the present application.
  • the electronic device 5 comprises at least one transmitting device 51, at least one memory 52, at least one processor 53, at least one receiving device 54, and at least one communication bus.
• The communication bus is used to implement connection and communication among these components.
• The electronic device 5 is a device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance; its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), embedded devices, and the like.
  • the electronic device 5 may also comprise a network device and/or a user device.
• The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud-computing-based cloud composed of a large number of hosts or network servers, where cloud computing is a form of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
• The electronic device 5 can be, but is not limited to, any electronic product that can interact with a user through a keyboard, a touch pad, or a voice control device, such as a tablet, a smartphone, a personal digital assistant (PDA), a smart wearable device, camera equipment, monitoring equipment, or other terminal.
  • the network in which the electronic device 5 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (VPN), and the like.
• The receiving device 54 and the sending device 51 may be wired transmission ports, or may be wireless devices including, for example, antenna devices, for performing data communication with other devices.
  • the memory 52 is used to store program code.
• The memory 52 may be a circuit with a storage function that has no physical form in the integrated circuit, such as RAM (Random-Access Memory) or FIFO (First In First Out).
• The memory 52 may also be a memory having a physical form, such as a memory stick, a TF card (Trans-flash Card), a smart media card, a secure digital card, a flash memory card, or another such storage device.
• The processor 53 may include one or more microprocessors and digital processors.
  • the processor 53 can call program code stored in the memory 52 to perform related functions.
• The modules described in FIG. 3 are program code stored in the memory 52 and executed by the processor 53 to implement a sample data classification method; and/or the modules described in FIG. 4 are program code stored in the memory 52 and executed by the processor 53 to implement a model training method.
• The processor 53, also known as a central processing unit (CPU), is a very-large-scale integrated circuit that serves as the computing core (Core) and control unit (Control Unit).
  • the embodiment of the present application further provides a non-volatile readable storage medium having stored thereon computer instructions that, when executed by an electronic device including one or more processors, cause the electronic device to perform the method as described above.
  • the memory 52 in the electronic device 5 stores a plurality of instructions to implement a sample data classification method, and the processor 53 can execute the plurality of instructions to implement:
• calculating a feature of each sample in the sample data; calculating a distance set for each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each of the remaining samples corresponding to it; calculating, according to the distance set of each sample, a density value of each sample and a density distance value of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
• When calculating the density value of each sample, the processor executing the plurality of instructions further implements:
• comparing each distance in each sample's distance set with a distance threshold, determining the number of distances greater than the distance threshold, and taking the distance count corresponding to each sample as that sample's density value.
• The calculating of the density distance value of each sample includes:
• for the sample having the largest density value, selecting the maximum distance from that sample's distance set as that sample's density distance value;
• When determining the at least one cluster center, the processor executing the plurality of instructions further implements:
  • At least one cluster center is determined based on the cluster metric of each sample.
  • the cluster metric value of each sample is equal to the product of the density value of each sample and the density distance value of each sample.
• When determining the at least one cluster center according to the cluster metric value of each sample, the processor executing the plurality of instructions further implements:
• sorting the cluster metric values from large to small and, from the sorted values, selecting the samples whose cluster metric values rank within a preset number of top positions as cluster center points; or
• selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
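A minimal sketch of this cluster-center selection, covering both the top-k and the threshold variant; the density values, density distance values, k, and threshold below are made up for illustration:

```python
def cluster_centers(rho, delta, k=None, threshold=None):
    """Select cluster centers by the metric gamma_i = rho_i * delta_i:
    either the top-k samples after sorting gamma in descending order,
    or all samples whose gamma exceeds a threshold."""
    gamma = [r * d for r, d in zip(rho, delta)]
    if k is not None:
        order = sorted(range(len(gamma)), key=lambda i: gamma[i], reverse=True)
        return sorted(order[:k])
    return [i for i, g in enumerate(gamma) if g > threshold]

rho = [5, 1, 4, 2]            # density values (toy)
delta = [0.9, 0.1, 0.8, 0.2]  # density distance values (toy)
top_k_centers = cluster_centers(rho, delta, k=2)
thresholded_centers = cluster_centers(rho, delta, threshold=1.0)
# Both variants pick samples 0 and 2 here (gamma = 4.5 and 3.2).
```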
• The processor executing the plurality of instructions further implements:
• determining as an error sample any sample whose distance from every cluster center of the at least one cluster center exceeds a distance threshold.
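A sketch of this error-sample rule; the distances and threshold are illustrative values, not parameters from the embodiment:

```python
def error_samples(dist_to_centers, dist_threshold):
    """A sample is flagged as an error sample when its distance to
    every cluster center exceeds the distance threshold."""
    return [i for i, dists in enumerate(dist_to_centers)
            if all(d > dist_threshold for d in dists)]

# Rows: samples; columns: distance to each of two cluster centers (toy).
dist_to_centers = [
    [0.2, 0.9],   # close to center 0 -> kept
    [0.8, 0.1],   # close to center 1 -> kept
    [0.9, 0.95],  # far from every center -> error sample
]
flagged = error_samples(dist_to_centers, dist_threshold=0.5)
```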
• The plurality of instructions corresponding to the sample data classification method in any embodiment are stored in the memory 52 and executed by the processor 53, and are not described in detail herein.
  • the memory 52 in the electronic device 5 stores a plurality of instructions to implement a model training method
  • the processor 53 can execute the plurality of instructions to implement:
• acquiring sample data of each category; classifying the sample data of each category to obtain multiple subsets of each category; calculating the relevance of each subset in each category's multiple subsets to the category that subset belongs to; sorting, from high to low relevance, the multiple subsets of each category to obtain multiple sorted subsets of each category; and sequentially reading, from the multiple sorted subsets of each category, the subsets at the same sorting position as training samples of the model, and training the model.
• A subset at a higher ranking position corresponds to a greater weight.
• The above-described characteristic means of the present application can be implemented by an integrated circuit that controls the functions of the sample data classification method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: calculating the features of each sample in the sample data; calculating a distance set for each sample according to its features, the distance set including the distance between that sample and each of the remaining samples corresponding to it; calculating, from each sample's distance set, its density value and density distance value; determining at least one cluster center according to the density value and density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
• The functions that can be implemented by the sample data classification method in any embodiment can be installed in the electronic device through the integrated circuit of the present application, enabling the electronic device to perform the functions implemented by the sample data classification method in any embodiment; these are not detailed here.
• Similarly, the above-described characteristic means of the present application can be implemented by an integrated circuit that controls the functions of the model training method in any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: acquiring sample data of each category; classifying the sample data of each category to obtain multiple subsets of each category; calculating the relevance of each subset to the category it belongs to; sorting the multiple subsets of each category by relevance from high to low to obtain multiple sorted subsets of each category; and sequentially reading, from the multiple sorted subsets of each category, the subsets at the same sorting position as training samples of the model, and training the model.
• The functions that can be implemented by the model training method in any embodiment can be installed in the electronic device through the integrated circuit of the present application, enabling the electronic device to perform the functions implemented by the model training method in any embodiment; these are not detailed here.
  • the disclosed apparatus may be implemented in other ways.
  • the device embodiments described above are merely illustrative.
• The division into units is only a logical functional division; in actual implementation there may be other ways of dividing. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed.
• The mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical or other form.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a non-volatile readable storage medium.
  • a computer device which may be a personal computer, server or network device, etc.
  • the foregoing storage medium includes: a U disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present application provides a sample data classification method. The method comprises: calculating features of each sample in sample data; according to the features of each sample, calculating a distance set for said sample, the distance set for said sample comprising a distance between said sample and each of the remaining samples corresponding to said sample; according to the distance set of each sample, calculating a density value of said sample and calculating a density distance value of said sample; determining at least one cluster center according to the density value of each sample and the density distance value of said sample; clustering the sample data into a plurality of subsets on the basis of the at least one cluster center and the features of each sample. The present application further provides a model training method using the sample data classification method and an electronic device. In the present application, training is performed from easy to difficult in sequence according to the degree of difficulty of tasks, to prevent difficult-to-train samples from being eliminated, thereby improving the adaptability of model parameters.

Description

Sample data classification method, model training method, electronic device and storage medium
This application claims priority to the Chinese patent application with application number 201810350730.2, entitled "Sample data classification method, model training method, electronic device and storage medium", filed with the Chinese Patent Office on April 18, 2018, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of data processing, and in particular, to a sample data classification method, a model training method, an electronic device, and a storage medium.
Background
In large-scale data collection, noise samples (for example, irrelevant or mislabeled sample data) inevitably appear. Algorithms that handle data containing many erroneous labels are generally designed to be robust to noise: the model automatically detects highly correlated samples and noisy samples, discards the erroneous labels, and then trains. The drawback of this approach is that hard-to-train samples are difficult to distinguish from erroneous samples, so hard-to-train samples are rejected even though they are very important for improving model performance.
Summary
In view of the above, it is necessary to provide a sample data classification method, a model training method, an electronic device, and a storage medium that can train a vehicle part recognition model in order from easy to difficult, so that hard-to-train samples are not rejected; the model thus learns the features of the sample pictures of each part from easy to difficult, improving the adaptability of the model parameters.
A sample data classification method, the method comprising:

calculating features of each sample in sample data;

calculating a distance set for each sample according to the features of each sample, the distance set of each sample including the distance between that sample and each of the remaining samples corresponding to it;

calculating, according to the distance set of each sample, a density value of each sample and a density distance value of each sample;

determining at least one cluster center according to the density value of each sample and the density distance value of each sample;

clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
A model training method, the method comprising:

acquiring sample data of each category;

classifying the sample data of each category using the sample data classification method described in any embodiment, to obtain a plurality of subsets of each category;

calculating the relevance of each subset in each category's plurality of subsets to the category that subset belongs to;

sorting the plurality of subsets of each category from high to low according to each subset's relevance to its category, to obtain a plurality of sorted subsets of each category;

sequentially reading, from the plurality of sorted subsets of each category, the subsets at the same sorting position as training samples of the model, and training the model.
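The relevance-based sorting step can be sketched as follows; the relevance scores here are illustrative placeholders, since the embodiment computes them from the subsets themselves:

```python
def sort_subsets_by_relevance(subsets, relevance):
    """Sort one category's subsets by their relevance to that
    category, from high to low."""
    order = sorted(range(len(subsets)), key=lambda i: relevance[i],
                   reverse=True)
    return [subsets[i] for i in order]

# Toy subsets of one category and made-up relevance scores.
subsets = [["s1"], ["s2", "s3"], ["s4"]]
relevance = [0.4, 0.9, 0.7]
ranked = sort_subsets_by_relevance(subsets, relevance)
# Most relevant subset first, least relevant last.
```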
An electronic device, the electronic device comprising a memory and a processor, the memory being configured to store at least one instruction, and the processor being configured to execute the at least one instruction to implement the sample data classification method described in any embodiment and/or the model training method described in any embodiment.

A non-volatile readable storage medium storing at least one instruction that, when executed by a processor, implements the sample data classification method described in any embodiment and/or the model training method described in any embodiment.
As can be seen from the above technical solutions, the present application calculates the features of each sample in the sample data; calculates a distance set for each sample according to its features, the distance set including the distance between that sample and each of the remaining samples corresponding to it; calculates, from each sample's distance set, its density value and density distance value; determines at least one cluster center according to the density value and density distance value of each sample; and clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample. The present application trains in order of task difficulty, from easy to hard, so that hard-to-train samples are not eliminated, thereby improving the adaptability of the model parameters.
Brief description of the drawings
FIG. 1 is a flowchart of a first preferred embodiment of the sample data classification method of the present application.

FIG. 2 is a flowchart of a first preferred embodiment of the model training method of the present application.

FIG. 3 is a program module diagram of a first preferred embodiment of the sample data classification device of the present application.

FIG. 4 is a program module diagram of a first preferred embodiment of the model training device of the present application.

FIG. 5 is a schematic structural diagram of a preferred embodiment of an electronic device in at least one example of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.

To make the above objects, features, and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.

To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them; all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the scope of protection of the present application.

The terms "first", "second", "third", and the like in the specification, claims, and drawings of the present application are used to distinguish different objects, not to describe a particular order. Moreover, the term "comprise" and any variants thereof are intended to cover non-exclusive inclusion: a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or other steps or units inherent to the process, method, product, or device.
FIG. 1 is a flowchart of a first preferred embodiment of the sample data classification method of the present application. The order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.

S10: The electronic device calculates features of each sample in the sample data.
In an optional embodiment, the sample data includes, but is not limited to, pre-collected data and data crawled from the network. In large-scale sample data collection, data that is weakly related to the category represented by the sample data, or simply erroneous, will appear. To improve the accuracy of subsequent model training, the sample data must be classified so as to automatically detect simple samples whose features are easy to learn during model training and difficult samples whose features are not, thereby achieving classification of the sample data.
Preferably, the features of each sample are extracted using a feature extraction model. Further, the feature extraction model includes, but is not limited to, a deep convolutional neural network model. Features are extracted from the sample data by a deep convolutional neural network; for example, in any network (VGG-16, ResNet-50, etc.), the layer before the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
In this embodiment, the deep convolutional neural network model consists of 1 input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers, and 1 classification layer. The model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the layer's convolution kernels have dimension a×a and that there are b of them; Maxpool2 indicates that the pooling layer's pooling kernel has dimension 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., fully connected layer) has c output nodes; and Soft-max indicates that the classification layer uses a Soft-max classifier to classify the input image.
In this embodiment, the trained deep convolutional neural network model is obtained by training on training samples. Inputting the sample data into the trained deep convolutional neural network model accurately and automatically extracts the features of each sample in the sample data. In general, the larger the training set, the more accurate the features extracted by the trained model. Of course, the deep convolutional neural network model may also take other forms, and the present application imposes no limitation.
S11: The electronic device calculates a distance set for each sample according to the features of each sample.

Preferably, the distance set of each sample includes the distance between that sample and each of the remaining samples corresponding to it, where the remaining samples are all samples in the sample data other than that sample. For example, if the sample data contains three samples, sample A, sample B, and sample C, then for sample A the distances from A to B and from A to C are calculated; for sample B, the distances from B to A and from B to C; and for sample C, the distances from C to A and from C to B. The distance matrix is thus a 3×2 (or 2×3) matrix.

Further, the distances include, but are not limited to, Euclidean distance, cosine distance, and so on. Every distance value in the distance matrix is greater than 0; for example, when a computed cosine distance is less than 0, its absolute value is taken.
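A minimal sketch of computing each sample's distance set with Euclidean distance; the three feature vectors are toy values standing in for samples A, B, and C:

```python
import math

def distance_sets(features):
    """For each sample, return its distances to every remaining sample
    (Euclidean distance here; cosine distance would also fit the text)."""
    def euclid(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [[euclid(f, g) for j, g in enumerate(features) if j != i]
            for i, f in enumerate(features)]

# Toy 2-D feature vectors for samples A, B, C.
features = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
dists = distance_sets(features)
# Three samples each yield two distances, i.e. a 3x2 distance matrix.
```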
S12: The electronic device calculates, according to the distance set of each sample, a density value of each sample and a density distance value of each sample.

Preferably, each distance in a sample's distance set is compared with a distance threshold, the number of distances greater than the distance threshold is determined, and the distance count corresponding to each sample is taken as that sample's density value. The larger a sample's density value, the more samples are similar to it.

Specifically, for any one sample, its density value is calculated as follows:
ρ_i = Σ_{j≠i} χ(D_ij − d_c),  where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise
where ρ_i denotes the density value of the i-th sample in the sample data, D_ij denotes the distance between the i-th sample and the j-th sample in the sample data, and d_c denotes the distance threshold.
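The density computation can be sketched as follows, using per-sample distance sets with made-up values and following the stated rule of counting the distances greater than the threshold d_c:

```python
def density_values(dist_sets, d_c):
    """rho_i = number of distances in sample i's distance set that
    exceed the threshold d_c, per the formula above."""
    return [sum(1 for d in dists if d > d_c) for dists in dist_sets]

# Distance set per sample (distances to the remaining samples); toy
# values derived from a consistent 4-sample pairwise distance matrix.
dist_sets = [
    [0.9, 0.8, 0.2],
    [0.9, 0.1, 0.3],
    [0.8, 0.1, 0.7],
    [0.2, 0.3, 0.7],
]
rho = density_values(dist_sets, d_c=0.5)
```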
Preferably, calculating the density distance value of each sample comprises:
(1) For the sample with the largest density value, selecting the maximum distance in that sample's distance set as its density distance value.
(2) For any sample in the second sample set, determining the samples whose density values are greater than that sample's density value; according to that sample's distance set, determining the smallest distance from it to any of those higher-density samples, and taking that smallest distance as that sample's density distance value. The second sample set comprises all samples in the sample data other than the sample with the largest density value.
Specifically, the density distance value of each sample is calculated as follows:
δi = max_{j≠i} Dij, if ρi is the maximum density value; δi = min_{j: ρj > ρi} Dij, otherwise
where δi denotes the density distance value of the i-th sample, ρi denotes the density value of the i-th sample, ρj denotes the density value of the j-th sample, and Dij denotes the distance between the i-th sample and the j-th sample.
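The two-case rule above can be sketched as follows. How ties for the maximum density are broken is not stated in the text; falling back to the maximum-distance rule for them is an assumption made here:

```python
import numpy as np

def density_distance_values(D, rho):
    """Density distance of each sample: the largest distance in its
    distance set for the highest-density sample; otherwise the smallest
    distance to any sample of strictly higher density."""
    n = D.shape[0]
    delta = np.zeros(n)
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            delta[i] = min(D[i, j] for j in higher)
        else:
            # No strictly denser sample exists (the maximum-density sample,
            # or a tie for it): use the maximum-distance rule. The tie
            # handling is an assumption, not stated in the text.
            delta[i] = max(D[i, j] for j in range(n) if j != i)
    return delta
```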
S13. The electronic device determines at least one cluster center according to the density value and the density distance value of each sample.
Preferably, determining at least one cluster center according to the density value and the density distance value of each sample comprises:
A. Calculating a cluster metric value for each sample according to its density value and its density distance value.
Further, the cluster metric value of each sample equals the product of its density value and its density distance value.
B. Determining at least one cluster center according to the cluster metric value of each sample.
Further, determining at least one cluster center according to the cluster metric value of each sample comprises:
(1) Sorting the cluster metric values of the samples from largest to smallest, and selecting the samples ranked within a preset number of top positions (for example, the top three) as cluster center points.
(2) Selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
Further, the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
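Steps A and B can be sketched together as follows, covering both selection variants (top positions, and mean-valued threshold); the function name and `top_k` parameter are illustrative assumptions:

```python
import numpy as np

def select_cluster_centers(rho, delta, top_k=None):
    """Cluster metric gamma_i = rho_i * delta_i.  Variant (1): keep the
    top_k samples by gamma.  Variant (2), used when top_k is None: keep
    every sample whose gamma exceeds the mean gamma (the threshold)."""
    gamma = np.asarray(rho, dtype=float) * np.asarray(delta, dtype=float)
    if top_k is not None:
        return list(np.argsort(-gamma)[:top_k])      # largest gamma first
    return [i for i, g in enumerate(gamma) if g > gamma.mean()]
```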
S14. The electronic device clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
Preferably, the sample data is clustered into a plurality of subsets by a clustering algorithm according to the distance set of the sample corresponding to each of the at least one cluster center.
Further, the clustering algorithm includes, but is not limited to, the k-means clustering algorithm, hierarchical clustering algorithms, and so on.
Further, a sample whose distance from every one of the at least one cluster center exceeds a distance threshold is determined to be an error sample. Error samples can thus be effectively excluded.
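A simplified nearest-center assignment, standing in for the k-means or hierarchical step and applying the error-sample rule above; the function name and `d_err` parameter are illustrative assumptions:

```python
import numpy as np

def assign_to_centers(D, centers, d_err):
    """Each non-center sample joins the subset of its closest cluster
    center; a sample farther than d_err from every center is flagged
    as an error sample and excluded from all subsets."""
    subsets = {c: [c] for c in centers}
    errors = []
    for i in range(D.shape[0]):
        if i in subsets:
            continue
        nearest = min(centers, key=lambda c: D[i, c])
        if D[i, nearest] > d_err:
            errors.append(i)              # too far from all centers
        else:
            subsets[nearest].append(i)
    return subsets, errors
```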
In the above embodiment, the larger a sample's density value, the more samples are similar to it; the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. Clustering according to the above embodiment therefore shortens the distances between samples within the same subset and enlarges the distances between samples in different subsets. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data of that subset is to the category represented by the sample data; such a subset consists of simple samples, whose features a model can easily learn. Conversely, the sparser the samples of a subset, the more diverse the pictures, and the data of that subset is considered more complex, i.e., difficult samples. Moreover, clustering the sample data with the cluster centers selected by the above embodiment also effectively excludes erroneous samples, thereby improving the accuracy of subsequently trained model parameters.
As shown in FIG. 2, which is a flowchart of a second preferred embodiment of the model training method of the present application, the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
S20. The electronic device acquires sample data of each category.
In this embodiment, the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle appears in the picture to be detected. Sample data of each part of the vehicle therefore needs to be acquired, and the sample data of one part belongs to one category.
S21. The electronic device classifies the sample data of each category to obtain a plurality of subsets of each category, the sample data of each category being classified by the method of the first preferred embodiment.
In this embodiment, the processing of step S21 is the same as the data classification method of the first preferred embodiment and is not detailed again here.
S22. The electronic device calculates, for each of the plurality of subsets of each category, the relevance of that subset to the category to which it belongs.
After clustering according to the above embodiment, for each category the distances between samples within the same subset become shorter and the distances between samples in different subsets become larger. The denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data of that subset is to the category to which it belongs and the higher the similarity; such samples are simple samples. Conversely, the sparser the samples of a subset, the more diverse the pictures, and the samples are difficult samples.
Preferably, for each category, the number of samples contained in each subset is taken as that subset's relevance to the category. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset and a third subset. If the first subset contains 40 samples, the second subset 100 samples and the third subset 10 samples, the value 40 represents the relevance of the first subset to the category to which it belongs.
S23. The electronic device sorts the plurality of subsets of each category from high to low according to each subset's relevance to the category, obtaining a plurality of sorted subsets for each category.
For example, for one category, three subsets are obtained after clustering: a first subset, a second subset and a third subset. If the first subset contains 40 samples, the second subset 100 samples and the third subset 10 samples, the plurality of sorted subsets of that category is: the second subset, the first subset, the third subset.
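Since relevance is the sample count, the sort above reduces to ordering the subsets by size; a one-line sketch (the function name is an illustrative assumption):

```python
def sort_subsets_by_relevance(subsets):
    """Relevance of a subset to its category is its sample count, so
    sorting by relevance from high to low is sorting by size."""
    return sorted(subsets, key=len, reverse=True)
```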
S24. The electronic device sequentially reads, from the plurality of sorted subsets of each category, the subsets at the same sorted position as training samples of the model, and trains the model.
Preferably, from the plurality of sorted subsets of each category, the first subset of each category is read as the training samples of the model and the model is trained; after a first termination condition is reached, the second subset of each category is read and added to the training samples of the model, and training continues until all subsets of every category have served as training samples. Because the plurality of subsets of each category are sorted from high to low by their relevance to the category, the simple samples come first; simple samples are easier to train on, while the difficult samples, ranked later, are harder to train on. The training of the model is thus divided into multiple subtasks that are performed in order from easy to difficult according to task difficulty, so that difficult training samples are not discarded, and the model learns the features of each category from easy to difficult, improving the adaptability of the model parameters.
Further, among the plurality of sorted subsets, the earlier a subset's sorted position, the larger its corresponding weight. Samples of higher similarity thus carry larger weights, so more of their features are learned during training, improving the accuracy of the model parameters. Subsets of high confidence can therefore be relied upon to improve the accuracy of model recognition.
For example, there are two categories, category A and category B. The sorted subsets of category A are subset A1 and subset A2, with corresponding weights 1 and 0.5; the sorted subsets of category B are subset B1 and subset B2, with corresponding weights 1 and 0.5. Subset A1 and subset B1 are first read to train the model; after the first termination condition is reached, subset A2 and subset B2 are read and added to the training samples of the model, so that subsets A1, B1, A2 and B2 all serve as training samples, and the model is trained until training ends.
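The staged reading of subsets with rank-based weights can be sketched as follows (the actual training step is omitted; the function name and the (sample, category, weight) tuple layout are illustrative assumptions):

```python
def curriculum_stages(sorted_subsets, weights):
    """Training samples for each stage: stage k uses the first k+1 ranked
    subsets of every category, and each sample carries the weight of its
    subset's rank (earlier rank -> larger weight)."""
    n_stages = max(len(s) for s in sorted_subsets.values())
    stages = []
    for k in range(n_stages):
        batch = []
        for category, subsets in sorted_subsets.items():
            for rank, subset in enumerate(subsets[:k + 1]):
                batch += [(sample, category, weights[rank]) for sample in subset]
        stages.append(batch)
    return stages
```

Each stage's batch is a superset of the previous one, matching the example: stage 1 trains on A1 and B1, stage 2 on A1, B1, A2 and B2.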
An application scenario of training a vehicle part identification model by the above method is exemplified as follows:
First, sample pictures of each part of the vehicle are acquired; the samples of one part form one category of pictures. The samples of each part are processed by the sample data classification method of the first preferred embodiment to obtain a plurality of subsets of each part, the plurality of subsets of each part are sorted by the method of the second preferred embodiment, and the vehicle part identification model is trained on the plurality of sorted subsets of each part. The training of the vehicle part identification model is thus divided into multiple subtasks performed in order from easy to difficult according to task difficulty, so that difficult training samples are not discarded and the model learns the features of the sample pictures of each part from easy to difficult, improving the adaptability of the model parameters.
As can be seen from the above embodiments, the present application classifies the training sample data of each category into a plurality of subsets by difficulty, shortening the distances between samples within the same subset and enlarging the distances between samples in different subsets. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data of that subset is to the category represented by the sample data; such subsets are simple samples, whose features a model easily learns. Conversely, the sparser the samples of a subset, the more diverse the pictures, and the data of that subset is considered more complex, i.e., difficult samples. The plurality of subsets of the training sample data are then sorted from easy to difficult, so that the training of the model is divided into multiple subtasks performed in order from easy to difficult according to task difficulty; difficult training samples are thus not discarded, the model learns the features of each category from easy to difficult, and the adaptability of the model parameters is improved.
As shown in FIG. 3, which is a program module diagram of a first preferred embodiment of the sample data classification device of the present application, the sample data classification device 3 includes, but is not limited to, one or more of the following modules: a calculation module 30, a determination module 31 and a clustering module 32. A unit referred to in this application is a series of computer-readable instruction segments that can be executed by the processor of the sample data classification device 3 to perform a fixed function, and that are stored in a memory. The functions of the units are detailed in the subsequent embodiments.
The calculation module 30 calculates the features of each sample of the sample data.
In an optional embodiment, the sample data includes, but is not limited to, pre-collected data and data crawled from the network. In large-scale sample data collection, data of low relevance to the category represented by the sample data, or erroneous data, may therefore appear. To improve the accuracy of subsequent model training, the sample data needs to be classified, automatically detecting the simple samples whose features are easily learned during model training and the difficult samples whose features are not easily learned, thereby classifying the sample data.
Preferably, the calculation module 30 extracts the features of each sample with a feature extraction model. Further, the feature extraction model includes, but is not limited to, a deep convolutional neural network model. Features are extracted from the sample data by a deep convolutional neural network; for example, in any network (VGG-16, ResNet-50, etc.), the layer before the Soft-max classification layer can be regarded as a feature extractor, and the output of that layer is taken as the extracted features.
In this embodiment, the deep convolutional neural network model consists of 1 input layer, 20 convolution layers, 6 pooling layers, 3 hidden layers and 1 classification layer. The model architecture of the deep convolutional neural network model is shown in FIG. 3, where Conv a-b (for example, Conv 3-64) indicates that the convolution kernels of the layer have dimension a×a and the layer has b convolution kernels; Maxpool2 indicates that the pooling kernel of the pooling layer has dimension 2×2; FC-c (for example, FC-6) indicates that the hidden layer (i.e., fully connected layer) has c output nodes; and Soft-max indicates that the classification layer classifies the input image with a Soft-max classifier.
In this embodiment, the trained deep convolutional neural network model is obtained by training and learning on training samples. Inputting the sample data into the trained deep convolutional neural network model accurately and automatically extracts the features of each sample in the sample data. In general, the larger the scale of the training samples, the more accurate the features extracted by the trained deep convolutional neural network model. Of course, the deep convolutional neural network model may also take other forms, and the present application imposes no limitation.
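The "take the output of the layer before Soft-max" idea can be illustrated with a toy fully connected stand-in for the convolutional network described above; representing the network as a list of (W, b) weight/bias pairs is purely an assumption for this sketch:

```python
import numpy as np

def extract_features(x, layers):
    """Run the input through every layer except the last one (the
    Soft-max classification layer) and return the final hidden
    activation as the sample's feature vector."""
    h = x
    for W, b in layers[:-1]:            # stop before the classification layer
        h = np.maximum(0.0, h @ W + b)  # ReLU activation of a dense layer
    return h
```

In practice the weights would come from a pretrained network such as VGG-16 or ResNet-50, and `extract_features` would be applied to every sample before computing the distance matrix.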
The calculation module 30 calculates the distance set of each sample according to the features of each sample.
Preferably, the distance set of each sample comprises the distance between that sample and each of its remaining samples, where the remaining samples corresponding to a sample are all samples in the sample data other than that sample. For example, if the sample data contains 3 samples, sample A, sample B and sample C, then for sample A the distances from sample A to sample B and to sample C are computed; for sample B, the distances from sample B to sample A and to sample C; and for sample C, the distances from sample C to sample A and to sample B. The distance matrix is thus a 3*2 (or 2*3) matrix.
Further, the distance includes, but is not limited to, the Euclidean distance, the cosine distance, and so on. Each distance value in the distance matrix is greater than 0; for example, when a computed cosine distance is less than 0, its absolute value is taken.
The calculation module 30 calculates the density value and the density distance value of each sample according to that sample's distance set.
Preferably, the calculation module 30 compares each distance in a sample's distance set with a distance threshold, counts the number of distances greater than the distance threshold, and takes that count as the sample's density value. The larger a sample's density value, the more samples are similar to it.
Specifically, for any sample, its density value is calculated as follows:
ρi = Σj≠i χ(Dij − dc), where χ(x) = 1 if x > 0 and χ(x) = 0 otherwise
where ρi denotes the density value of the i-th sample in the sample data, Dij denotes the distance between the i-th sample and the j-th sample in the sample data, and dc denotes the distance threshold.
Preferably, the calculation module 30 calculating the density distance value of each sample comprises:
(1) For the sample with the largest density value, selecting the maximum distance in that sample's distance set as its density distance value.
(2) For any sample in the second sample set, determining the samples whose density values are greater than that sample's density value; according to that sample's distance set, determining the smallest distance from it to any of those higher-density samples, and taking that smallest distance as that sample's density distance value. The second sample set comprises all samples in the sample data other than the sample with the largest density value.
Specifically, the density distance value of each sample is calculated as follows:
δi = max_{j≠i} Dij, if ρi is the maximum density value; δi = min_{j: ρj > ρi} Dij, otherwise
where δi denotes the density distance value of the i-th sample, ρi denotes the density value of the i-th sample, ρj denotes the density value of the j-th sample, and Dij denotes the distance between the i-th sample and the j-th sample.
The determination module 31 determines at least one cluster center according to the density value and the density distance value of each sample.
Preferably, the determination module 31 determining at least one cluster center according to the density value and the density distance value of each sample comprises:
A. Calculating a cluster metric value for each sample according to its density value and its density distance value.
Further, the cluster metric value of each sample equals the product of its density value and its density distance value.
B. Determining at least one cluster center according to the cluster metric value of each sample.
Further, the determination module 31 determining at least one cluster center according to the cluster metric value of each sample comprises:
(1) Sorting the cluster metric values of the samples from largest to smallest, and selecting the samples ranked within a preset number of top positions (for example, the top three) as cluster center points.
(2) Selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
Further, the threshold is configured according to the cluster metric values of the samples; for example, the mean of the cluster metric values is calculated and taken as the threshold.
The clustering module 32 clusters the sample data into a plurality of subsets based on the at least one cluster center and the features of each sample.
Preferably, the clustering module 32 clusters the sample data into a plurality of subsets by a clustering algorithm according to the distance set of the sample corresponding to each of the at least one cluster center.
Further, the clustering algorithm includes, but is not limited to, the k-means clustering algorithm, hierarchical clustering algorithms, and so on.
Further, the determination module 31 determines a sample whose distance from every one of the at least one cluster center exceeds a distance threshold to be an error sample. Error samples can thus be effectively excluded.
In the above embodiment, the larger a sample's density value, the more samples are similar to it; the larger a sample's density distance value, the farther the subset containing that sample is from the other subsets. Clustering according to the above embodiment therefore shortens the distances between samples within the same subset and enlarges the distances between samples in different subsets. The denser the samples of a subset, the more similar the features of the pictures they represent, and the more similar the data of that subset is to the category represented by the sample data; such a subset consists of simple samples, whose features a model can easily learn. Conversely, the sparser the samples of a subset, the more diverse the pictures, and the data of that subset is considered more complex, i.e., difficult samples. Moreover, clustering the sample data with the cluster centers selected by the above embodiment also effectively excludes erroneous samples, thereby improving the accuracy of subsequently trained model parameters.
As shown in FIG. 4, which is a program module diagram of a first preferred embodiment of the model training device of the present application, the model training device 4 includes, but is not limited to, one or more of the following modules: a data acquisition module 40, a data clustering module 41, a relevance calculation module 42, a sorting module 43 and a training module 44. A unit referred to in this application is a series of computer-readable instruction segments that can be executed by the processor of the model training device 4 to perform a fixed function, and that are stored in a memory. The functions of the units are detailed in the subsequent embodiments.
The data acquisition module 40 acquires the sample data of each category.
In this embodiment, the trained model is used to identify the category to which a picture to be detected belongs. For example, the model is a vehicle part identification model used to identify which part of a vehicle appears in the picture to be detected. Sample data of each part of the vehicle therefore needs to be acquired, and the sample data of one part belongs to one category.
The data clustering module 41 classifies the sample data of each category to obtain a plurality of subsets of each category, the sample data of each category being classified by the method of the first preferred embodiment.
In this embodiment, the data clustering module 41 implements the sample data classification method of the first preferred embodiment, which is not detailed again here.
The relevance calculation module 42 calculates, for each of the plurality of subsets of each category, the relevance of that subset to the category to which it belongs.
After clustering according to the above embodiment, for each category the distances between samples within the same subset become shorter and the distances between samples in different subsets become larger. The denser the samples of a subset, the more similar the features of the pictures they represent, the more relevant the data of that subset is to the category to which it belongs and the higher the similarity; such samples are simple samples. Conversely, the sparser the samples of a subset, the more diverse the pictures, and the samples are difficult samples.
Preferably, for each category, the number of samples contained in each subset is taken as that subset's relevance to the category. For example, for one category, three subsets are obtained after clustering: a first subset, a second subset and a third subset. If the first subset contains 40 samples, the second subset 100 samples and the third subset 10 samples, the value 40 represents the relevance of the first subset to the category to which it belongs.
The sorting module 43 sorts, for each category, the category's subsets by their relevance to the category from high to low, obtaining a plurality of sorted subsets for each category.

For example, if clustering a category yields a first subset with 40 samples, a second subset with 100 samples, and a third subset with 10 samples, the sorted subsets of that category are: the second subset, the first subset, the third subset.
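The relevance and sorting steps above can be sketched directly. This is a minimal illustration under the stated preference that a subset's relevance to its category is simply its sample count; the function name and data layout are illustrative, not from the original text.

```python
def sort_subsets_by_relevance(subsets):
    """Sort a category's subsets from high to low relevance, where the
    relevance of a subset to its category is its number of samples."""
    return sorted(subsets, key=len, reverse=True)

# The example above: subsets of 40, 100, and 10 samples sort as
# second subset, first subset, third subset.
first, second, third = ["s"] * 40, ["s"] * 100, ["s"] * 10
ordered = sort_subsets_by_relevance([first, second, third])
```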
The training module 44 reads, in turn, the subsets at the same sorting position from the sorted subsets of every category as training samples of the model, and trains the model.
Preferably, the training module 44 reads the first subset of each category from the sorted subsets as training samples of the model and trains the model. After a first termination condition is reached, it reads the second subset of each category, adds it to the training samples, and continues training, until all subsets of every category have been used as training samples. Training of the model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard training samples are not discarded: the model learns the features of each category from easy to hard, which improves the adaptability of the model parameters.
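The staged training described above can be sketched as a loop over sorting positions. This is a schematic illustration only: `train_step` and `stop_condition` are hypothetical callables standing in for one optimization step and for each stage's termination condition, neither of which is specified in the text.

```python
def curriculum_train(model, sorted_subsets_per_class, train_step, stop_condition):
    """At stage k, add the k-th ranked subset of every category to the
    training pool, then train on the accumulated pool until the stage's
    termination condition is met."""
    n_stages = max(len(s) for s in sorted_subsets_per_class.values())
    pool = []
    for stage in range(n_stages):
        # Add this stage's subset from every category to the training pool.
        for category, subsets in sorted_subsets_per_class.items():
            if stage < len(subsets):
                pool.extend((sample, category) for sample in subsets[stage])
        # Train on everything accumulated so far, easy subsets first.
        while not stop_condition(model, stage):
            train_step(model, pool)
    return model
```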
Further, among the sorted subsets, the earlier a subset's sorting position, the larger the weight assigned to it. Samples with higher similarity thus carry larger weights, so the model learns more features from them during training, which improves the accuracy of the model parameters. The high-confidence subsets can therefore be relied upon to improve the recognition accuracy of the model.
For example, suppose there are two categories, category A and category B. The sorted subsets of category A are subset A1 (weight 1) and subset A2 (weight 0.5); the sorted subsets of category B are subset B1 (weight 1) and subset B2 (weight 0.5). Subsets A1 and B1 are read first to train the model. After the first termination condition is reached, subsets A2 and B2 are read and added to the training samples, so that A1, B1, A2, and B2 are all used as training samples, and training continues until it finishes.
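A weight assignment consistent with this example can be sketched as follows. The geometric halving is an illustrative choice that happens to reproduce the weights 1 and 0.5 used above; the text itself only requires that earlier-ranked subsets receive larger weights.

```python
def subset_weights(sorted_subsets, top_weight=1.0, decay=0.5):
    """Assign a weight to each sorted subset so that earlier (higher
    relevance) positions get larger weights; decay rate is illustrative."""
    return [top_weight * decay ** rank for rank in range(len(sorted_subsets))]
```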
An example application scenario is training a vehicle part recognition model:

First, sample pictures of each part of the vehicle are obtained; the samples of one part form one category. The samples of each part are processed with the sample data classification method of the first preferred embodiment to obtain a plurality of subsets per part, which are then sorted with the method of the second preferred embodiment. The vehicle part recognition model is trained on the sorted subsets of each part. Training of the model is thus divided into multiple subtasks performed in order of difficulty, from easy to hard, so that hard training samples are not discarded; the model learns the features of the sample pictures of each part from easy to hard, which improves the adaptability of the model parameters.
As the above embodiments show, the present application classifies the training sample data of each category into a plurality of subsets by difficulty, so that samples within the same subset are close together while samples in different subsets are far apart. The denser the samples of a subset, the more similar the image features they represent and the more similar the subset's data is to the category represented by the sample data; these are simple samples, whose features the model learns easily. Conversely, the sparser the samples of a subset, the more diverse the images; such a subset is considered complex and consists of hard samples. The subsets of the training sample data are then sorted from easy to hard, so that training of the model is divided into multiple subtasks performed in order of difficulty, from easy to hard. Hard training samples are not discarded, the model learns the features of each category from easy to hard, and the adaptability of the model parameters is improved.
The above integrated unit, when implemented in the form of a software program module, may be stored in a non-volatile readable storage medium. The software program module is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods described in the embodiments of the present application.
As shown in FIG. 5, the electronic device 5 includes at least one transmitting device 51, at least one memory 52, at least one processor 53, at least one receiving device 54, and at least one communication bus. The communication bus implements connection and communication between these components.
The electronic device 5 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The electronic device 5 may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer composed of a group of loosely coupled computers.
The electronic device 5 may be, but is not limited to, any electronic product that can interact with a user through a keyboard, a touch pad, a voice control device, or the like, for example, a tablet computer, a smart phone, a Personal Digital Assistant (PDA), a smart wearable device, a camera device, a monitoring device, or another such terminal.
The network in which the electronic device 5 is located includes, but is not limited to, the Internet, a wide area network, a metropolitan area network, a local area network, a Virtual Private Network (VPN), and the like.
The receiving device 54 and the transmitting device 51 may be wired transmission ports, or may be wireless devices, for example including antenna devices, for data communication with other devices.
The memory 52 stores program code. The memory 52 may be a circuit with a storage function that has no physical form within an integrated circuit, such as a RAM (Random-Access Memory) or a FIFO (First In First Out) buffer. Alternatively, the memory 52 may be a memory with a physical form, such as a memory stick, a TF card (Trans-flash Card), a smart media card, a secure digital card, a flash memory card, or another storage device.
The processor 53 may include one or more microprocessors or digital processors. The processor 53 may call the program code stored in the memory 52 to perform the related functions. For example, the modules described in FIG. 3 are program code stored in the memory 52 and executed by the processor 53 to implement a sample data classification method, and/or the modules described in FIG. 4 are program code stored in the memory 52 and executed by the processor 53 to implement a model training method. The processor 53, also called a Central Processing Unit (CPU), is a very-large-scale integrated circuit serving as the computing core (Core) and control unit (Control Unit).
An embodiment of the present application further provides a non-volatile readable storage medium on which computer instructions are stored. When the instructions are executed by an electronic device including one or more processors, they cause the electronic device to perform the sample data classification method and/or the model training method described in the above method embodiments.
With reference to FIG. 1, the memory 52 in the electronic device 5 stores a plurality of instructions to implement a sample data classification method, and the processor 53 may execute the plurality of instructions to implement:

calculating a feature of each sample in the sample data; calculating, according to the feature of each sample, a distance set of each sample, where the distance set of a sample includes the distance between that sample and each of the remaining samples corresponding to it; calculating, according to the distance set of each sample, a density value of each sample and a density distance value of each sample; determining at least one cluster center according to the density value of each sample and the density distance value of each sample; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of the samples.
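The classification steps above can be sketched end to end. This is a minimal illustration under assumptions not stated in the text: features are already given as numeric vectors, Euclidean distance is used, and cluster centers are picked as the samples with the largest cluster metric values (the product of density value and density distance value, as in the preferred embodiments of this application). The density value here counts distances greater than the threshold, exactly as the text defines it; all function and parameter names are illustrative.

```python
import numpy as np

def classify_sample_data(features, d_threshold, n_centers):
    """Sketch of the sample data classification method: distance sets,
    density values, density distance values, cluster centers, subsets."""
    n = len(features)
    # Distance set of each sample: distances to all remaining samples.
    dist = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

    # Density value: the number of distances in the sample's distance set
    # that are greater than the distance threshold, per the text's definition.
    rho = (dist > d_threshold).sum(axis=1)

    # Density distance value: for the sample of largest density, the largest
    # distance in its distance set; for every other sample, the smallest
    # distance to any sample of larger density (on ties, earlier samples in
    # the density ordering stand in for "larger density").
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for i in range(1, n):
        delta[order[i]] = dist[order[i], order[:i]].min()

    # Cluster metric value = density value * density distance value; take
    # the top-n_centers samples by this metric as cluster centers.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:n_centers]

    # Cluster every sample to its nearest center, yielding the subsets.
    labels = centers[np.argmin(dist[:, centers], axis=1)]
    return centers, labels
```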
According to a preferred embodiment of the present application, when calculating the density value of each sample, the plurality of instructions executed by the processor further include:

comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of that sample.
According to a preferred embodiment of the present application, calculating the density distance value of each sample includes:

for the sample with the largest density value, selecting the largest distance in that sample's distance set as its density distance value;

for any sample in a second sample set, determining the samples whose density values are greater than that sample's density value; determining, according to that sample's distance set, the smallest distance from it to any of the samples of greater density, and taking that smallest distance as the sample's density distance value. The second sample set includes all samples of the sample data other than the sample with the largest density value.
According to a preferred embodiment of the present application, when determining the at least one cluster center according to the density value and the density distance value of each sample, the plurality of instructions executed by the processor further include:

calculating a cluster metric value for each sample according to its density value and its density distance value;

determining the at least one cluster center according to the cluster metric values of the samples.
According to a preferred embodiment of the present application, the cluster metric value of each sample equals the product of the sample's density value and its density distance value.
According to a preferred embodiment of the present application, when determining the at least one cluster center according to the cluster metric values, the plurality of instructions executed by the processor further include:

sorting the samples by cluster metric value from large to small and selecting the samples in the first preset number of positions as cluster center points; or

selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
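The two center-selection schemes above can be sketched side by side. A minimal illustration assuming the cluster metric values are already computed as a numeric array; function and parameter names are illustrative.

```python
import numpy as np

def pick_centers_topk(gamma, k):
    # First scheme: sort the cluster metric values from large to small and
    # take the samples in the first k positions as cluster center points.
    return np.argsort(-gamma, kind="stable")[:k]

def pick_centers_threshold(gamma, tau):
    # Second scheme: take every sample whose cluster metric value is
    # greater than the threshold tau as a cluster center point.
    return np.where(gamma > tau)[0]
```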
According to a preferred embodiment of the present application, the plurality of instructions executed by the processor further include:

determining as error samples those samples whose distance to every cluster center of the at least one cluster center exceeds a distance threshold.
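The error-sample check can be sketched as follows: a sample is flagged when even its nearest cluster center lies beyond the distance threshold. Euclidean distance and the parameter names are illustrative assumptions.

```python
import numpy as np

def find_error_samples(features, center_indices, d_threshold):
    """Flag samples whose distance to every cluster center exceeds the
    distance threshold, i.e. whose minimum center distance is too large."""
    centers = features[center_indices]  # (k, d) feature vectors of centers
    d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=-1)
    return np.where(d.min(axis=1) > d_threshold)[0]
```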
The plurality of instructions corresponding to the sample data classification method of any embodiment are stored in the memory 52 and executed by the processor 53, and are not described in detail here.
With reference to FIG. 2, the memory 52 in the electronic device 5 stores a plurality of instructions to implement a model training method, and the processor 53 may execute the plurality of instructions to implement:

obtaining sample data of each category; classifying the sample data of each category to obtain a plurality of subsets per category; calculating the relevance between each subset and the category it belongs to; sorting the subsets of each category by their relevance to the category from high to low, obtaining a plurality of sorted subsets per category; and reading, in turn, the subsets at the same sorting position from the sorted subsets of every category as training samples of the model, and training the model.
According to a preferred embodiment of the present application, among the plurality of sorted subsets, the earlier a subset's sorting position, the larger its corresponding weight.
The characteristic means of the present application described above may be implemented by an integrated circuit that controls the functions of the sample data classification method of any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: calculating a feature of each sample in the sample data; calculating, according to the feature of each sample, a distance set of each sample, where the distance set of a sample includes the distance between that sample and each of the remaining samples corresponding to it; calculating, according to the distance set of each sample, a density value of each sample and a density distance value of each sample; determining at least one cluster center according to the density values and density distance values; and clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of the samples.

All functions that the sample data classification method of any embodiment can implement can be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs those functions; this is not described in detail here.
The characteristic means of the present application described above may likewise be implemented by an integrated circuit that controls the functions of the model training method of any of the above embodiments. That is, the integrated circuit of the present application is installed in the electronic device so that the electronic device performs the following functions: obtaining sample data of each category; classifying the sample data of each category to obtain a plurality of subsets per category; calculating the relevance between each subset and the category it belongs to; sorting the subsets of each category by their relevance to the category from high to low, obtaining a plurality of sorted subsets per category; and reading, in turn, the subsets at the same sorting position from the sorted subsets of every category as training samples of the model, and training the model.

All functions that the model training method of any embodiment can implement can be installed in the electronic device through the integrated circuit of the present application, so that the electronic device performs those functions; this is not described in detail here.
It should be noted that, for brevity, the foregoing method embodiments are all described as a series of combinations of actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, because according to the present application certain steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and in actual implementation there may be other divisions; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a non-volatile readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above embodiments are only used to explain the technical solutions of the present application and do not limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (20)

1. A sample data classification method, wherein the method comprises:

    calculating a feature of each sample in sample data;

    calculating, according to the feature of each sample, a distance set of each sample, the distance set of a sample comprising the distance between that sample and each of the remaining samples corresponding to it;

    calculating, according to the distance set of each sample, a density value of each sample and a density distance value of each sample;

    determining at least one cluster center according to the density value of each sample and the density distance value of each sample;

    clustering the sample data into a plurality of subsets based on the at least one cluster center and the features of the samples.
2. The sample data classification method of claim 1, wherein calculating the density value of each sample comprises:

    comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of that sample.
3. The sample data classification method of claim 1, wherein calculating the density distance value of each sample comprises:

    for the sample with the largest density value, selecting the largest distance in that sample's distance set as its density distance value;

    for any sample in a second sample set, determining the samples whose density values are greater than that sample's density value; determining, according to that sample's distance set, the smallest distance from it to any of the samples of greater density, and taking that smallest distance as the sample's density distance value, the second sample set comprising all samples of the sample data other than the sample with the largest density value.
4. The sample data classification method of claim 1, wherein determining at least one cluster center according to the density value of each sample and the density distance value of each sample comprises:

    calculating a cluster metric value of each sample according to its density value and its density distance value;

    determining at least one cluster center according to the cluster metric values of the samples, the cluster metric value of each sample being equal to the product of the sample's density value and its density distance value.
5. The sample data classification method of claim 4, wherein determining at least one cluster center according to the cluster metric values comprises:

    sorting the samples by cluster metric value from large to small and selecting the samples in the first preset number of positions as cluster center points; or

    selecting the samples whose cluster metric values are greater than a threshold as cluster center points.
6. The sample data classification method of claim 1, wherein the method further comprises:

    determining as error samples those samples whose distance to every cluster center of the at least one cluster center exceeds a distance threshold.
7. A model training method, wherein the method comprises:

    obtaining sample data of each category;

    classifying the sample data of each category by using the sample data classification method of any one of claims 1 to 6 to obtain a plurality of subsets of each category;

    calculating the relevance between each subset of each category and the category the subset belongs to;

    sorting the subsets of each category by their relevance to the category from high to low to obtain a plurality of sorted subsets of each category;

    reading, in turn, the subsets at the same sorting position from the sorted subsets of every category as training samples of a model, and training the model.
8. The model training method according to claim 7, wherein, among the plurality of sorted subsets, a subset at a higher sorted position corresponds to a larger weight.
9. An electronic device, wherein the electronic device comprises a memory and a processor, the memory is configured to store at least one instruction, and the processor is configured to execute the at least one instruction to implement the following steps:
    calculating a feature of each sample in sample data;
    calculating a distance set of each sample according to the feature of each sample, the distance set of each sample comprising the distances between the sample and each of the remaining samples corresponding to the sample;
    calculating a density value of each sample and a density distance value of each sample according to the distance set of each sample;
    determining at least one cluster center according to the density value of each sample and the density distance value of each sample;
    clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
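The five claimed steps can be strung together in one sketch (illustrative code, not part of the patent). Euclidean distance between feature vectors and nearest-center assignment are assumptions; the density count follows claim 10's recitation of distances *greater* than the cutoff:

```python
import numpy as np

def density_peak_cluster(features, d_c, top_k):
    """End-to-end sketch: distance sets, density values, density distance
    values, cluster-center selection, and nearest-center assignment."""
    X = np.asarray(features, float)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # distance set per sample
    mask = ~np.eye(n, dtype=bool)                 # drop zero self-distances
    rho = (dist[mask].reshape(n, n - 1) > d_c).sum(axis=1)  # density per claim 10
    delta = np.empty(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]        # samples of strictly higher density
        delta[i] = dist[i].max() if len(higher) == 0 else dist[i, higher].min()
    gamma = rho * delta                           # cluster metric, claim 12
    centers = np.argsort(gamma)[::-1][:top_k]     # top-k metrics as centers, claim 13
    labels = centers[np.argmin(dist[:, centers], axis=1)]  # nearest center
    return centers.tolist(), labels.tolist()
```

On a toy one-dimensional layout with two groups, the two highest-metric samples become centers and every sample is attached to its nearest one.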
10. The electronic device according to claim 9, wherein calculating the density value of each sample comprises:
    comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
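A minimal sketch of this density computation (illustrative, not part of the patent). Note the claim counts distances *greater* than the threshold; the classic density-peaks formulation instead counts neighbors *within* the cutoff, so this sketch follows the claim text as written:

```python
import numpy as np

def density_values(dist_matrix, d_c):
    """For each sample, count the distances in its distance set that
    exceed the cutoff d_c; the claim uses this count as the density value."""
    dist_matrix = np.asarray(dist_matrix, float)
    n = dist_matrix.shape[0]
    mask = ~np.eye(n, dtype=bool)  # exclude the zero self-distance
    return (dist_matrix[mask].reshape(n, n - 1) > d_c).sum(axis=1)
```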
11. The electronic device according to claim 9, wherein calculating the density distance value of each sample comprises:
    for the sample with the largest density value, selecting the largest distance from the distance set of that sample as the density distance value of that sample;
    for any sample in a second sample set, determining the samples whose density values are greater than the density value of that sample; determining, according to the distance set of that sample, the smallest distance from that sample to the samples with greater density values, and taking that smallest distance as the density distance value of that sample, wherein the second sample set comprises all samples in the sample data other than the sample with the largest density value.
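The two cases of this claim — the highest-density sample versus everything else — can be sketched as follows (illustrative code, not part of the patent):

```python
import numpy as np

def density_distance_values(dist_matrix, rho):
    """Per the claim: the highest-density sample takes the largest distance
    in its own distance set; every other sample takes its smallest distance
    to any sample of strictly higher density."""
    dist_matrix = np.asarray(dist_matrix, float)
    rho = np.asarray(rho)
    delta = np.empty(len(rho))
    for i in range(len(rho)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = dist_matrix[i].max() if len(higher) == 0 else dist_matrix[i, higher].min()
    return delta
```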
12. The electronic device according to claim 9, wherein determining the at least one cluster center according to the density value of each sample and the density distance value of each sample comprises:
    calculating a cluster metric value of each sample according to the density value of each sample and the density distance value of each sample;
    determining the at least one cluster center according to the cluster metric value of each sample, wherein the cluster metric value of each sample is equal to the product of the density value of the sample and the density distance value of the sample.
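The product the claim defines is a one-liner (illustrative sketch, not part of the patent); a large value marks a sample that is both dense and far from any denser sample, which is what makes it a good center candidate:

```python
import numpy as np

def cluster_metric(rho, delta):
    # gamma_i = rho_i * delta_i, the cluster metric value of sample i
    return np.asarray(rho, float) * np.asarray(delta, float)
```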
13. The electronic device according to claim 12, wherein determining the at least one cluster center according to the cluster metric value of each sample comprises:
    sorting the cluster metric values of the samples in descending order, and selecting, from the sorted cluster metric values, the samples ranked within a preset number of top positions as cluster center points;
    selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
14. The electronic device according to claim 9, wherein the processor is further configured to execute the at least one instruction to implement the following step:
    determining, as an error sample, any sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold.
15. A non-volatile readable storage medium, wherein the non-volatile readable storage medium stores at least one instruction, and the at least one instruction, when executed by a processor, implements the following steps:
    calculating a feature of each sample in sample data;
    calculating a distance set of each sample according to the feature of each sample, the distance set of each sample comprising the distances between the sample and each of the remaining samples corresponding to the sample;
    calculating a density value of each sample and a density distance value of each sample according to the distance set of each sample;
    determining at least one cluster center according to the density value of each sample and the density distance value of each sample;
    clustering the sample data into a plurality of subsets based on the at least one cluster center and the feature of each sample.
16. The storage medium according to claim 15, wherein calculating the density value of each sample comprises:
    comparing each distance in the distance set of each sample with a distance threshold, determining the number of distances greater than the distance threshold, and taking the number of distances corresponding to each sample as the density value of the sample.
17. The storage medium according to claim 15, wherein calculating the density distance value of each sample comprises:
    for the sample with the largest density value, selecting the largest distance from the distance set of that sample as the density distance value of that sample;
    for any sample in a second sample set, determining the samples whose density values are greater than the density value of that sample; determining, according to the distance set of that sample, the smallest distance from that sample to the samples with greater density values, and taking that smallest distance as the density distance value of that sample, wherein the second sample set comprises all samples in the sample data other than the sample with the largest density value.
18. The storage medium according to claim 15, wherein determining the at least one cluster center according to the density value of each sample and the density distance value of each sample comprises:
    calculating a cluster metric value of each sample according to the density value of each sample and the density distance value of each sample;
    determining the at least one cluster center according to the cluster metric value of each sample, wherein the cluster metric value of each sample is equal to the product of the density value of the sample and the density distance value of the sample.
19. The storage medium according to claim 18, wherein determining the at least one cluster center according to the cluster metric value of each sample comprises:
    sorting the cluster metric values of the samples in descending order, and selecting, from the sorted cluster metric values, the samples ranked within a preset number of top positions as cluster center points;
    selecting, according to the cluster metric value of each sample, the samples whose cluster metric values are greater than a threshold as cluster center points.
20. The storage medium according to claim 15, wherein the at least one instruction, when executed by the processor, further implements the following step:
    determining, as an error sample, any sample whose distance from each cluster center of the at least one cluster center exceeds a distance threshold.
PCT/CN2018/100157 2018-04-18 2018-08-13 Sample data classification method, model training method, electronic device and storage medium WO2019200782A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810350730.2A CN108595585B (en) 2018-04-18 2018-04-18 Sample data classification method, model training method, electronic equipment and storage medium
CN201810350730.2 2018-04-18

Publications (1)

Publication Number Publication Date
WO2019200782A1 2019-10-24

Family

ID=63613517

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/100157 WO2019200782A1 (en) 2018-04-18 2018-08-13 Sample data classification method, model training method, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN108595585B (en)
WO (1) WO2019200782A1 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109378003B (en) * 2018-11-02 2021-10-01 科大讯飞股份有限公司 Method and system for training voiceprint model
CN109299279B (en) * 2018-11-29 2020-08-21 奇安信科技集团股份有限公司 Data processing method, device, system and medium
CN109508750A (en) * 2018-12-03 2019-03-22 斑马网络技术有限公司 The clustering method of user origin and destination, device and storage medium
CN109671007A (en) * 2018-12-27 2019-04-23 沈阳航空航天大学 Taxi taking difficulty assessment method near a kind of railway station based on multi-dimensional data
CN109856307B (en) * 2019-03-27 2021-04-16 大连理工大学 Metabolic component molecular variable comprehensive screening technology
CN109993234B (en) * 2019-04-10 2021-05-28 百度在线网络技术(北京)有限公司 Unmanned driving training data classification method and device and electronic equipment
CN110141226B (en) * 2019-05-29 2022-03-15 清华大学深圳研究生院 Automatic sleep staging method and device, computer equipment and computer storage medium
CN111079830A (en) * 2019-12-12 2020-04-28 北京金山云网络技术有限公司 Target task model training method and device and server
CN111414952B (en) * 2020-03-17 2023-10-17 腾讯科技(深圳)有限公司 Noise sample recognition method, device, equipment and storage medium for pedestrian re-recognition
CN112131362B (en) * 2020-09-22 2023-12-12 腾讯科技(深圳)有限公司 Dialogue sentence generation method and device, storage medium and electronic equipment
CN112132239B (en) * 2020-11-24 2021-03-16 北京远鉴信息技术有限公司 Training method, device, equipment and storage medium
CN112884040B (en) * 2021-02-19 2024-04-30 北京小米松果电子有限公司 Training sample data optimization method, system, storage medium and electronic equipment
CN113035347A (en) * 2021-03-15 2021-06-25 武汉中旗生物医疗电子有限公司 Electrocardio data disease identification method, device and storage medium
CN112990337B (en) * 2021-03-31 2022-11-29 电子科技大学中山学院 Multi-stage training method for target identification
CN113837000A (en) * 2021-08-16 2021-12-24 天津大学 Small sample fault diagnosis method based on task sequencing meta-learning
CN115979891B (en) * 2023-03-16 2023-06-23 中建路桥集团有限公司 Detection method for high-pressure liquid-gas mixed fluid jet crushing and solidified clay

Citations (3)

Publication number Priority date Publication date Assignee Title
US20140270495A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Multiple Cluster Instance Learning for Image Classification
CN105653598A (en) * 2015-12-22 2016-06-08 北京奇虎科技有限公司 Related news determination method and device
CN106874923A (en) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 A kind of genre classification of commodity determines method and device

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN106447676B (en) * 2016-10-12 2019-01-22 浙江工业大学 A kind of image partition method based on fast density clustering algorithm
CN107180075A (en) * 2017-04-17 2017-09-19 浙江工商大学 The label automatic generation method of text classification integrated level clustering

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
US20140270495A1 (en) * 2013-03-14 2014-09-18 Microsoft Corporation Multiple Cluster Instance Learning for Image Classification
CN106874923A (en) * 2015-12-14 2017-06-20 阿里巴巴集团控股有限公司 A kind of genre classification of commodity determines method and device
CN105653598A (en) * 2015-12-22 2016-06-08 北京奇虎科技有限公司 Related news determination method and device

Also Published As

Publication number Publication date
CN108595585B (en) 2019-11-12
CN108595585A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
WO2019200782A1 (en) Sample data classification method, model training method, electronic device and storage medium
US10438091B2 (en) Method and apparatus for recognizing image content
WO2019200781A1 (en) Receipt recognition method and device, and storage medium
US11238310B2 (en) Training data acquisition method and device, server and storage medium
WO2019119505A1 (en) Face recognition method and device, computer device and storage medium
JP6144839B2 (en) Method and system for retrieving images
US10013637B2 (en) Optimizing multi-class image classification using patch features
CN107209861A (en) Use the data-optimized multi-class multimedia data classification of negative
CN109817339B (en) Patient grouping method and device based on big data
WO2020114108A1 (en) Clustering result interpretation method and device
CN111401339B (en) Method and device for identifying age of person in face image and electronic equipment
CN109840413B (en) Phishing website detection method and device
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
US10423817B2 (en) Latent fingerprint ridge flow map improvement
CN111291817A (en) Image recognition method and device, electronic equipment and computer readable medium
CN114359738A (en) Cross-scene robust indoor population wireless detection method and system
CN110287311A (en) File classification method and device, storage medium, computer equipment
CN111401343B (en) Method for identifying attributes of people in image and training method and device for identification model
Shoohi et al. DCGAN for Handling Imbalanced Malaria Dataset based on Over-Sampling Technique and using CNN.
Liong et al. Automatic traditional Chinese painting classification: A benchmarking analysis
CN107944363A (en) Face image processing process, system and server
WO2022088411A1 (en) Image detection method and apparatus, related model training method and apparatus, and device, medium and program
CN112632000A (en) Log file clustering method and device, electronic equipment and readable storage medium
CN110414562B (en) X-ray film classification method, device, terminal and storage medium
JP7341962B2 (en) Learning data collection device, learning device, learning data collection method and program

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18915611

Country of ref document: EP

Kind code of ref document: A1