CN111507380B

CN111507380B - Picture classification method, system, device and storage medium based on clustering

Info

Publication number: CN111507380B
Application number: CN202010237384.4A
Authority: CN
Inventors: 王淦
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2020-03-30
Filing date: 2020-03-30
Publication date: 2023-10-31
Anticipated expiration: 2040-03-30
Also published as: CN111507380A

Abstract

The invention provides a clustering-based picture classification method, which is applied to an electronic device and comprises the following steps: acquiring all sample pictures within a preset time to establish a sample database; acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; preprocessing the model training data set, and acquiring weight values of all samples in the model training data set through preprocessing; according to the weight value of each sample in the model training data set, an effective classification model is established by adopting a preset classifier; and classifying the pictures to be classified according to the effective classification model. The invention provides a clustering-based picture classification method which can effectively avoid the occurrence of wrong classification in the picture classification process.

Description

Picture classification method, system, device and storage medium based on clustering

Technical Field

The present invention relates to the field of image recognition technologies, and in particular, to a method and apparatus for classifying images based on clustering, and a computer readable storage medium.

Background

In the technical field of picture identification, an effective classification model is often used to judge whether a sample picture is truly effective. Traditionally, this classification is carried out manually; for some special samples, however, validity cannot be judged manually, so staff decide more or less arbitrarily based on experience, which gives customers a very poor experience. The effective classification model has therefore become a necessary link in the field of sample classification. The effective classification model outputs the probability that a sample is effective, and if this probability is higher than a certain threshold, the sample is judged to be effective. The effective classification model must be trained on original samples, including both effective and ineffective samples that have been determined manually.

However, conventional effective classification models typically use classification or outlier detection. An effective classification model of the traditional classification form is generally trained using manually determined positive samples and negative samples as sample data, so that newly acquired samples can later be classified and recognized. In a historical database, however, not every sample can be judged manually to be effective or not, and because of this uncertainty such samples cannot be used when training the traditional effective classification model. The traditional effective classification model therefore does not use all samples in the historical database during training, which limits its classification performance.

Based on the above-mentioned problems, there is a need for a high-precision picture classification method capable of modeling using all existing sample pictures.

Disclosure of Invention

The invention provides a clustering-based picture classification method, an electronic device and a computer storage medium, and mainly aims to solve the problem that, when a traditional effective classification model is used to classify pictures, the precision is low and mistakes occur easily.

In order to achieve the above object, the present invention provides a cluster-based picture classification method applied to an electronic device, wherein the method includes:

acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; the unlabel sample is a sample picture for which it cannot be determined whether it is a positive sample or a negative sample;

dividing positive samples in the model training set into at least a plurality of groups according to different scenes, acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the divided groups of the positive samples, and then carrying out weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set according to preset rules; wherein the potential positive samples have a greater probability of being positive samples and the pure negative samples have a greater probability of being negative samples;

according to the weight value of each sample in the model training data set, establishing an effective classification model by adopting a preset classifier;

and classifying the pictures to be classified according to the effective classification model.

Preferably, the process of obtaining the positive sample, the negative sample and the unlabel sample from the sample database includes:

extracting characteristic information of all sample pictures in the sample database to obtain various characteristic information of the sample pictures, wherein the sample pictures are pictures captured by electronic traffic-violation cameras; the characteristic information at least comprises: the distance between the wheels of the target automobile and the solid line in the sample picture, whether the target automobile runs a red light, the distance between the target automobile and nearby pedestrians, and whether the automobile is in a reverse running state;

judging whether a target automobile in a sample picture violates regulations or not according to the characteristic information through a preset judging rule, and if so, recording the sample picture as a positive sample; if the target automobile is judged not to violate regulations, the sample picture is recorded as a negative sample, and if the target automobile in the sample picture cannot be accurately judged to be violated regulations according to the preset judging rule, the sample picture is recorded as an unlabel sample;

and acquiring all positive samples, negative samples and unlabel samples from the sample database.

Preferably, the process of dividing the positive samples in the model training set into at least a plurality of families according to different scenarios comprises:

normalizing the characteristic values of the characteristic information of the positive sample by using a min-max method so as to normalize the characteristic values of all the characteristic information of the positive sample to be between 0 and 1;

the positive samples are subjected to a clustering process using a k-means algorithm to divide the positive samples within the model training set into at least a plurality of families.

Preferably, the process of performing the clustering processing on the positive samples by using a k-means algorithm includes:

K positive samples are selected as initial clustering centers according to different scenes, wherein the minimum value of K is 20;

calculating the distance from each positive sample to each cluster center, and assigning each positive sample to the cluster center closest to it, wherein a cluster center and all positive samples assigned to it together form a cluster;

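wherein the formula for calculating the distance is as follows:

$$\mathrm{dist}(x_i, x_u) = \sum_{j=1}^{d} (x_{ij} - x_{uj})^2$$

where x_i is a positive sample, x_u is a cluster center, d is the dimension of the positive sample, and x_{ij} and x_{uj} are the j-th normalized feature values of x_i and x_u.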

Preferably, after each positive sample is assigned to its nearest cluster center, the method further comprises: each time a positive sample is assigned to its cluster center, the cluster center of that cluster is recalculated from the positive samples currently in the cluster, wherein the calculation formula is as follows:

$$u_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

wherein u_i represents the center point of the i-th cluster, c_i represents the i-th cluster, |c_i| is the number of samples in it, and x represents the samples belonging to the i-th cluster;

until the loss function of the set of positive samples reaches a minimum;

wherein the expression of the loss function is as follows:

$$J = \sum_{i=1}^{K} \sum_{x \in c_i} \lVert x - u_i \rVert^2$$

where u_i denotes the center point of the i-th cluster, c_i denotes the i-th cluster, and K is the number of clusters.

Preferably, the process of acquiring the potential positive samples and the pure negative samples in the unlabel samples in the model training data set based on the random forest algorithm according to the divided positive sample family comprises the following steps:

computing an independent score for each unlabel sample by using an independent random forest algorithm, computing an approximate score for each unlabel sample according to the clusters of the positive samples, and taking the sum of the independent score and the approximate score as the total score;

calculating the average score of the positive samples and judging whether the total score of an unlabel sample is larger than that average score; if the total score of the unlabel sample is larger than the average score of the positive samples, recording the unlabel sample as a potential positive sample, and if the total score of the unlabel sample is smaller than a preset hyperparameter β, recording the unlabel sample as a pure negative sample, wherein the maximum value of the preset hyperparameter β is 0.2.

In addition, in order to achieve the above object, the present invention further provides a cluster-based picture classification system, which includes:

the sample picture acquisition unit is used for acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation;

the training data set establishing unit is used for acquiring a positive sample, a negative sample and an unlabel sample from the sample database and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample; the unlabel sample is a sample picture for which it cannot be determined whether it is a positive sample or a negative sample;

the preprocessing unit is used for dividing positive samples in the model training set into at least a plurality of groups according to different scenes, acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the groups of the divided positive samples, and then carrying out weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set according to preset rules; wherein the potential positive samples have a greater probability of being positive samples and the pure negative samples have a greater probability of being negative samples;

The effective classification model building and applying unit is used for building an effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set; and classifying the pictures to be classified according to the effective classification model.

In addition, in order to achieve the above object, the present invention further provides a method for classifying pictures based on clustering, the method comprising:

acquiring a positive sample, a negative sample and an unlabel sample from the sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample;

the unlabel sample is a sample picture for which it cannot be determined whether it is a positive sample or a negative sample;

In addition, in order to achieve the above object, the present invention further provides a computer readable storage medium, in which a cluster-based picture classification program is stored, which when executed by a processor, implements the steps of the foregoing cluster-based picture classification method.

The invention provides a clustering-based picture classification method, a clustering-based picture classification system, an electronic device and a computer-readable storage medium, wherein a sample picture is firstly divided into a positive sample, a negative sample and an unlabel sample by a manual or traditional picture identification method; then preprocessing the model training data set, acquiring the weight value of each sample in the model training data set through preprocessing, then establishing a high-precision effective classification model by adopting a preset classifier according to the weight value of each sample in the model training data set, and finally classifying the pictures to be classified with high precision through the effective classification model, so that the occurrence of wrong classification in the picture classification process can be effectively avoided.

Drawings

FIG. 1 is a schematic diagram of an electronic device according to an embodiment of the invention;

FIG. 2 is a flowchart of a preferred embodiment of a cluster-based picture classification method according to an embodiment of the present invention;

FIG. 3 is a flow chart of a preferred embodiment of a preprocessing process for model training data sets according to an embodiment of the present invention;

FIG. 4 is an internal logic diagram of a cluster-based picture classification program according to an embodiment of the present invention.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

The invention provides a clustering-based picture classification method, which is applied to an electronic device 70. Referring to fig. 1, a schematic structure of an electronic device 70 according to a preferred embodiment of the invention is shown.

In this embodiment, the electronic device 70 may be a terminal device with an operation function, such as a server, a smart phone, a tablet computer, a portable computer, or a desktop computer.

The electronic device 70 includes: a processor 71 and a memory 72.

The memory 72 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, a card memory, etc. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 70, such as a hard disk of the electronic device 70. In other embodiments, the readable storage medium may also be an external memory of the electronic device 70, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card or the like provided on the electronic device 70.

In the present embodiment, the readable storage medium of the memory 72 is generally used to store a cluster-based picture classification program 73 installed on the electronic device 70. The memory 72 may also be used to temporarily store data that has been output or is to be output.

The processor 71 may in some embodiments be a central processing unit (CPU), a microprocessor or another data processing chip, used for running program code or processing data stored in the memory 72, for example executing the cluster-based picture classification program 73.

In some embodiments, the electronic device 70 is a terminal device such as a smart phone, a tablet computer or a portable computer. In other embodiments, the electronic device 70 may be a server.

Fig. 1 shows only an electronic device 70 having components 71-73, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.

Optionally, the electronic device 70 may further comprise a user interface, which may comprise an input unit such as a Keyboard (Keyboard), a voice input device such as a microphone or the like with voice recognition function, a voice output device such as a sound box, a headset or the like, and optionally a standard wired interface, a wireless interface.

Optionally, the electronic device 70 may also include a display, which may also be referred to as a display screen or display unit. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-control liquid crystal display, an Organic Light-Emitting Diode (OLED) touch device, or the like. The display is used to display information processed in the electronic device 70 and to display a visual user interface.

Optionally, the electronic device 70 may also include a touch sensor. The area provided by the touch sensor for a user to perform a touch operation is referred to as a touch area. Further, the touch sensor herein may be a resistive touch sensor, a capacitive touch sensor, or the like. The touch sensor may include not only a contact type touch sensor but also a proximity type touch sensor. Further, the touch sensor may be a single sensor or may be a plurality of sensors arranged in an array, for example.

The area of the display of the electronic device 70 may be the same as or different from the area of the touch sensor. Optionally, a display is layered with the touch sensor to form a touch display screen. The device detects a touch operation triggered by a user based on a touch display screen.

Optionally, the electronic device 70 may further include Radio Frequency (RF) circuitry, sensors, audio circuitry, etc., which are not described herein.

In addition, FIG. 2 is a flowchart of a preferred embodiment of the cluster-based picture classification method of the present invention. As shown in FIG. 1 and FIG. 2 together, in the device embodiment shown in FIG. 1, the memory 72, as a computer storage medium, may include an operating system and the cluster-based picture classification program 73; the processor 71, when executing the cluster-based picture classification program 73 stored in the memory 72, performs the following steps:

S110: acquiring all sample pictures within a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of sample picture generation; in order to ensure the precision of the model established later, the total number of sample pictures within the preset time period is at least 10000.

In order to illustrate the content of the invention more clearly, the classification of electronic traffic-violation photos is selected as a specific embodiment. With the development of the transportation industry, the number of automobiles of all kinds has increased, and traffic violations have become correspondingly more common. In order to detect violating vehicles on the road in time, electronic violation cameras are installed along the streets of the city, and whether the car in a picture sample commits a violation is judged manually or by a traditional effective classification model. However, when a traditional effective classification model is used for this judgment, missed judgments and misjudgments often occur, which hinders people's travel to a certain extent. Therefore, the invention adopts a clustering-based picture classification method to classify electronic violation photos.

It should be noted that the sample pictures are pictures captured by cameras on the road when a car is suspected of a violation, and various information for judging whether the car in the picture commits a violation can be obtained from them, for example, the behavior of the driver's hands, the distance between the car's wheels and the solid line, whether the car runs a red light, the positional relationship between the car and nearby pedestrians, whether the car is driving in reverse, and the like.

In addition, the violations that tend to occur differ slightly between camera locations. For example, running a red light tends to occur at intersections; drivers not fastening their seat belts or using a mobile phone tends to occur on mid-road sections; and failing to yield to pedestrians tends to occur at school gates. Therefore, the sample pictures can be grouped in advance according to the specific location of the electronic violation camera and pre-classified by this grouping, so that a multi-level sample database is built. When the required positive samples, negative samples and unlabel samples are later acquired from the sample database, the same number of positive samples, negative samples and unlabel samples can be acquired from each group. In this way, the acquired sample types are more balanced, which improves the classification accuracy of the effective classification model established later.
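The grouped acquisition described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the function name `sample_balanced`, the record layout (dictionaries with `group` and `label` keys), the group names and the count `n_per_label` are all assumptions introduced here.

```python
import random
from collections import defaultdict

def sample_balanced(records, n_per_label, seed=0):
    """Draw up to n_per_label pictures of each label from every location group."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec["group"], rec["label"])].append(rec)
    selected = []
    for key in sorted(buckets):
        pool = buckets[key]
        # Take the same number from every (group, label) bucket when possible.
        selected.extend(rng.sample(pool, min(n_per_label, len(pool))))
    return selected

# Hypothetical multi-level sample database: 3 camera locations x 3 labels.
GROUPS = ("intersection", "mid_road", "school_gate")
LABELS = ("positive", "negative", "unlabel")
records = []
for group in GROUPS:
    for label in LABELS:
        for _ in range(5):
            records.append({"id": len(records), "group": group, "label": label})

training_set = sample_balanced(records, n_per_label=2)
```

Drawing per (group, label) bucket rather than from the whole database is what keeps the resulting training set balanced across camera locations.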

S120: acquiring a positive sample, a negative sample and an unlabel sample from a sample database, and establishing a model training data set according to the acquired positive sample, negative sample and unlabel sample;

The positive sample is a sample picture preliminarily confirmed to show a violation, the negative sample is a sample picture preliminarily confirmed to show no violation, and the unlabel sample is a sample picture for which a violation can be neither confirmed nor ruled out in the preliminary judgment.

It should be noted that, because the sample pictures are pictures of possible car violations captured by cameras on the road, the criterion for judging a sample picture is whether the car in the picture commits a violation, for example, whether the car runs a red light, whether it is parked illegally, whether the driver is smoking, whether the driver is making a call or using a mobile phone, whether the driver and front passenger have fastened their seat belts, whether the car yields to pedestrians, whether the car fails to switch off its lights when meeting another vehicle, and the like. If a violation is determined, the sample picture is marked as a positive sample; if no violation is determined, the sample picture is marked as a negative sample. However, in the actual photographing process, factors such as the speed of the car, weather conditions (haze, rain, snow, etc.) and real-time lighting conditions may leave the captured sample picture unclear or the relevant markers (such as seat belts and license plates) occluded, so that whether the sample picture shows a violation cannot be judged by traditional picture recognition technology; in that case, the picture is marked as an unlabel sample.

Specifically, the preliminary judgment of whether a sample picture shows a violation can be carried out manually, or the sample picture can be judged by a traditional picture identification method. If the judgment is performed manually, whether a violation exists in the sample picture is judged by visual inspection.

If the sample picture is judged by adopting a traditional picture identification method, extracting characteristic information of all the sample pictures to obtain various characteristic information of the sample picture, wherein the characteristic information comprises: the distance between the wheels of the target automobile and the solid line, whether the target automobile runs a red light, the distance between the target automobile and nearby pedestrians, whether the automobile is in a reverse running state and the like in the sample picture.

Judging whether the target automobile in the sample picture violates regulations or not according to the characteristic information through a preset judging rule, and if so, recording the sample picture as a positive sample; if the target automobile is judged not to violate regulations, the sample picture is recorded as a negative sample, and if the target automobile in the sample picture cannot be accurately judged to be illegal according to a preset judging rule, the sample picture is recorded as an unlabel sample.

The preset determination rule is related to the rule-breaking behavior, for example, whether the distance between the wheels of the target automobile and the solid line is zero, whether the characteristic value of the target automobile running the red light is 1 (the characteristic value is 1, namely the occurrence of an event), whether the characteristic value of the automobile driving in reverse is 1, and the like.
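A minimal sketch of such a preset determination rule is given below. The feature names, the use of `None` for a feature that could not be extracted, and the policy that any single triggered check makes the picture a positive sample are assumptions made for illustration; the source does not specify how the individual checks are combined.

```python
def label_picture(features):
    """Return 'positive', 'negative' or 'unlabel' for one sample picture.

    `features` maps feature names to values; None marks a feature that
    could not be extracted (for example, a blurred or occluded picture).
    """
    # Hypothetical checks, one per violation type named in the text.
    checks = [
        ("wheel_to_solid_line_distance", lambda v: v == 0),
        ("ran_red_light", lambda v: v == 1),
        ("reverse_driving", lambda v: v == 1),
    ]
    undetermined = False
    for name, is_violation in checks:
        value = features.get(name)
        if value is None:
            undetermined = True       # this check cannot be evaluated
        elif is_violation(value):
            return "positive"         # a violation is confirmed
    # No violation found: negative only if every check could be evaluated.
    return "unlabel" if undetermined else "negative"

print(label_picture({"wheel_to_solid_line_distance": 0.0,
                     "ran_red_light": 0, "reverse_driving": 0}))
```

Under these assumptions, a picture with a missing feature and no confirmed violation falls into the unlabel class instead of being forced into the negative class.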

S130: preprocessing the model training data set, and obtaining the weight value of each sample in the model training data set through preprocessing.

In a specific implementation of the present invention, FIG. 3 is a flowchart of a preferred example of the preprocessing procedure for the model training data set according to an embodiment of the present invention. As can be seen from FIG. 3, the preprocessing procedure includes:

S131: There are significant differences between the positive samples within the model training data set; for example, some positive samples are determined by a car running a red light, some by the driver not fastening the seat belt, some by a car crossing a solid line, and so on. Thus, the positive samples cannot simply be treated as one class. The characteristic information of positive samples from different violation scenes (such as running a red light) often differs greatly, whereas the characteristic information of positive samples from the same violation scene differs relatively little; for example, the distance between the car's wheels and the solid line takes clearly different values in different violation scenes. Thus, the positive samples can be divided into multiple clusters (at least 20) according to their violation scenes, each cluster containing similar positive samples.

S132: For the unlabel samples, because a potential positive sample is quite different from the negative samples and similar to the known positive samples, the potential positive samples and pure negative samples among the unlabel samples in the model training data set are obtained according to the families of the positive samples and based on a random forest algorithm, wherein a potential positive sample has a high probability of being a positive sample, and a pure negative sample has a high probability of being a negative sample.

It should be noted that, since the purpose of the present solution is to obtain an effective classification model and the negative samples are non-violation samples that have been confirmed manually, the negative samples need not be processed.

S133: and carrying out weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set according to a preset rule.

Specifically, in step S131, firstly, the min-max method is used to normalize the feature values of the feature information of the positive sample, so that the feature values of all the feature information of the positive sample can be normalized to be between 0 and 1, and the problem that the feature values cannot be used in the same formula at the same time due to different dimensions is avoided. It should be noted that, since the normalization method is the prior art, the description is not repeated here.
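For reference, the min-max normalization mentioned above can be sketched as follows. The function name and the handling of a constant feature column (mapped to 0.0) are assumptions made here, since the source does not address that case.

```python
def min_max_normalize(rows):
    """Scale every feature column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

# Hypothetical feature vectors: [distance to solid line, distance to pedestrian]
rows = [[0.0, 10.0], [2.0, 30.0], [4.0, 20.0]]
scaled = min_max_normalize(rows)
print(scaled)  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After this step, features measured in different units contribute on the same scale to the distance used for clustering.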

Secondly, K positive samples are randomly selected as initial clustering centers, wherein the minimum value of K is 20.

Then, the distance between each positive sample and each cluster center is calculated, and each positive sample is assigned to the cluster center closest to it, wherein a cluster center and all positive samples assigned to it together form a cluster. The formula for calculating the distance is as follows:

$$\mathrm{dist}(x_i, x_u) = \sum_{j=1}^{d} (x_{ij} - x_{uj})^2$$

In the formula, x_i is a positive sample, x_u is a cluster center, d is the dimension of the positive sample, and x_{ij} and x_{uj} are the j-th normalized feature values of x_i and x_u.

In essence, the above formula takes the sum of the squared differences between all normalized feature values of a given positive sample and those of the cluster center as the distance between the two points.

In order to improve the accuracy of the cluster centers, after each positive sample is assigned to its nearest cluster center, the cluster center of that cluster is recalculated from the samples currently in the cluster, wherein the calculation formula is as follows:

$$u_i = \frac{1}{|c_i|} \sum_{x \in c_i} x$$

In the above formula, u_i represents the center point of the i-th cluster, c_i represents the i-th cluster, |c_i| is the number of samples in it, and x represents a sample belonging to the cluster. The center point of the cluster is first computed and then used as the new cluster center, so that the cluster center is refreshed and its accuracy is improved.

Finally, the above steps are repeated until the loss function of the whole positive sample set (including all positive samples) reaches a minimum, wherein the expression of the loss function is:

$$J = \sum_{i=1}^{K} \sum_{x \in c_i} \lVert x - u_i \rVert^2$$

In the above formula, u_i represents the center point of the i-th cluster, c_i represents the i-th cluster, and K is the number of clusters. The sum of the distances from each sample in a cluster to its cluster center is calculated first, and the sum of these values over all clusters is then taken as the loss function; by minimizing the loss function, the accuracy of the positive sample cluster centers can be further improved.

It should be emphasized that the clusters mentioned in the above clustering process are the families referred to in step S131.
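The clustering procedure of step S131 can be sketched as follows, assuming the positive samples are already normalized. Two simplifications are assumptions of this sketch rather than statements of the source: the centers are recomputed once per pass over all samples (the usual batch form of k-means) instead of after every single assignment, and the toy example uses K = 2 for readability although the text requires K to be at least 20. The function names are hypothetical.

```python
import random

def squared_distance(a, b):
    """Sum of squared differences between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def k_means(samples, initial_centers, max_iter=100):
    """Cluster `samples`; return (centers, assignment, loss)."""
    centers = [list(c) for c in initial_centers]
    k = len(centers)
    assignment = None
    for _ in range(max_iter):
        # Assign every sample to its nearest cluster center.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(x, centers[i]))
            for x in samples
        ]
        if new_assignment == assignment:
            break  # assignments are stable, so the loss no longer decreases
        assignment = new_assignment
        # Recompute each center as the mean of the samples assigned to it.
        for i in range(k):
            members = [x for x, a in zip(samples, assignment) if a == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    loss = sum(squared_distance(x, centers[a]) for x, a in zip(samples, assignment))
    return centers, assignment, loss

# Hypothetical normalized positive samples forming two obvious groups.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
           [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
initial = random.Random(0).sample(samples, 2)  # randomly chosen initial centers
centers, assignment, loss = k_means(samples, initial)
```

Because the result depends on the randomly chosen initial centers, k-means may stop in a local minimum of the loss; rerunning with different initial centers and keeping the lowest loss is a common remedy.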

In step S132, the process of obtaining the potential positive samples and the pure negative samples among the unlabel samples in the model training data set, according to the families of the positive samples and based on a random forest algorithm, includes:

An independent score is computed for each unlabel sample using an independent random forest algorithm, an approximate score is computed for each unlabel sample according to the clusters of the positive samples, and the sum of the independent score and the approximate score is taken as the total score.

The average score of the positive samples is calculated, and the total score of each unlabel sample is compared with it. If the total score of an unlabel sample is larger than the average score of the positive samples, the unlabel sample is defined as a potential positive sample; an unlabel sample whose total score is smaller than a preset hyperparameter β is defined as a pure negative sample. The value of β is set manually, and the smaller the value, the more reliable the selected samples; the maximum value of β set by the invention is 0.2.

Specifically, since the positive samples among the unlabel samples are few and very different from each other, an independent random forest can be used to give independent scores to the unlabel samples: positive samples are separated after fewer splits from the root, whereas negative samples require more splits. To obtain an independent score for each sample point, the sample is passed down from the root of each tree until a leaf node is reached, which gives its path length in that tree; the average path length over the independent random forest can then be calculated. Based on the average path length, an independent score IS(x) can be calculated, which describes the probability that the sample is a positive sample.

IS(x) = E(h(x))

where h(x) represents the path length of x in a tree. The higher the IS(x) score, the higher the likelihood that x is a positive sample.
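The average path length E(h(x)) can be illustrated with a small sketch in the style of an isolation forest, which the description above closely resembles; that correspondence is an assumption, as are all function names and defaults below. The sketch follows a single sample down freshly drawn random splits and does not build and store whole trees, and it omits any subsampling or normalization. How the average path length is scaled into the final score is not specified beyond the formula above, so only E(h(x)) is computed.

```python
import random


def path_length(point, data, rng, depth=0, max_depth=10):
    """Number of random splits applied before `point` is isolated in `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))  # pick a random feature
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:  # all values equal: no further split is possible
        return depth
    split = rng.uniform(lo, hi)  # pick a random split value
    # Keep only the side of the split that contains `point`.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def average_path_length(point, data, n_trees=100, seed=0):
    """Estimate E(h(x)) by averaging the path length over random trees."""
    rng = random.Random(seed)
    total = sum(path_length(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# A sample far from the rest is isolated after fewer splits on average.
data = [[i / 10] for i in range(10)] + [[100.0]]
far_sample = average_path_length([100.0], data)
near_sample = average_path_length([0.5], data)
```

In this toy data set the sample at 100.0 is usually separated by the very first split, while the sample at 0.5 always needs at least two, which matches the observation that rare, atypical samples are separated after fewer splits.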

On the other hand, the closer a sample is to the cluster center of known positive samples, the more likely it is to be a potential positive sample. The approximate score SS(x) of an unlabel sample x with respect to its nearest cluster center is therefore calculated as follows:

To filter the potential positive samples and pure negative samples from the unlabel samples, the independent score and the approximate score must be considered together. The total score is calculated as follows:

TS(x) = θ·IS(x) + (1-θ)·SS(x)

where θ takes a value in the range [0, 1] and defaults to 0.5; it is used to balance the importance of the independent score and the approximate score.

Specifically, α is calculated, where α represents the average score of the known positive samples. Samples with a TS(x) value greater than α are taken as potential positive samples, and samples with a TS(x) value less than β are taken as pure negative samples. β is a hyperparameter; the smaller its value, the more reliable the selected samples.
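The scoring and filtering rule can be sketched as below. The function names are hypothetical, and the sketch assumes that the independent and approximate scores have already been scaled to a comparable range such as [0, 1], which the text does not state. The default θ = 0.5 and the threshold β = 0.2 (the stated maximum) are taken from the description.

```python
def total_score(is_score, ss_score, theta=0.5):
    """TS(x) = theta * IS(x) + (1 - theta) * SS(x)."""
    return theta * is_score + (1 - theta) * ss_score


def select_samples(unlabel_scores, positive_scores, beta=0.2):
    """Split unlabel samples by total score.

    Returns (alpha, indices of potential positives, indices of pure negatives).
    """
    # alpha is the average total score of the known positive samples.
    alpha = sum(positive_scores) / len(positive_scores)
    potential_positive = [i for i, ts in enumerate(unlabel_scores) if ts > alpha]
    pure_negative = [i for i, ts in enumerate(unlabel_scores) if ts < beta]
    return alpha, potential_positive, pure_negative


# Illustrative scores only; these numbers are not from the source.
positive_scores = [0.5, 0.75, 1.0]
unlabel_scores = [
    total_score(1.0, 0.75),  # 0.875, above alpha
    total_score(0.25, 0.0),  # 0.125, below beta
    total_score(0.5, 0.5),   # 0.5, neither
]
alpha, potential_positive, pure_negative = select_samples(
    unlabel_scores, positive_scores
)
```

Unlabel samples whose total score falls between β and α are placed in neither group, which follows from the two thresholds being applied independently.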

Specifically, in S133, a corresponding preset rule may be set according to the scores of the various samples, and weights are assigned to the potential positive samples and pure negative samples obtained from the unlabel samples, as well as to the known positive samples and known negative samples. Since the known positive samples and known negative samples have been confirmed in advance, the weight of every known positive sample is 1 and the weight of every known negative sample is 0. The weight of a potential positive sample is calculated as follows:

For pure negative samples, the smaller the score, the higher the weight. The weight is calculated as follows:

S140: establishing an effective classification model using a preset classifier according to the weight value of each sample in the model training data set.

Specifically, many kinds of classifiers are available, such as linear regression, SVM (support vector machine), decision tree (DT) and CatBoost. This proposal selects the CatBoost classifier, which is relatively well suited to picture classification, as the effective classification model, and trains it using the potential positive samples, pure negative samples, known positive samples and known negative samples as sampled training samples, thereby establishing the effective classification model.

More specifically, step S140 may include the following steps:

Samples are drawn from the training data set according to their weight values to serve as training samples. The sampling probability of a potential positive sample or pure negative sample is set to its weight value, the sampling probability of a known positive sample is set to 1, and the sampling probability of a known negative sample defaults to 0.

A CatBoost model is trained using the drawn training samples, where each training sample consists of features and a label.

Steps S131 and S132 are iterated, and the iteration stops after n_iter rounds, where the default value of n_iter is 30.

Model parameter tuning is performed on the parameters that influence the model result, such as iterations, depth and scale_pos_weight, so that the model achieves its best effect.
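The weight-based sampling in the first of the steps above can be sketched as an independent draw per sample, where each sample is kept with a probability equal to its weight. The function name and the tuple layout are hypothetical, and the probabilities follow the text exactly as stated, including 0 for known negative samples.

```python
import random


def sample_training_set(samples, seed=0):
    """Draw training samples with probability equal to their weight.

    `samples` is a list of (features, label, weight) tuples; the result is a
    list of (features, label) pairs for the samples that were drawn.
    """
    rng = random.Random(seed)
    # rng.random() lies in [0, 1), so weight 1 always keeps the sample
    # and weight 0 never keeps it.
    return [(f, y) for f, y, w in samples if rng.random() < w]


# Illustrative data only; labels and weights are not from the source.
samples = [
    ("known_positive", 1, 1.0),
    ("known_negative", 0, 0.0),
    ("potential_positive", 1, 0.6),
    ("pure_negative", 0, 0.8),
]
training_samples = sample_training_set(samples)
```

Because the draw is repeated in each of the n_iter rounds, a potential positive or pure negative sample with a higher weight ends up in more of the training sets than one with a lower weight.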

It should be noted that the CatBoost model is an existing, commonly used classification model, and the innovation of the invention lies mainly in preprocessing the training data set so as to significantly improve the accuracy of the final classification model. The specific training process and parameter tuning process of the CatBoost model are therefore not described in detail.

S150: classifying the pictures to be classified according to the effective classification model, so as to accurately judge whether the automobiles in the pictures to be classified have violated regulations.

It should be emphasized that the cluster-based picture classification method provided by the invention can be used not only for classifying electronic traffic-violation photos but also in other suitable scenarios that require picture classification, such as classifying pictures or videos used to judge whether athletes have committed violations in sports (gymnastics, diving, long jump, etc.), daily safety monitoring, workplace monitoring and farm monitoring.

In addition, the invention also provides a picture classification system based on clustering, which comprises:

The sample picture acquisition unit is used for acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency generated by the sample pictures;

The preprocessing unit is used for dividing the positive samples in the model training data set into a plurality of families according to different scenes, acquiring potential positive samples and pure negative samples from the unlabel samples in the model training data set based on a random forest algorithm according to the divided families of positive samples, and then assigning weights to the positive samples, negative samples, potential positive samples and pure negative samples in the model training data set; the potential positive samples have a high probability of being positive samples, and the pure negative samples have a high probability of being negative samples;

In other embodiments, the cluster-based picture classification program 73 may also be divided into one or more modules, which are stored in the memory 72 and executed by the processor 71 to implement the invention. A module as referred to in the invention is a series of computer program instruction segments capable of performing a specified function. FIG. 4 is an internal logic diagram of the cluster-based picture classification program according to an embodiment of the invention and shows the program modules of a preferred embodiment of the cluster-based picture classification program 73 of FIG. 1. As shown in FIG. 4, the cluster-based picture classification program 73 may be divided into: a sample picture acquisition module 74, a training data set creation module 75, a preprocessing module 76, and an effective classification model creation and application module 77. The functions or operational steps performed by modules 74-77 are similar to those described above and are not described in detail here; for example:

The sample picture acquisition module 74 acquires all sample pictures within a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency of the sample picture generation.

The training data set creating module 75 acquires a positive sample, a negative sample and an unlabel sample from the sample database, and creates a model training data set according to the acquired positive sample, negative sample and unlabel sample;

An unlabel sample is a sample picture that cannot be determined to be either a positive sample or a negative sample.

The preprocessing module 76 preprocesses the model training data set, and obtains the weight value of each sample in the model training data set through preprocessing.

The effective classification model creation and application module 77 establishes an effective classification model using a preset classifier according to the weight value of each sample in the model training data set, and classifies the pictures to be classified according to the effective classification model.

In addition, the invention also provides a clustering-based picture classification method. The method may be performed by an apparatus, which may be implemented in software and/or hardware, the method comprising:

S110: acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency generated by the sample pictures, and the total number of the sample pictures in the preset time period is at least 10000;

S130: preprocessing a model training data set, and acquiring weight values of all samples in the model training data set through preprocessing;

S140: according to the weight value of each sample in the model training data set, an effective classification model is established by adopting a preset classifier;

S150: classifying the pictures to be classified according to the effective classification model.

In addition, the embodiment of the invention also provides a computer readable storage medium, wherein a picture classification program based on clustering is stored in the computer readable storage medium, and the picture classification program based on clustering realizes the following operations when being executed by a processor:

S110: acquiring all sample pictures in a preset time period to establish a sample database, wherein the preset time period is determined according to the real-time efficiency generated by the sample pictures;

The specific implementation manner of the computer readable storage medium provided by the invention is substantially the same as that of the cluster-based picture classification method and the electronic device described above, and is not repeated here.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server or a network device, etc.) to perform the method of the various embodiments of the present invention.

The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims

1. A cluster-based picture classification method applied to an electronic device, the method comprising:

classifying the pictures to be classified according to the effective classification model; wherein,

the process of dividing positive samples within the model training set into at least a plurality of families according to different scenarios includes:

performing clustering processing on the positive samples by using a k-means algorithm to divide the positive samples in the model training set into at least a plurality of groups;

the process of clustering the positive samples using the k-means algorithm includes:

wherein the formula for calculating the distance is as follows:

wherein x_i is a positive sample, x_u is a cluster center, d is the dimension of the positive sample, j ∈ d, x_ij is the j-th component of the positive sample, and x_uj is the j-th component of the cluster center;

after assigning each positive sample to its nearest cluster center, the method further comprises: when each positive sample is allocated to a corresponding cluster center, the cluster center of the cluster is recalculated according to the existing positive samples in the cluster, wherein the calculation formula is as follows:

wherein u_i represents the center point of the i-th cluster, c_i represents the i-th cluster, and x represents a sample belonging to the cluster;

until the loss function of the set of positive samples reaches a minimum;

wherein, the expression formula of the loss function is as follows:

wherein u_i represents the center point of the i-th cluster, and c_i represents the i-th cluster.

2. The method of cluster-based picture classification according to claim 1, wherein the process of obtaining positive, negative and unlabel samples from the sample database comprises:

3. The cluster-based picture classification method of claim 2 wherein the process of obtaining potential positive and pure negative samples from unlabel samples in the model training dataset based on a random forest algorithm based on the divided positive sample families comprises:

4. A cluster-based picture classification system, the picture classification system comprising:

the preprocessing unit is used for dividing positive samples in the model training set into at least a plurality of groups according to different scenes, acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the groups of the divided positive samples, and then carrying out weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set; wherein the potential positive samples have a greater probability of being positive samples and the pure negative samples have a greater probability of being negative samples; the process of dividing positive samples within the model training set into at least a plurality of families according to different scenarios includes:

the formula for calculating the distance is as follows:

until the loss function of the set of positive samples reaches a minimum;

wherein, the expression formula of the loss function is as follows:

wherein u_i represents the center point of the i-th cluster, and c_i represents the i-th cluster;

5. An electronic device, comprising: a memory, a processor, and a cluster-based picture classification program stored in the memory and executable on the processor, the cluster-based picture classification program, when executed by the processor, performing the steps of:

Dividing positive samples in the model training set into at least a plurality of groups according to different scenes, acquiring potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the divided groups of the positive samples, and then carrying out weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set; wherein the potential positive samples have a greater probability of being positive samples and the pure negative samples have a greater probability of being negative samples; wherein the process of dividing positive samples in the model training set into at least a plurality of families according to different scenes comprises:

the formula for calculating the distance is as follows:

until the loss function of the set of positive samples reaches a minimum;

wherein, the expression formula of the loss function is as follows:

6. The electronic device of claim 5, wherein preprocessing the model training dataset comprises:

dividing positive samples in the model training set into at least 20 groups according to different violation scenes;

obtaining potential positive samples and pure negative samples in unlabel samples in the model training data set based on a random forest algorithm according to the divided positive sample families, wherein the potential positive samples have a high probability of being positive samples, and the pure negative samples have a high probability of being negative samples;

and performing weight distribution on the positive samples, the negative samples, the potential positive samples and the pure negative samples in the model training data set.

7. A computer readable storage medium, wherein a cluster-based picture classification program is provided in the computer readable storage medium, which when executed by a processor, implements the steps of the cluster-based picture classification method according to any of claims 1 to 3.