CN111310777A

CN111310777A - Method and system for acquiring target category number in K-means algorithm

Info

Publication number: CN111310777A
Application number: CN201911195214.8A
Authority: CN
Inventors: 李健; 郑为民; 王帅; 罗羿
Original assignee: Fujian University of Technology
Current assignee: Fujian University of Technology
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2020-06-19

Abstract

The application provides a method and a system for acquiring the number of target categories in a K-means algorithm, and relates to the field of algorithms. The method comprises the following steps: obtaining the distances between a first cluster center and the other samples in a data set; determining a first sample whose distance is smaller than a preset distance threshold as a first target class; determining a sample whose distance is greater than the distance threshold as a second cluster center; and obtaining a second sample whose distance from the second cluster center is smaller than the distance threshold and which does not belong to the first target class, and determining the second sample as a second target class. The application divides the training set into a plurality of small training sets, removes redundant data from the original training set, and improves training efficiency and training precision.

Description

Method and system for acquiring target category number in K-means algorithm

Technical Field

The application belongs to the field of algorithms, and particularly relates to a method and a system for acquiring target category number in a K-means algorithm.

Background

The K-means algorithm is a classical clustering algorithm that can effectively cluster large-scale data. However, the traditional K-means algorithm requires the number of target categories to be preset, and this value is mostly set based on experience. In addition, data sets trained with the K-means algorithm in the prior art contain a large amount of redundant data, so an appropriate value for the number of target categories is difficult to obtain.

Disclosure of Invention

The embodiment of the invention mainly aims to provide a method and a system for acquiring the number of target categories in a K-means algorithm.

In a first aspect, a method for obtaining the number of target categories in a K-means algorithm is provided, which includes:

obtaining the distances between a first cluster center and the other samples in the data set;

determining a first sample whose distance is smaller than a preset distance threshold as a first target class;

determining a sample whose distance is greater than the distance threshold as a second cluster center;

obtaining a second sample whose distance from the second cluster center is smaller than the distance threshold and which does not belong to the first target class, and determining the second sample as a second target class.

In one possible implementation, the first cluster center point is designated.

In another possible implementation, the first cluster center point is randomly selected from a training set.

In yet another possible implementation, the distance threshold is 0.5.

In a second aspect, a system for obtaining the number of target classes in a K-means algorithm is provided, which includes:

a distance acquisition module, configured to acquire the distances between a first cluster center and the other samples in the data set;

a first target class determination module, configured to determine a first sample whose distance is smaller than a preset distance threshold as a first target class;

a second cluster center determination module, configured to determine a sample whose distance is greater than the distance threshold as a second cluster center;

and a second target class determination module, configured to acquire a second sample whose distance from the second cluster center is smaller than the distance threshold and which does not belong to the first target class, and to determine the second sample as a second target class.

In yet another possible implementation, the first cluster center point is randomly selected from a training set.

In yet another possible implementation, the distance threshold is 0.5.

The beneficial effects of the technical solution provided by this application are: the training set is divided into a plurality of small training sets, redundant data in the original training set is removed, and training efficiency and training precision are improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.

FIG. 1 is a flowchart of a method for obtaining a target class number in a K-means algorithm according to an embodiment of the present invention;

FIG. 2 is a block diagram of a system for obtaining a number of target classes in a K-means algorithm according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of distance threshold experimental data provided in accordance with another embodiment of the present invention;

FIG. 4 is a schematic diagram of experimental data before and after clustering by a training set according to yet another embodiment of the present invention;

FIG. 5 is a schematic diagram of experimental data of different attack detection rates before and after clustering by a training set according to another embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar modules or modules having the same or similar functionality throughout. The embodiments described below with reference to the drawings are exemplary, serve only to explain the present application, and are not to be construed as limiting it.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, modules, components, and/or groups thereof. It will be understood that when a module is referred to as being "connected" or "coupled" to another module, it can be directly connected or coupled to the other module, or intervening modules may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.

To make the objects, technical solutions and advantages of the present application clearer, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

The technical solutions of the present application, and how they solve the above technical problems, are described in detail below with specific examples. The following specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.

Example one

Fig. 1 is a flowchart of a method for obtaining a target class number in a K-means algorithm according to an embodiment of the present invention, including:

Step S101, obtaining the distances between the first cluster center and the other samples in the data set.

In the embodiment of the present invention, the distance between samples is the criterion by which the clustering algorithm distinguishes categories. To obtain the number of target categories effectively, the distances between the first cluster center in the data set and the other samples must be obtained first. It should be noted that the first cluster center point may be designated or selected at random.

Step S102, determining a first sample whose distance is smaller than a preset distance threshold as a first target class.

In the embodiment of the present invention, the distance threshold is the maximum distance allowed between the cluster center and the sample point, the samples within the distance threshold can be determined as the first target class, and the samples exceeding the distance threshold are subjected to the subsequent target class determination.

Step S103, determining the sample with the distance larger than the distance threshold value as a second cluster center.

Step S104, a second sample which is less than the distance threshold value from the second cluster center and does not belong to the first target class is obtained, and the second sample is determined as a second target class.

In the embodiment of the present invention, the above steps are repeated for the determined second cluster center: its distances to the other samples are obtained, and a sample whose distance is smaller than the distance threshold and which does not belong to the first target class is assigned to the second target class. Samples whose distance is greater than the distance threshold are handled in the same way to determine a third cluster center, a fourth cluster center, and so on, together with the corresponding third target class, fourth target class, and so on.
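The center-by-center procedure of steps S101 to S104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `sweep_target_classes`, the use of Euclidean distance (`math.dist`), the choice of the first remaining sample as the next cluster center, and the treatment of a distance exactly equal to the threshold as lying outside the class are not specified in the source.

```python
import math

def sweep_target_classes(samples, threshold, first_center_index=0):
    """Return a list of (center, members) pairs, one per target class."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    unassigned = list(range(len(samples)))
    classes = []
    center_index = first_center_index
    while unassigned:
        center = samples[center_index]
        # S101/S102 (and S104 for later centers): collect samples that are
        # closer than the threshold and not already claimed by an earlier class.
        members = [i for i in unassigned
                   if math.dist(samples[i], center) < threshold]
        classes.append((center, [samples[i] for i in members]))
        claimed = set(members)
        unassigned = [i for i in unassigned if i not in claimed]
        if unassigned:
            # S103: a sample beyond the threshold becomes the next center.
            # Assumption: take the first remaining sample.
            center_index = unassigned[0]
    return classes

data = [(0.0, 0.0), (0.3, 0.0), (1.0, 0.0), (1.2, 0.0), (3.0, 0.0)]
classes = sweep_target_classes(data, threshold=0.5)
print(len(classes))  # number of target categories found: 3
```

The number of target categories is simply the number of centers created before every sample has been assigned.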

It should be noted that, for the K-means algorithm, the smaller the distance threshold, the larger the number of target categories obtained. The distance threshold can therefore be set according to actual requirements, and its value is not limited in the present application. Preferably, the distance threshold is 0.5.
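The effect of the threshold on the number of categories can be illustrated with a small Python sketch. The helper `count_categories`, the one-dimensional toy data, the Euclidean distance, and the tested threshold values are assumptions chosen for illustration; only the threshold 0.5 appears in the source.

```python
import math

def count_categories(samples, threshold):
    """Count the groups produced by one pass over the samples."""
    centers = []
    for s in samples:
        # A sample joins an existing group if any center is closer than the
        # threshold; otherwise it becomes a new cluster center.
        if not any(math.dist(s, c) < threshold for c in centers):
            centers.append(s)
    return len(centers)

points = [(0.0,), (0.1,), (0.2,), (0.6,), (0.7,), (1.5,)]
counts = {t: count_categories(points, t) for t in (0.05, 0.15, 0.5, 2.0)}
print(counts)  # {0.05: 6, 0.15: 4, 0.5: 3, 2.0: 1}
```

On this toy data the count falls from 6 to 1 as the threshold grows, which matches the trend described above; the exact counts depend on the data and on the order in which samples are read.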

According to the embodiment of the invention, the distances between a cluster center and the other samples are obtained, the samples whose distance is smaller than the distance threshold are assigned to that center's target class, and a sample whose distance is greater than the distance threshold is set as a new cluster center. These steps are repeated until all samples have been traversed and all target classes are determined. The training set is thereby divided into a plurality of small training sets, redundant data in the original training set is removed, and training efficiency and training precision are improved.

For example:

The flow of the K-means algorithm itself is as follows:

(1) set a distance threshold φ;

(2) randomly select a sample from the data set as the initial cluster center C1, and set the parameter k = 1;

(3) read the next sample point S;

(4) if there exists a Cj in {Ci}, 1 ≤ i ≤ k, such that the distance between S and Cj is less than φ, assign S to point group j and jump to step (6);

(5) create a new point group with S as its cluster center, and add 1 to k;

(6) repeat steps (3) to (5) until all samples in the data set have been read.
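The numbered flow above can be written as a short Python sketch. It is a minimal illustration under assumptions that the source does not state: the function name `single_pass_cluster`, the Euclidean distance (`math.dist`), using the first sample in place of a randomly selected initial center so that the example is reproducible, and assigning a sample to the first center found within the threshold rather than the nearest one.

```python
import math

def single_pass_cluster(samples, phi):
    """One pass over `samples`; returns (centers, groups)."""
    if not samples:
        return [], []
    # Step (2): the first sample stands in for the randomly chosen center C1.
    centers = [samples[0]]
    groups = [[samples[0]]]
    # Steps (3) and (6): read the remaining samples one by one.
    for s in samples[1:]:
        # Step (4): look for an existing center closer than phi.
        for j, c in enumerate(centers):
            if math.dist(s, c) < phi:
                groups[j].append(s)
                break
        else:
            # Step (5): no center is close enough, so s starts a new
            # point group and the category count k grows by one.
            centers.append(s)
            groups.append([s])
    return centers, groups

data = [(0.0, 0.0), (0.1, 0.2), (2.0, 2.0), (2.1, 1.9), (5.0, 5.0)]
centers, groups = single_pass_cluster(data, phi=0.5)
k = len(centers)
print(k)  # 3
```

The value `k` at the end of the pass is the number of target categories. Because the result depends on the order in which samples are read, shuffling the data may change both the centers and `k`.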

Example two

Fig. 2 is a structural diagram of a system for obtaining a target class number in a K-means algorithm according to an embodiment of the present invention, where the system includes:

a distance obtaining module 201, configured to obtain distances between a first cluster center in the data set and other samples.

A first target class determination module 202, configured to determine a first sample whose distance is smaller than a preset distance threshold as a first target class.

A second cluster center determining module 203, configured to determine a sample with the distance greater than the distance threshold as a second cluster center.

A second target class determination module 204, configured to obtain a second sample that is less than the distance threshold from the second cluster center and does not belong to the first target class, and determine the second sample as a second target class.
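One way the four modules might be composed is sketched below in Python. The class `TargetClassSystem`, its method names, the Euclidean distance, and the rule of taking the first sample beyond the threshold as the second cluster center are all hypothetical and serve only to show how modules 201 to 204 could pass data to one another.

```python
import math

class TargetClassSystem:
    """Hypothetical composition of modules 201 to 204 (names are illustrative)."""

    def __init__(self, samples, threshold):
        self.samples = samples
        self.threshold = threshold

    def distances_from(self, center):
        # Module 201: distances between a cluster center and every sample.
        return [math.dist(center, s) for s in self.samples]

    def first_target_class(self, first_center):
        # Module 202: indices of samples closer than the threshold.
        d = self.distances_from(first_center)
        return [i for i, x in enumerate(d) if x < self.threshold]

    def second_center(self, first_center):
        # Module 203: a sample farther than the threshold becomes the
        # second center. Assumption: take the first such sample.
        d = self.distances_from(first_center)
        outside = [i for i, x in enumerate(d) if x > self.threshold]
        return self.samples[outside[0]] if outside else None

    def second_target_class(self, second_center, first_class):
        # Module 204: samples near the second center that are not
        # already in the first target class.
        d = self.distances_from(second_center)
        taken = set(first_class)
        return [i for i, x in enumerate(d)
                if x < self.threshold and i not in taken]

system = TargetClassSystem([(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9)], 0.5)
first = system.first_target_class((0.0, 0.0))
c2 = system.second_center((0.0, 0.0))
second = system.second_target_class(c2, first)
print(first, c2, second)  # [0, 1] (1.0, 1.0) [2, 3]
```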

Fig. 3 is a schematic diagram of distance threshold experimental data according to another embodiment of the present invention.

Fig. 4 is a schematic diagram of experimental data before and after clustering in a training set according to another embodiment of the present invention.

Fig. 5 is a schematic diagram of experimental data of different attack detection rates before and after clustering by the training set according to another embodiment of the present invention.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for obtaining the number of target categories in a K-means algorithm is characterized by comprising the following steps:

2. The method of claim 1, wherein the first cluster center point is designated.

3. The method of claim 1, wherein the first cluster center point is randomly selected from a training set.

4. The method of claim 1, wherein the distance threshold is 0.5.

5. A system for obtaining the number of target categories in a K-means algorithm, characterized by comprising:

6. The system of claim 5, wherein the first cluster center point is designated.

7. The system of claim 5, wherein the first cluster center point is randomly selected from a training set.

8. The system of claim 5, wherein the distance threshold is 0.5.