CN111310777A - Method and system for acquiring target category number in K-means algorithm - Google Patents

Publication number
CN111310777A
Authority
CN
China
Prior art keywords
distance
distance threshold
target class
sample
cluster center
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911195214.8A
Other languages
Chinese (zh)
Inventor
李健
郑为民
王帅
罗羿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian University of Technology
Original Assignee
Fujian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian University of Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a method and a system for obtaining the number of target categories in a K-means algorithm, and relates to the field of algorithms. The method comprises: obtaining the distances between a first cluster center and the other samples in a data set; determining first samples whose distance is smaller than a preset distance threshold as a first target class; determining a sample whose distance is greater than the distance threshold as a second cluster center; and obtaining second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and determining them as a second target class. The application divides the training set into a plurality of smaller training sets, removes redundant data from the original training set, and improves training efficiency and training accuracy.

Description

Method and system for acquiring target category number in K-means algorithm
Technical Field
The application belongs to the field of algorithms, and particularly relates to a method and a system for acquiring target category number in a K-means algorithm.
Background
The K-means algorithm is a classical clustering algorithm that can cluster large-scale data effectively. However, the traditional K-means algorithm requires the number of target categories to be set in advance, and this value is mostly chosen from experience. In addition, the data sets used for training with the K-means algorithm in the prior art contain a large amount of redundant data, so an appropriate value for the number of target categories is difficult to obtain.
Disclosure of Invention
The main purpose of the embodiments of the present invention is to provide a method and a system for obtaining the number of target categories in a K-means algorithm.
In a first aspect, a method for obtaining the number of target categories in a K-means algorithm is provided, which includes:
obtaining the distances between a first cluster center and the other samples in the data set;
determining first samples whose distance is smaller than a preset distance threshold as a first target class;
determining a sample whose distance is greater than the distance threshold as a second cluster center;
obtaining second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and determining the second samples as a second target class.
In one possible implementation, the first cluster center is a designated sample.
In another possible implementation, the first cluster center is randomly selected from a training set.
In yet another possible implementation, the distance threshold is 0.5.
In a second aspect, a system for obtaining the number of target categories in a K-means algorithm is provided, which includes:
a distance obtaining module, configured to obtain the distances between a first cluster center and the other samples in the data set;
a first target class determination module, configured to determine first samples whose distance is smaller than a preset distance threshold as a first target class;
a second cluster center determining module, configured to determine a sample whose distance is greater than the distance threshold as a second cluster center;
and a second target class determination module, configured to obtain second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and to determine the second samples as a second target class.
In one possible implementation, the first cluster center is a designated sample.
In another possible implementation, the first cluster center is randomly selected from a training set.
In yet another possible implementation, the distance threshold is 0.5.
The technical solution provided by the present application has the following beneficial effects: the training set is divided into a plurality of smaller training sets, redundant data in the original training set is removed, and training efficiency and training accuracy are improved.
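As a rough illustration of dividing a training set into smaller training sets, the sketch below groups samples by a per-sample target class label. The function name `split_by_label` and the label list are hypothetical. The source does not describe how redundant data is removed, so the sketch stops at the split.

```python
from collections import defaultdict

def split_by_label(samples, labels):
    """Group samples into one sub-training-set per target class label.

    Assumes `labels[i]` is the target class already assigned to `samples[i]`
    (hypothetical input; the grouping procedure itself is not shown here).
    """
    subsets = defaultdict(list)
    for sample, label in zip(samples, labels):
        subsets[label].append(sample)
    # Return the subsets ordered by label so the result is deterministic.
    return [subsets[label] for label in sorted(subsets)]

small_sets = split_by_label(["a", "b", "c", "d"], [0, 1, 0, 2])
print(small_sets)
```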
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a flowchart of a method for obtaining a target class number in a K-means algorithm according to an embodiment of the present invention;
FIG. 2 is a block diagram of a system for obtaining a number of target classes in a K-means algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of distance threshold experimental data provided in accordance with another embodiment of the present invention;
FIG. 4 is a schematic diagram of experimental data before and after clustering of the training set according to yet another embodiment of the present invention;
FIG. 5 is a schematic diagram of experimental data on the detection rates of different attacks before and after clustering of the training set according to another embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar modules or modules having the same or similar functionality throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, modules, components, and/or groups thereof. It will be understood that when a module is referred to as being "connected" or "coupled" to another module, it can be directly connected or coupled to the other module, or intervening modules may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any one of, and all combinations of, one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The technical solutions of the present application, and how they solve the above technical problems, are described in detail below with specific examples. The following specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Example one
FIG. 1 is a flowchart of a method for obtaining the number of target categories in a K-means algorithm according to an embodiment of the present invention. The method includes:
Step S101, obtaining the distances between the first cluster center and the other samples in the data set.
In the embodiment of the present invention, the distance between samples is the criterion for distinguishing categories in the clustering algorithm. To obtain the number of target categories effectively, the distances between the first cluster center in the data set and the other samples are obtained first. It should be noted that the first cluster center may be designated or selected at random.
Step S102, determining first samples whose distance is smaller than a preset distance threshold as a first target class.
In the embodiment of the present invention, the distance threshold is the maximum distance allowed between a cluster center and a sample point. Samples within the distance threshold are determined as the first target class, and samples beyond the distance threshold proceed to the subsequent target class determination.
Step S103, determining a sample whose distance is greater than the distance threshold as a second cluster center.
Step S104, obtaining second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and determining the second samples as a second target class.
In the embodiment of the present invention, the distances between the determined second cluster center and the other samples are obtained according to the above steps. If the distance is smaller than the distance threshold and the sample does not belong to the first target class, the sample is assigned to the second target class. If the distance is greater than the distance threshold, a third cluster center, a fourth cluster center, and so on are determined according to the above steps, so as to determine the corresponding third target class, fourth target class, and so on.
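Steps S101 to S104 can be sketched for the first two target classes as follows. This is a minimal illustration, not the applicant's implementation: the function name, the use of Euclidean distance (`math.dist`), and the choice of the first out-of-threshold sample in data order as the second cluster center are all assumptions, since the source does not fix them.

```python
import math

def first_two_target_classes(samples, first_center_idx=0, threshold=0.5):
    """Return (first_class, second_center_idx, second_class) as index lists."""
    # S101: distance from the first cluster center to every sample.
    center1 = samples[first_center_idx]
    d1 = [math.dist(center1, s) for s in samples]
    # S102: samples closer than the threshold form the first target class.
    first_class = [i for i, d in enumerate(d1) if d < threshold]
    # S103: a sample farther than the threshold becomes the second center
    # (assumption: take the first such sample in data order).
    far = [i for i, d in enumerate(d1) if d > threshold]
    if not far:
        return first_class, None, []
    second_center_idx = far[0]
    center2 = samples[second_center_idx]
    # S104: samples near the second center that are not in the first class.
    taken = set(first_class)
    second_class = [
        i for i, s in enumerate(samples)
        if i not in taken and math.dist(center2, s) < threshold
    ]
    return first_class, second_center_idx, second_class

data = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (1.1, 0.9), (0.3, 0.3)]
print(first_two_target_classes(data))  # ([0, 1, 4], 2, [2, 3])
```

Repeating S103 and S104 on the samples that remain unassigned would yield the third and later target classes.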
It should be noted that the smaller the distance threshold, the larger the number of target categories obtained. The distance threshold can therefore be set according to actual requirements, and its value is not limited in the present application. Preferably, the distance threshold is 0.5.
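The relationship between the distance threshold and the number of target categories can be checked on a toy data set. The sketch below is illustrative only: the function name, the sample points, and the Euclidean distance are assumptions, and the counts hold for this particular data and read order.

```python
import math

def count_target_classes(samples, threshold):
    """Count the groups produced by one pass over the samples."""
    centers = []
    for s in samples:
        # A sample starts a new group only if no existing center is
        # closer than the threshold.
        if not any(math.dist(c, s) < threshold for c in centers):
            centers.append(s)
    return len(centers)

points = [(0.0, 0.0), (0.2, 0.1), (0.6, 0.6), (0.7, 0.5), (1.0, 1.0)]
for phi in (0.2, 0.5, 1.0):
    print(phi, count_target_classes(points, phi))
# On this data: 0.2 -> 4 groups, 0.5 -> 3 groups, 1.0 -> 2 groups
```

A fixed value such as 0.5 is only meaningful relative to the scale of the features, so in practice the features would presumably need to be on a comparable scale; the source does not state how the data is preprocessed.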
In the embodiment of the present invention, the distances between a cluster center and the other samples are obtained, the samples whose distance is smaller than the distance threshold are assigned to the target class of that center, and a sample whose distance is greater than the distance threshold is set as a new cluster center. These steps are repeated until all samples have been traversed and all target classes have been determined. The training set is thereby divided into a plurality of smaller training sets, redundant data in the original training set is removed, and training efficiency and training accuracy are improved.
For example:
The flow of the algorithm itself is as follows:
(1) set a distance threshold φ;
(2) randomly select a sample from the data set as the initial cluster center C1, and set the parameter k = 1;
(3) read the next sample point S;
(4) if there exists a Cj ∈ {Ci | 1 ≤ i ≤ k} such that the distance between S and Cj is less than φ, assign S to point group j and jump to step (6);
(5) otherwise, create a new point group with S as its cluster center, and increase k by 1;
(6) repeat steps (3) to (5) until all samples in the data set have been read.
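Steps (1) to (6) can be sketched in Python as below. The function and parameter names are hypothetical, Euclidean distance is assumed, and a sample is assigned to the first center found within φ rather than the nearest one, which is one reading of step (4). Passing `first_idx=random.randrange(len(samples))` would give the random initial center of step (2).

```python
import math

def threshold_clustering(samples, phi, first_idx=0):
    """Single pass over the data; returns (center_indices, labels)."""
    if not samples:
        return [], []
    # Step (2): the sample at first_idx acts as the initial center C1.
    order = [first_idx] + [i for i in range(len(samples)) if i != first_idx]
    centers = []                      # indices of the samples used as C1..Ck
    labels = [None] * len(samples)    # point group of each sample
    for idx in order:                 # step (3): read the next sample S
        s = samples[idx]
        for j, c in enumerate(centers):
            # Step (4): S joins group j if it lies within phi of Cj.
            if math.dist(samples[c], s) < phi:
                labels[idx] = j
                break
        else:
            # Step (5): no center is close enough, so S starts a new group.
            centers.append(idx)
            labels[idx] = len(centers) - 1
    return centers, labels            # k is len(centers)

samples = [(0.0, 0.0), (0.1, 0.1), (2.0, 2.0), (2.1, 1.9), (0.2, 0.0), (5.0, 5.0)]
centers, labels = threshold_clustering(samples, phi=0.5)
print(len(centers), centers, labels)  # 3 [0, 2, 5] [0, 0, 1, 1, 0, 2]
```

The resulting `len(centers)` is the number of target categories that could then be supplied as k to a K-means run.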
Example two
FIG. 2 is a structural diagram of a system for obtaining the number of target categories in a K-means algorithm according to an embodiment of the present invention. The system includes:
A distance obtaining module 201, configured to obtain the distances between a first cluster center in the data set and the other samples.
In the embodiment of the present invention, the distance between samples is the criterion for distinguishing categories in the clustering algorithm. To obtain the number of target categories effectively, the distances between the first cluster center in the data set and the other samples are obtained first. It should be noted that the first cluster center may be designated or selected at random.
A first target class determination module 202, configured to determine first samples whose distance is smaller than a preset distance threshold as a first target class.
In the embodiment of the present invention, the distance threshold is the maximum distance allowed between a cluster center and a sample point. Samples within the distance threshold are determined as the first target class, and samples beyond the distance threshold proceed to the subsequent target class determination.
A second cluster center determining module 203, configured to determine a sample whose distance is greater than the distance threshold as a second cluster center.
A second target class determination module 204, configured to obtain second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and to determine the second samples as a second target class.
In the embodiment of the present invention, the distances between the determined second cluster center and the other samples are obtained according to the above steps. If the distance is smaller than the distance threshold and the sample does not belong to the first target class, the sample is assigned to the second target class. If the distance is greater than the distance threshold, a third cluster center, a fourth cluster center, and so on are determined according to the above steps, so as to determine the corresponding third target class, fourth target class, and so on.
It should be noted that the smaller the distance threshold, the larger the number of target categories obtained. The distance threshold can therefore be set according to actual requirements, and its value is not limited in the present application. Preferably, the distance threshold is 0.5.
In the embodiment of the present invention, the distances between a cluster center and the other samples are obtained, the samples whose distance is smaller than the distance threshold are assigned to the target class of that center, and a sample whose distance is greater than the distance threshold is set as a new cluster center. These steps are repeated until all samples have been traversed and all target classes have been determined. The training set is thereby divided into a plurality of smaller training sets, redundant data in the original training set is removed, and training efficiency and training accuracy are improved.
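One way the four modules might be arranged in software is sketched below. The class name, method names, and control loop are hypothetical and only mirror the roles of modules 201 to 204; Euclidean distance is assumed, and the next cluster center is taken to be the first sample not yet assigned to any target class.

```python
import math

class TargetClassSystem:
    """Illustrative arrangement of modules 201-204 (names are assumptions)."""

    def __init__(self, threshold=0.5):
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.threshold = threshold

    def distances(self, center, samples):
        # Module 201: distance from a cluster center to every sample.
        return [math.dist(center, s) for s in samples]

    def target_class(self, dists, unassigned):
        # Modules 202 and 204: unassigned samples within the threshold.
        return [i for i in unassigned if dists[i] < self.threshold]

    def next_center(self, unassigned):
        # Module 203: any remaining sample lies outside every earlier class,
        # so the first one is taken as the next center.
        return unassigned[0] if unassigned else None

    def run(self, samples, first_center_idx=0):
        unassigned = list(range(len(samples)))
        classes = []
        center_idx = first_center_idx if samples else None
        while center_idx is not None:
            dists = self.distances(samples[center_idx], samples)
            members = self.target_class(dists, unassigned)
            classes.append(members)
            taken = set(members)
            unassigned = [i for i in unassigned if i not in taken]
            center_idx = self.next_center(unassigned)
        return classes  # len(classes) is the number of target categories

samples = [(0.0, 0.0), (0.1, 0.1), (2.0, 2.0), (2.1, 1.9), (0.2, 0.0), (5.0, 5.0)]
print(TargetClassSystem(threshold=0.5).run(samples))  # [[0, 1, 4], [2, 3], [5]]
```

Each center always claims itself, because its distance to itself is zero, so the set of unassigned samples shrinks on every pass and the loop terminates.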
Fig. 3 is a schematic diagram of distance threshold experimental data according to another embodiment of the present invention.
Fig. 4 is a schematic diagram of experimental data before and after clustering of the training set according to another embodiment of the present invention.
Fig. 5 is a schematic diagram of experimental data on the detection rates of different attacks before and after clustering of the training set according to another embodiment of the present invention.
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time or in sequence; they may be performed at different times, and in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
The foregoing describes only some of the embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for obtaining the number of target categories in a K-means algorithm, characterized by comprising the following steps:
obtaining the distances between a first cluster center and the other samples in the data set;
determining first samples whose distance is smaller than a preset distance threshold as a first target class;
determining a sample whose distance is greater than the distance threshold as a second cluster center;
obtaining second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and determining the second samples as a second target class.
2. The method of claim 1, wherein the first cluster center is a designated sample.
3. The method of claim 1, wherein the first cluster center is randomly selected from a training set.
4. The method of claim 1, wherein the distance threshold is 0.5.
5. A system for obtaining the number of target categories in a K-means algorithm, characterized by comprising:
a distance obtaining module, configured to obtain the distances between a first cluster center and the other samples in the data set;
a first target class determination module, configured to determine first samples whose distance is smaller than a preset distance threshold as a first target class;
a second cluster center determining module, configured to determine a sample whose distance is greater than the distance threshold as a second cluster center;
and a second target class determination module, configured to obtain second samples whose distance from the second cluster center is smaller than the distance threshold and which do not belong to the first target class, and to determine the second samples as a second target class.
6. The system of claim 5, wherein the first cluster center is a designated sample.
7. The system of claim 5, wherein the first cluster center is randomly selected from a training set.
8. The system of claim 5, wherein the distance threshold is 0.5.
CN201911195214.8A 2019-11-28 2019-11-28 Method and system for acquiring target category number in K-means algorithm Pending CN111310777A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911195214.8A CN111310777A (en) 2019-11-28 2019-11-28 Method and system for acquiring target category number in K-means algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911195214.8A CN111310777A (en) 2019-11-28 2019-11-28 Method and system for acquiring target category number in K-means algorithm

Publications (1)

Publication Number Publication Date
CN111310777A (en) 2020-06-19

Family

ID=71150686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911195214.8A Pending CN111310777A (en) 2019-11-28 2019-11-28 Method and system for acquiring target category number in K-means algorithm

Country Status (1)

Country Link
CN (1) CN111310777A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522968A (en) * 2020-06-22 2020-08-11 中国银行股份有限公司 Knowledge graph fusion method and device
CN111522968B (en) * 2020-06-22 2023-09-08 中国银行股份有限公司 Knowledge graph fusion method and device

Similar Documents

Publication Publication Date Title
CN103218817B (en) The dividing method of plant organ point cloud and system
CN107909344B (en) Workflow log repeated task identification method based on relation matrix
CN107305577B (en) K-means-based appropriate address data processing method and system
CN109858476B (en) Tag expansion method and electronic equipment
CN105045927B (en) Construction project labor and materials machine data automatic coding and system
CN110692101A (en) Method for aligning targeted nucleic acid sequencing data
CN112116950B (en) Protein folding identification method based on depth measurement learning
CN110991527A (en) Similarity threshold determination method considering voltage curve average fluctuation rate
CN111310777A (en) Method and system for acquiring target category number in K-means algorithm
CN105224962B (en) A kind of similar vehicle license plate extraction method and device
CN103324888A (en) Method and system for automatically extracting virus characteristics based on family samples
CN112463564B (en) Method and device for determining associated index influencing host state
CN115865099B (en) Huffman coding-based multi-type data segment compression method and system
CN111740921A (en) Network traffic classification method and system based on improved K-means algorithm
CN115508615A (en) Load transient characteristic extraction method based on induction motor
CN112016466B (en) Face recognition method, face recognition system, electronic equipment and computer storage medium
CN112445835B (en) Business data processing method and device, network management server and storage medium
CN105868220B (en) Data processing method and device
CN109104494B (en) DNA-based children missing or losing positioning method and system under wireless sensor network
CN110309139B (en) High-dimensional neighbor pair searching method and system
CN112598041A (en) Power distribution network cloud platform data verification method based on K-MEANS algorithm
CN111681131A (en) Water resource management method and management system based on artificial intelligence
CN105608638A (en) Method and system for evaluating synchronous state of meter code data of intelligent terminal and electric energy meter
CN114735013B (en) Method and system for extracting vehicle speed curve of typical working condition of whole vehicle, vehicle and storage medium
CN110083864A (en) A kind of short-term wind speed forecasting method based on empirical mode decomposition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200619